stkb / Rewrap

Rewrap extension for VSCode and Visual Studio
https://marketplace.visualstudio.com/items/stkb.rewrap

Extension should never convert between spaces and tabs and always keep the exact line prefix #312

Open CodingMarkus opened 2 years ago

CodingMarkus commented 2 years ago

If I use spaces for indention, spaces are preserved. E.g. if I have this line:

#[space][space][space]Text Text Text...

The next line after a break will start with:

#[space][space][space]

This happens even though my entire document uses tabs for indention and the tab size is set to 4 spaces (not 3 as in the example above). I would still consider the behavior shown above to be correct.

If I use tabs for indention, tabs are preserved. E.g. if I have this line:

#[tab]Text Text Text...

The next line after a break will start with:

#[tab]

Again, this seems correct to me.

However, if I have the following comment:

#[space]HEADLINE
#[tab]Text Text Text ...
#[tab]Text Text Text ...

and keep typing beyond the wrap column, then the tabs are converted to spaces the moment the comment is re-wrapped, and all of a sudden I have:

#[space]HEADLINE
#[space][space][space]Text Text Text ...
#[space][space][space]Text Text Text ...
#[space][space][space]

IMHO the extension should never convert between tabs and spaces! If the current line is indented by tabs, new lines it generates should be indented by tabs as well. The fact that further above there is a line indented by a space should not have any relevance.

#[A]Text
#[B]Text
#[C]Text
#[D]Text

All text lines where the indention is exactly the same should be considered a block. So if B == C == D, then these three lines belong to the same block. But if A != B, then line A has nothing to do with the current block and is a block of its own, possibly belonging to a block of lines above A with the same indention.

Even mixing space and tabs should be okay, as in this case:

#[space]SAMPLE CODE
#[space]Some descriptive text:
#[space][tab]Code code code....

If the third line gets broken, the next line should start with #[space][tab] and nothing else IMHO.
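The grouping rule proposed above can be sketched in a few lines of Python. This is only an illustration of the request, not Rewrap's code; the names `PREFIX_RE`, `split_prefix` and `group_by_prefix` are hypothetical, and the sketch assumes `#` is the comment marker.

```python
import re
from itertools import groupby

# Hypothetical definition of "prefix": leading whitespace, the comment
# marker, and all whitespace after it, kept character for character.
PREFIX_RE = re.compile(r"^([ \t]*#[ \t]*)(.*)$")


def split_prefix(line):
    """Return (prefix, text) for a '#' comment line."""
    match = PREFIX_RE.match(line)
    if match is None:
        return "", line
    return match.group(1), match.group(2)


def group_by_prefix(lines):
    """Group consecutive lines whose prefixes are exactly equal."""
    pairs = [split_prefix(line) for line in lines]
    return [
        (prefix, [text for _, text in group])
        for prefix, group in groupby(pairs, key=lambda pair: pair[0])
    ]


comment = [
    "# SAMPLE CODE",
    "# Some descriptive text:",
    "# \tCode code code....",
]
blocks = group_by_prefix(comment)
# The first two lines share the prefix "# "; the third line forms its own
# block with the prefix "# \t", so a wrapped continuation would reuse "# \t".
```

Because the prefix is compared as a raw string, a space is never treated as equivalent to a tab, which is the behavior asked for here.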

stkb commented 2 years ago

Ok I can explain what's happening here.

A comment block is divided into two columns. The left column is what we'll call the "comment prefix" and the right is the content. For any new line generated, the comment prefix of the previous line is copied. The comment prefix consists of:

<possible whitespace> + <comment marker (#)> + <more optional whitespace>

The whitespace to the left of the comment marker is preserved exactly (tabs and/or spaces). What happens to the right of the comment marker is more complicated.

It looks at all the lines and finds the one where the text is the least indented. It then splits the block at this column, so that for all lines, everything on the left is "comment prefix" and everything on the right is "comment content", which can be processed with its own relative indents.

So for example this is why, when you have this: (--→ = tab)

#--→Text...
#--→Text...

It gets split like this

(split after 4th column)
prefix content
#--→  ¦  Text...
#--→  ¦  Text...

And then the tabs are preserved because they're counted as part of the prefix.

For your example with the headline this is what is happening:

#·HEADLINE
#--→Text Text Text ...
#--→Text Text Text ...
  1. The least indented line is the headline, where the content begins after the 2nd column (for the Text lines, content begins after the 4th column)
  2. All lines are split at this point (after the 2nd column). Because the Text lines have tabs here, the tab needs to be replaced with a space so the prefix has the right width (2 columns)
#·  ¦  HEADLINE
#·  ¦  -→Text Text Text ...
#·  ¦  -→Text Text Text ...

Now, for the content, all tabs are converted into spaces before it's processed as markdown*. I took this decision early on in the extension's life for two reasons. First, it's much more difficult (though not impossible) to take tabs into account everywhere rather than only spaces. Second, it was in line with the CommonMark spec at the time, which said that tabs should be converted to spaces (the only difference being that Rewrap converts them according to the user's tab width rather than the fixed 4-space stops in the spec).

#·  ¦  HEADLINE
#·  ¦  ··Text Text Text ...
#·  ¦  ··Text Text Text ...

This is why you're seeing the results you are.
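To make the steps above concrete, here is a rough Python sketch of the split as described in this comment. It is not Rewrap's actual implementation. All function names are hypothetical, and it assumes the comment marker sits at column 0 on every line and that at least one line contains text.

```python
def tab_width(col, tab_size):
    """Columns a tab occupies when it starts at visual column `col`."""
    return tab_size - col % tab_size


def content_start(rest, start_col, tab_size):
    """Visual column of the first non-whitespace character in `rest`."""
    col = start_col
    for ch in rest:
        if ch == "\t":
            col += tab_width(col, tab_size)
        elif ch == " ":
            col += 1
        else:
            break
    return col


def split_at_column(rest, start_col, split_col, tab_size):
    """Split `rest` at visual column `split_col`.

    Characters left of the split are kept verbatim; a tab that straddles
    the split is replaced by spaces on both sides of it.
    """
    col = start_col
    for i, ch in enumerate(rest):
        if col >= split_col:
            return rest[:i], rest[i:]
        width = tab_width(col, tab_size) if ch == "\t" else 1
        if col + width > split_col:
            left = " " * (split_col - col)
            right = " " * (col + width - split_col)
            return rest[:i] + left, right + rest[i + 1:]
        col += width
    return rest, ""


def expand_tabs(text, start_col, tab_size):
    """Replace every tab in `text` with spaces up to the next tab stop."""
    out, col = [], start_col
    for ch in text:
        width = tab_width(col, tab_size) if ch == "\t" else 1
        out.append(" " * width if ch == "\t" else ch)
        col += width
    return "".join(out)


def split_block(lines, tab_size=4, marker="#"):
    """Return (prefix, content) pairs, split at the least indented line."""
    start = len(marker)  # assumption: marker begins at column 0
    rests = [line[start:] for line in lines]
    split_col = min(content_start(r, start, tab_size) for r in rests if r.strip())
    rows = []
    for rest in rests:
        left, right = split_at_column(rest, start, split_col, tab_size)
        rows.append((marker + left, expand_tabs(right, split_col, tab_size)))
    return rows


headline = split_block(["# HEADLINE", "#\tText Text Text ...", "#\tText Text Text ..."])
tabs_only = split_block(["#\tText...", "#\tText..."])
```

With a tab size of 4, `headline` gives every line the prefix `"# "` and the Text lines the content `"  Text Text Text ..."`, which joins back to `#` plus three spaces, as reported in the issue. In `tabs_only` the split falls after the tab, so the prefix stays `"#\t"`.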


In the course of writing this explanation I have however had the idea that it could possibly convert indent spaces back into tabs as a final step, if the user has the document set as tabs, or restore the original indent characters.
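One way the "convert back as a final step" idea could look is sketched below in Python. It is purely illustrative: `retab_leading` is a hypothetical helper, and it assumes the tab size is known and that only the whitespace directly after the comment marker is rewritten.

```python
def retab_leading(text, start_col, tab_size):
    """Rewrite the leading spaces of `text` using tabs where possible.

    `start_col` is the visual column at which `text` begins. A tab is
    emitted whenever the next tab stop is still inside the indent; any
    remainder stays as spaces.
    """
    stripped = text.lstrip(" ")
    col = start_col
    end_col = start_col + len(text) - len(stripped)
    out = []
    while col < end_col:
        next_stop = col + tab_size - col % tab_size
        if next_stop <= end_col:
            out.append("\t")
            col = next_stop
        else:
            out.append(" " * (end_col - col))
            col = end_col
    return "".join(out) + stripped


# Text after the '#' marker (which occupies column 0), tab size 4:
text_line = "#" + retab_leading("   Text Text Text ...", 1, 4)
head_line = "#" + retab_leading(" HEADLINE", 1, 4)
```

Here `text_line` becomes `#` followed by a tab, while `head_line` keeps its single space because the indent ends before the next tab stop. Note that this approach would also turn three deliberately typed spaces into a tab; restoring the original indent characters would instead require remembering them per line.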

Just as a note, if you do something like your last example

#[space]SAMPLE CODE
#[space]Some descriptive text:
#
#[space][tab]Code code code....

Then the "Code" line won't be counted as a code block, since the [tab] only counts as 2 spaces, not the required 4.


* Currently most comments are processed as markdown, but I plan on having this user-configurable in the near future.

CodingMarkus commented 2 years ago

Let me just quickly point out why I use tabs for indention within source code and comments in the first place.

There are two major arguments that speak for tabs:

  1. If you use tabs for indention, you can clearly separate indention from any other kind of space. If there is just a couple of spaces somewhere in a document, I cannot distinguish if these are just spaces (having no meaning) or supposed to be indention (special space that has a meaning). Tabs resolve this ambiguity. One tab in front, indent by one level. Two tabs in front, indent by two levels. 4 spaces in front, indent by one level. 8 spaces in front, indent by two levels. 6 spaces in front?... ahhh ... 🤷‍♂️ 5 spaces in front? Maybe 4 spaces with an accidental stray space at the end? There's a lot of ambiguity if n-spaces mean one indention as n varies from file to file and nobody stops you from not using an integral multiple of n, which gets very confusing. "One tab equals one indention" is unambiguous, though. And it also resolves the problem that the editor has to guess how many spaces are one indention level, as there is no standard way to store that information within a source file itself. Also the editor can never guess that for a file that has no indention so far at all.

  2. When showing source code to other people, I sometimes use a big screen, sometimes a small one, sometimes a projector and sometimes a smartphone. By default I configure tabs to be 4 spaces long, however, this may make the indention way too big or not obvious enough depending on screen size or distance of the viewer. In that case I can simply make a tab display as 2, 3, 5, or 6 spaces instead. This is easily possible with pretty much every editor. Even when showing source in a terminal window with less I can change the tab size easily. I cannot do that if I use spaces for indention; at least not everywhere and with every editor. And 4 spaces may be great for me when writing code but it's not great for everyone and not under all circumstances. Using tabs for indention makes indention size flexible and easy to change without touching the code at all or having to use some kind of special code editor.

That said, I do use spaces for alignment. E.g. in the following source code, ---> is a tab and a . is a space:

--->someFunction(arg1, arg2, arg3,
--->.............arg4, arg5)

While this is not my style of writing code (it's for demonstration purposes only; personally I would raise indention in the 2nd line), if I want the arguments of the second line to align with the ones of the first one, of course I must use spaces, as even if I change the tab size, I still want them to align. Here space does not serve the purpose of indention but the purpose of alignment, which requires a fixed size. Both lines are still only indented a single level. You cannot align with tabs, as that would require tabs to have a fixed size that never changes, and this just isn't the case. Yet that's no argument against using tabs for indention, because indention doesn't need a fixed size. Its purpose is to set the indention level, and that's very easy if all you need to do is count leading tab characters.

In case of comments, it's partly the other way round. In a script I always leave a space between # and comment text, otherwise it would look like a Twitter #hashtag. So comments are like

# This is a multiline
# comment block.

However, for indention within comments, I still use tabs as when changing tab size, I want the indention to change, too. That's why I have comment blocks like this:

# SUMMARY
#   Check if a list of given commands can be safely called
#   from within a script.

If I now raise the tab size to 8, the same block will become:

# SUMMARY
#       Check if a list of given commands can be safely called
#       from within a script.

which makes the indention far more obvious to the reader. This comment format is loosely based upon ROBODoc.

Thus it would be great to have a non-Markdown mode that pays no attention to the content of the comments and just groups lines by equal prefix. All comment lines with an equal prefix belong to a group, and if a line is broken, the new line gets exactly the same prefix as the line currently being edited. With such a mode, this comment

 *   The very first line that needs rewrapping. 
 *   The second line following the first line.
 *     Another long line that needs rewrapping for sure.
 *     And even that is a long line that needs rewrapping
 * And here's the last line that needs rewrapping.

is broken as

 *   The very first line that needs 
 *   rewrapping. The second line 
 *   following the first line.
 *     Another long line that needs 
 *     rewrapping for sure. And even 
 *     that is a long line that
 *     needs rewrapping
 * And here's the last line that
 * needs rewrapping.

while preserving any whitespace before and after the asterisk exactly as it is.
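A small Python sketch of such a prefix-only mode, for illustration. `rewrap_by_prefix` is a hypothetical name and nothing here comes from Rewrap itself. It assumes `*` as the marker, a wrap column of 36 (chosen because it reproduces the line breaks in the example above), and that trailing spaces in the text may be dropped.

```python
import re
import textwrap
from itertools import groupby


def rewrap_by_prefix(lines, width, marker="*", tab_size=4):
    """Rewrap consecutive lines that share an identical prefix.

    The prefix (whitespace, marker, whitespace) is copied unchanged onto
    every output line; only its visual width is used for the wrap math.
    """
    prefix_re = re.compile(r"^([ \t]*" + re.escape(marker) + r"[ \t]*)(.*)$")
    pairs = []
    for line in lines:
        m = prefix_re.match(line)
        pairs.append((m.group(1), m.group(2).strip()) if m else ("", line.strip()))

    out = []
    for prefix, group in groupby(pairs, key=lambda pair: pair[0]):
        text = " ".join(t for _, t in group)
        # Tabs in the prefix are measured with the assumed tab size.
        room = max(width - len(prefix.expandtabs(tab_size)), 1)
        for piece in textwrap.wrap(text, width=room) or [""]:
            out.append(prefix + piece)
    return out


source = [
    " *   The very first line that needs rewrapping. ",
    " *   The second line following the first line.",
    " *     Another long line that needs rewrapping for sure.",
    " *     And even that is a long line that needs rewrapping",
    " * And here's the last line that needs rewrapping.",
]
wrapped = rewrap_by_prefix(source, width=36)
```

The content is never parsed, so relative indents, tabs and spaces inside the prefix survive exactly as typed.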

ioquatix commented 2 years ago

Thanks for creating this issue and the related discussion.

After reading this, I came to realise that hard wrapping at 80 characters is kind of impossible in the presence of tabs, because tab width is flexible.

While I too would like this feature, I don't know how you hard wrap text with tabs without assuming tabs have a specific width, which is the antithesis of tabs in the first place :p

In any case, I also don't want my tabs converted to spaces. At best, I guess you can assume the tab width of the editor at the time the conversion is done, for the sake of computing hard wrapping.
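The dependence on tab width is easy to demonstrate with Python's built-in `str.expandtabs`. The helper name `display_width` and the 80-column limit are just assumptions for this illustration.

```python
def display_width(line, tab_size):
    """Visual width of `line` when tabs advance to the next tab stop."""
    return len(line.expandtabs(tab_size))


line = "#\t" + "x" * 76

width_at_4 = display_width(line, 4)  # marker + tab fill columns 0-3
width_at_8 = display_width(line, 8)  # marker + tab fill columns 0-7

fits_at_4 = width_at_4 <= 80
fits_at_8 = width_at_8 <= 80
```

The same line is exactly 80 columns wide at a tab size of 4 but 84 columns wide at a tab size of 8, so any hard wrap has to pick one tab size, for example the one the editor uses at the moment of wrapping.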

Pxtl commented 11 months ago

Yes. I try to use tabs instead of spaces because I have some co-workers with poor vision and I've read that allowing them to customize their tab-size is helpful (and cannot be done with spaces). Rewrap swapping out tabs for spaces when not even modifying the indent is disappointing.

It happens even in the most trivial case.

For example, here's the header for a simple PowerShell script that wraps vstest.console.exe:

<#
.SYNOPSIS
tab⟶run tests within the given vstest dll using vstest.console.exe
.NOTES
tab⟶reads the path to vstest.console.exe from Build\BuildConfig.json
#>

This gets converted into

<#
.SYNOPSIS
␠␠␠␠run tests within the given vstest dll using vstest.console.exe
.NOTES
␠␠␠␠reads the path to vstest.console.exe from Build\BuildConfig.json
#>

even when no change in wrapping is happening at all.

Ideally, Rewrap should reuse the indentation characters of the existing lines when it indents.

Failing that, it should be respecting the editor preference. If the editor is set to use tabs and treat them as 4 spaces wide, use tabs and consider them to be 4 spaces wide for the purposes of wrapping.
