Tab width support - Githubissues

mgeisler / textwrap

An efficient and powerful Rust library for word wrapping text.

MIT License

449 stars 44 forks source link

Tab width support #429

Closed itsjunetime closed 1 year ago

itsjunetime commented 2 years ago

This should add basic support for specifying a tab width within the Options struct, as discussed in #419. As it turns out, I had to add a tab_width field within both Options and Word, though that was not what was planned.

This PR should include all extra documentation needed to support this feature, and also includes a few more tests with the areas of this library that I understand well enough to add tests for (and passes all tests as well).

There are some functions in the library that I couldn't find a way to add support for tab widths within (specifically, unfill, since you have no way to pass in options, and refill, since you can't pass in the options to unfill so you can't be certain it was actually unfilled correctly). wrap, wrap_columns, and fill_inplace all have tests, though, and should work just as expected.

I am very open to criticism and suggestions for how differently this should be implemented if anyone feels that this is not the best way. Also please ask any questions you have, I would understand if my reasoning for parts of this is somewhat confusing.

mgeisler commented 2 years ago

Hey Ian, thanks a lot, I'm glad you got it working :smile:

I've only had a quick look and I agree with you that it looks a bit strange that the tab_width setting spreads all the way into Word... Perhaps an alternative would be to let the normal logic do its thing and then simply adjust the whitespace width on the words afterwards?

Let me go through and see if I can add some comments.

mgeisler commented 2 years ago

wrap, wrap_columns, and fill_inplace all have tests, though, and should work just as expected.

Perhaps I'm blind, but I don't see new tests for wrap? I think it would be useful to have tests which check the new behavior, including a doctest or two which help serve as documentation.

The new tests should probably be at both the wrap level and at the WordSeparator level. As far as I understand, we now treat '\t' as whitespace, so "foo\tbar" is now two words: it used to be a single word. This should be reflected in new tests.

mgeisler commented 2 years ago

There are some functions in the library that I couldn't find a way to add support for tab widths within (specifically, unfill, since you have no way to pass in options, and refill, since you can't pass in the options to unfill so you can't be certain it was actually unfilled correctly).

That is fine, unfill should not take Options — it attempt to find the Options which is compatible with the text given and it will then return these options. I think it's completely fine if unfill ignores the tab width: I think it would be unnecessarily complex to try and guess the correct tab width.

I guess you can pass in Options to refill and that it would then use the new tab width when wrapping the text?

mgeisler commented 2 years ago

I tried running the wrapping benchmark on your branch, and it now takes ~20% longer to wrap the text:

% cargo criterion fill_optimal_fit_ascii/4800
String lengths/fill_optimal_fit_ascii/4800
                        time:   [257.09 us 260.30 us 264.18 us]
                        change: [+18.779% +19.435% +20.187%] (p = 0.00 < 0.05)
                        Performance has regressed.

This is a bit much for such a "small" feature. Do you see similar results on your machine?

itsjunetime commented 2 years ago

I am seeing some implementation flaws in this, and have some thoughts that I'd like to get your opinion on.

So, I initially added the tab_width parameter into the Word struct so that it could be used within the break_apart function (when calling ch_width). However, now I realize that tab_width will only really be used inside Word to calculate the whitespace width (since tabs can only be contained within the whitespace and thus will not affect the behavior within break_apart at all), and it feels kind of silly to use it for only that one thing. However, whitespace_width doesn't take any parameters (and can't if we want to keep the API the same), so we have to only use members of the Word struct to determine whitespace_width.

I saw what you said about not including a tab_width parameter on Word, but instead calculating the whitespace width on the word after it's already been created, and I think I like this solution. My plan is, then, to replace the tab_width parameter on Word with a whitespace_width parameter, which will be calculated within the map on lines 1097 - 1100 of src/lib.rs. Then, we can prevent some unnecessary cpu usage when whitespace_width() is called repeatedly (it won't have to be recalculated multiple times, and even though Word will still contain one parameter whose only function is to help with whitespace_width, it will be more directly applicable and save cpu usage.

I think I'll start implementing this, push some changes, and let you know when to re-review it :)

mgeisler commented 2 years ago

I saw what you said about not including a tab_width parameter on Word, but instead calculating the whitespace width on the word after it's already been created, and I think I like this solution. My plan is, then, to replace the tab_width parameter on Word with a whitespace_width parameter, which will be calculated within the map on lines 1097 - 1100 of src/lib.rs

Yeah, I think that would simplify things. Perhaps it can be simplified even further by doing something like this

    for line in text.split('\n') {
        let words = options.word_separator.find_words(line).map(|word| {
            word.whitespace_width += word.whitespace.count('\t') * options.tab_width;
            word
        });
        let split_words = word_splitters::split_words(words, &options.word_splitter);

Hmm, it seems there isn't a str::count method like I invented above :smile: Perhaps you would need to do word.whitespace.bytes().iter().filter(|b| == b"\t").count() or similar.

In any case, I think we can find the width of the whitespace in an efficient way without having to re-measure things more than necessary.

This brings me to another point which we should consider: how does this impact what a "word separator" is? Until now, "words" are anything between one or more space characters. The AsciiSpace word separator implements this. With your change, the name is at best misleading :smile:

I just added a little example program in #430 which allows us to see in detail what words the library finds. It would be great if you could rebase your branch on top of master and force push the PR. Then it's easier for both of us to experiment.

Without this PR, \t is seen as a word:

% cargo run -q --example debug-words --no-default-features $'foo\tbar \t baz'
word_separator = AsciiSpace
text = "foo\tbar \t baz"
words = [
    Word {
        word: "foo\tbar",
        whitespace: " ",
        penalty: "",
        width: 7,
    },
    Word {
        word: "\t",
        whitespace: " ",
        penalty: "",
        width: 1,
    },
    Word {
        word: "baz",
        whitespace: "",
        penalty: "",
        width: 3,
    },
]

After this PR, \t is included in the whitespace (but it is not yet used to separate words):

% cargo run -q --example debug-words --no-default-features $'foo\tbar \t baz'
word_separator = AsciiSpace
text = "foo\tbar \t baz"
words = [
    Word {
        word: "foo\tbar",
        whitespace: " ",
        penalty: "",
        width: 6,
        tab_width: 0,
    },
    Word {
        word: "",
        whitespace: "\t ",
        penalty: "",
        width: 0,
        tab_width: 0,
    },
    Word {
        word: "baz",
        whitespace: "",
        penalty: "",
        width: 3,
        tab_width: 0,
    },
]

This seems inconsistent and I'm not actually sure what the new "rules" are :smile:

The UnicodeBreakProperties word separator should probably not change — it promises to implement the Unicode line breaking algorithm, so if it doesn't include \t as a word separator, then we shouldn't include it either.

itsjunetime commented 2 years ago

You make a good point there (that \t should still be considered a word by itself when using AsciiSpace separator), but then we would need to keep tab_width as a parameter of Word for circumstances when a tab is contained in the word (so when using AsciiSpace), so that you can still modify the tab_width when using AsciiSpace as a separator. Do we then want to keep tab_width AND add whitespace_width as a parameter (for circumstances discussed above)? I feel like that's a kind of clunky.

Should we perhaps make two Word structs (one for AsciiSpace and one for UnicodeBreakProperties) and have them both implement Fragment? That feels like a better solution than having a single struct (Word) that behaves differently when different structs create it (since AsciiSpace would have to create a Word that trims only spaces, and UnicodeBreakProperties would create one that trims all unicode whitespace). It would require more code but I feel that it would make more sense.

Thanks for pointing out that words are now split differently, though; I'd like to keep that the same as before. I just think this needs more discussion before I try to rewrite it again.

mgeisler commented 2 years ago

Hi Ian,

but then we would need to keep tab_width as a parameter of Word for circumstances when a tab is contained in the word

You mean because we need to refer to tab_width later? I think we only need to know about the tab width when we call break_words: this is the function that (optionally) breaks a word longer than the line width into pieces fitting inside a line. So after

        let words = options.word_separator.find_words(line);
        let split_words = word_splitters::split_words(words, &options.word_splitter);

we have the words broken down into syllables: so a word like up-to-date becomes up-, to-, date when using the HyphenSplitter. Each of these "words" is what want to wrap. If break_words is true, we run each word through core::break_words, which simply makes sure that each one fits on a line. This is what breaks language into langua and ge when the line width is 6 columns.

So I think we would need to

Adjust the word widths in split_words above.
Let core::break_words know about the tab width so it can generate words with the right width.

This way, the handling of the tab_width is pretty self-contained to wrap and to break_words — and if break_words is false, there is no overhead at all from having a tab_width.

Should we perhaps make two Word structs (one for AsciiSpace and one for UnicodeBreakProperties) and have them both implement Fragment? That feels like a better solution than having a single struct (Word) that behaves differently when different structs create it (since AsciiSpace would have to create a Word that trims only spaces, and UnicodeBreakProperties would create one that trims all unicode whitespace). It would require more code but I feel that it would make more sense.

Having two structs for this is tricky in its own way: Word is used in the return type for WordSeparator and WrapAlgorithm. I did that because those traits are what people can override to customize the wrapping of console text. The traits and the Word struct are for wrapping text set in a fixed width font shown in a terminal, they're not trying to be super general.

The Fragment trait tries to capture a generic "piece of text" which can be wrapped by the wrap_first_fit and wrap_optimal_fit functions. Looking at this again just now, I realize that Fragment is actually only used inside the wrap_algorithms module. Perhaps I should move it there to make this more clear...

mtoohey31 commented 2 years ago

Hello! Sorry to bug you @iandwelker, but are you still planning on working on this, or would you like someone else to take it from here? The current lack of tab support is blocking me from fixing https://github.com/helix-editor/helix/issues/3622, so I (or maybe another helix contributor) might possibly be interested in continuing work on this.

mgeisler commented 2 years ago

Hey @mtoohey31, I think it would be great if you would jump in and work on this!

Looking over https://github.com/helix-editor/helix/issues/3622, I think a first step would be to let unfill treat a leading \t character as part of the indentation. That will allow unfill and refill kind-of work with tab-indentation:

modified   src/lib.rs
@@ -653,7 +653,7 @@ where
 pub fn unfill(text: &str) -> (String, Options<'_>) {
     let line_ending_pat: &[_] = &['\r', '\n'];
     let trimmed = text.trim_end_matches(line_ending_pat);
-    let prefix_chars: &[_] = &[' ', '-', '+', '*', '>', '#', '/'];
+    let prefix_chars: &[_] = &[' ', '\t', '-', '+', '*', '>', '#', '/'];

     let mut options = Options::new(0);
     for (idx, line) in trimmed.lines().enumerate() {
@@ -1814,6 +1814,14 @@ mod tests {
         assert_eq!(options.initial_indent, "  ");
     }

+    #[test]
+    fn unfill_tab_indent() {
+        let (text, options) = unfill("\t\tfoo\n\t\tbar\n\t\tbaz\n");
+        assert_eq!(text, "foo bar baz\n");
+        assert_eq!(options.width, 3);
+        assert_eq!(options.initial_indent, "\t\t");
+    }
+
     #[test]
     fn unfill_differing_indents() {
         let (text, options) = unfill("  foo\n    bar\n  baz");

The \t character is still seen as a control character with a zero width with this patch. That is apparently how it's defined in the relevant Unicode standard, as seen in this playground examle.

If you only want to support tab-indentation (to start with), then you could work around this by counting \t in the indentation and adjusting the width according to the desired tab width.

If you want to wrap at 80 columns, then you would do something like this:

let width = 80;
let tab_width = 4;
let (text, mut options) = unfill(paragraph);
options.width = width - tab_width * options.initial_indent.bytes().filter(|ch| *ch == b'\t').count();
let wrapped = fill(text,

You would need to decide what you want to happen if initial_indent differs from subsequent_indent.

A more correct fix would be to make display_width know about a tab width like the patches by @iandwelker does. That's much more invasive, though, and I'm not sure what the best way to do this is.

itsjunetime commented 2 years ago

Hey! Sorry I haven't been responding or working on this, my need for it has significantly decreased in the past months so it sort of got put on the backburner in my mind. If someone else would like to take a shot at it, feel free to work on top of this PR or make a whole new PR for it; I don't think I'll be putting any time into solving this issue anytime soon, so you won't be stepping on my toes.

mtoohey31 commented 2 years ago

No worries @iandwelker, and thanks for the work you've done on this so far! I'll try and make progress on this as soon as I find time.

And thanks @mgeisler, I'll start with unfill as you've suggested. I would like to allow tab width to be configured too, but as you say it looks like that'll be kinda tricky. I'll try to keep the commits separate in case we end up deciding that some of the changes aren't worth it.

If you only want to support tab-indentation (to start with), then you could work around this by counting \t in the indentation and adjusting the width according to the desired tab width.

That sounds like it might be a good compromise; it would handle all the situations that I've found annyoing with helix. I'll probably still see what I can come up with in terms of a more complete solution, but if it doesn't work out, then I'll go with indentation only.

mgeisler commented 2 years ago

And thanks @mgeisler, I'll start with unfill as you've suggested. I would like to allow tab width to be configured too, but as you say it looks like that'll be kinda tricky. I'll try to keep the commits separate in case we end up deciding that some of the changes aren't worth it.

I think it would be nice to make unfill configurable somehow. We could perhaps create a struct for it and let refill take the same struct (or it could use some defaults instead). Basically, I don't expect refill to be able to cover all use-cases so I suggest keeping it simple. When more control is needed, it should be trivial to re-implement refill and adjust this implementation to do what you want.

mtoohey31 commented 1 year ago

So I've spent some time reading through the discussion here, looking at the existing code, and doing some of my own tinkering. I'm less enthusiastic now about the option of only considering tab indentation, because when initial_indent and subsequent_indent are different, I don't think there's really any way to do the "right thing" without a more proper fix.

As discussed though, if we want to fix this properly, there's quite a few functions that we'll need to pass the tab width down through. Here are some of the potential options:

Just add tab_width parameters where necessary. I don't think this is really an option though, because we don't want to break the public API, right?
Create separate variants of functions that account for tab width, then just have the existing public functions use those with a tab_width parameter of 0 to avoid code duplication. It looks like this is what the existing changes do. This wouldn't break the public API, but it will add a bunch more functions. I worry that this approach might also have an negative performance impact, since there's an extra case to consider when evaluating the width of every character, regardless of whether the user of the crate actually cares about tab width.
Hide tab width support behind a feature that's disabled by default. Then we can add all the tab_width parameters we want without breaking the existing public API, and there won't be any performance degredation for people who don't enable the feature. I like this option the most. One minor downside is that we may end up with quite a few variants of a couple functions whose definitions are already dependent on other features. For example, we could end up with 4 different defitions of ch_width depending on the settings of the unicode-width and tab-width features. We might be able to cut down on code duplication by using cfg_if or something similar though. The only other downside I can think of is that having an extra feature to worry about does make things a little more complex for users of this crate, but I think it's worth the trade-off compared to the other options.

When you have time, could you let me know what you think @mgeisler, particularly about option 3?

mgeisler commented 1 year ago

Hi @mtoohey31, thanks for pushing on this feature!

In general, the public APIs of unfill and refill can be broken as needed and we'll then make a 0.16 release. That's not the end of the world, especially when the much more frequently used APIs of wrap and fill stay intact.

Thanks for writing up the three options, that makes it easier to discuss. I know the crate has a bunch of Cargo feature flags already, but I would actually prefer to have less of them overall. That is mostly because they become and extra layer to reason about: an if statement in disguise. So I'm leaning away from option 3, but I might be wrong of course.

I think the changes in this PR are pretty good, but there are some remaining questions:

Can we avoid the duplicated foo and foo_with_tab functions? Let's do the simplest change first, which is to make ch_width take a tab_width parameter. Perhaps add a benchmark just for ch_width so we can understand what the impact is of that change?
Do we find words separated with TAB as well as on ' '? That is, are there 3 words in "foo\t\bar\tbaz"? I think the PR does not separate words on TAB (sounds correct to me given the WordSeparator::AsciiSpace), but the PR does treat a TAB as trailing whitespace on the Word. I think that is wrong semantically: if we don't use TAB when finding words, then we should also not treat a TAB as whitespace when breaking lines.
Does every Word need a tab_width parameter? The PR gives them one so that words can be split further: should we instead pass down the tab_width to the necessary functions so that they can recompute the word width after splitting. It seems to me that this might be better, but perhaps it's all the same.

I would suggest that a good next step could be to clean up the PR: rebase it and remove all unrelated changes. Make it simple and let's see what that costs in terms of performance.

Does that sound reasonable?

mtoohey31 commented 1 year ago

It's definitely fair that adding another feature will make things more complicated. Now that I've thought about it again, I agree that benchmarking the impact of the simpler approach should be the next step. Right now I'm just guessing about what the performance might be like, which is never a good idea.

So yes, the next steps you've proposed sound good to me. I'll let you know when I've simplified the changes and have benchmarks for us to take a look at.

mgeisler commented 1 year ago

Thanks a lot!

As for how expensive this "should", it's hard to say... I have the impression that Textwrap is more than fast enough right now, as in nobody has ever complained about performance.

So I would say that code clarity should be our first priority 🙂 This will be the first time that the runtime Options play a role all the way into the measurement of text. It will necessarily complicate things to some extent.

mtoohey31 commented 1 year ago

Ok, I finally got around to starting on this. I have a rebased version with the tests passing, and some things (though not everything) simplified here, and there's another branch that just contains the benchmark for comparison here.

Benchmark Results

The differences seem to be within margin of error. At commit 4663a6e on feat/tab-width I get:

Running ch_width.rs (target/release/deps/ch_width-e036a9a20b4c044e)
ch_width                time:   [623.72 ns 624.55 ns 625.60 ns]
                        change: [-0.2085% +0.8424% +2.9066%] (p = 0.33 > 0.05)
                        No change in performance detected.
Found 17 outliers among 100 measurements (17.00%)
  4 (4.00%) high mild
  13 (13.00%) high severe

And on test/ch_width-benchmark I get:

Running ch_width.rs (target/release/deps/ch_width-e036a9a20b4c044e)
ch_width                time:   [625.78 ns 626.30 ns 626.87 ns]
                        change: [-2.1724% -0.5367% +0.4230%] (p = 0.64 > 0.05)
                        No change in performance detected.
Found 16 outliers among 100 measurements (16.00%)
  7 (7.00%) high mild
  9 (9.00%) high severe

So I think that's good news?

Two test related notes/questions:

I get those "Found n outliers" messages with existing tests too, regardless of whether I've closed other programs. I've also tried on a second machine, and I get similar output. Are those warnings expected?
I had to make ch_width public temporarily, since the benchmarks are in a separate crate. Do you want me to keep the benchmark around, or can we get rid of it now that we know the results? And if you do want me to keep it, where should I put it? Without making ch_width public, I think the only place it can go is the main crate, or is there another solution?

Outstanding Implementation Questions

Do we want to avoid breaking the public API of display_width? In commit 0550427 I got rid of display_width_with_tab by adding a tab_width parameter directly to display_width, but I'm not sure if that's an option... If it's not, I can drop that commit and we can move forward with two variants for that function. I haven't come up with any other way to do things other than adding a parameter to the original function, or using a _with_tab variant.
Now that I've looked at the code a little more, I have some thoughts about points two and three in this comment of yours. Regarding the second question, I think that we should be separating words on tabs. They're whitespace, so I would personally expect them to be considered word boundaries, though it doesn't seem like the sort of thing someone will want to do very frequently. Does that sound right to you? Then regarding point three, I don't think we should be storing tab_width inside of words. It definitely seems possible to pass tab_width through the functions instead, but I think we should come to a decision on point two before I make this change, since I suspect that which functions will require tab_width may depend on our answer there. I am somewhat worried that there will be quite a few functions that will need the added parameter, but we can cross that bridge when we come to it, and decide if it's worth it then.

mgeisler commented 1 year ago

Ok, I finally got around to starting on this.

Yay, thank you for this! I just looked at the changes in your branch and it looks good.

Btw, I've just done some autumn-cleaning and moved the top-level functions out of lib.rs: #487. I hope that won't cause too much trouble for you!

I get those "Found n outliers" messages with existing tests too, regardless of whether I've closed other programs. I've also tried on a second machine, and I get similar output. Are those warnings expected?

I've seen them too once in a while! Don't worry about them, Criterion is supposed to handle it for us. In my experience, the benchmark numbers can fluctuate up and down by 5-10%, even without any code changes. I'm getting more stable results on a desktop vs a laptop. I mostly use the numbers to try and see if a change suddenly and unexpectedly makes things much slower (and to verify that things are O(n) when we expect them to be).

I had to make ch_width public temporarily, since the benchmarks are in a separate crate. Do you want me to keep the benchmark around, or can we get rid of it now that we know the results? And if you do want me to keep it, where should I put it? Without making ch_width public, I think the only place it can go is the main crate, or is there another solution?

Thanks for benchmarking it! You could perhaps change the benchmark to call the already public display_width function?

Do we want to avoid breaking the public API of display_width?

Yeah, I'm happy to release this as part of 0.17 release.

I have some thoughts about points two and three in this comment of yours. Regarding the second question, I think that we should be separating words on tabs. They're whitespace, so I would personally expect them to be considered word boundaries, though it doesn't seem like the sort of thing someone will want to do very frequently. Does that sound right to you?

Yeah, that sounds good. I just tested with https://mgeisler.github.io/textwrap/ and this shows that copy-pasting a TAB into the text creates a word boundary when using WordSeparator::UnicodeBreakProperties.

You could consider making this change as a follow-up: leave word separation as-is for now and only introduce the plumbing needed to pass down the tab_width to the right places.

Then regarding point three, I don't think we should be storing tab_width inside of words. It definitely seems possible to pass tab_width through the functions instead

Cool. My thinking was simply that the value will be constant for all words and thus it seems unnecessary to store it more than once.

mtoohey31 commented 1 year ago

Btw, I've just done some autumn-cleaning and moved the top-level functions out of lib.rs: #487. I hope that won't cause too much trouble for you!

No worries, it wasn't that much work!

You could perhaps change the benchmark to call the already public display_width function?

Sounds good, I've made that change in a1ff94e30a6556a1e072594a98a888b16438e3ad.

You could consider making this change as a follow-up: leave word separation as-is for now and only introduce the plumbing needed to pass down the tab_width to the right places.

That also sounds good. I believe I've covered all the places where tab_width is needed in c6b90489d490ab678b170bb1433a74ca1982ab1c, but I'm not familiar with everything in the crate, so if you can think of any other others that might need it, let me know.

The other update that I've made is to remove the one remaining TODO comment in 484c29b6b2f2c0490a1bc0b138ec48b1ecef87fc. I'm pretty sure that using spaces is consistent with the behaviour of wrap, since it removes trailing whitespace. Does that sound right to you?

Also, should I be opening a pull request from my branch now? I don't think there's any way for me to use this existing pull request, right?

mgeisler commented 1 year ago

Also, should I be opening a pull request from my branch now? I don't think there's any way for me to use this existing pull request, right?

Yeah, I think that is true. You would need write permission on @iandwelker's clone, but it's much easier to create a new PR. That will give us a fresh empty state to discuss from.

Thanks a lot to both @iandwelker for starting this and to you for finishing it!

That also sounds good. I believe I've covered all the places where tab_width is needed in c6b9048, but I'm not familiar with everything in the crate, so if you can think of any other others that might need it, let me know.

Thanks, I'll love to have that as a PR. Please squash all changes into a single clean commit on top of the latest master. That way we can talk about the changes as a unit instead of looking at the evolution of the changes.

mgeisler commented 1 year ago

Let's close this one now that we have #490.