Closed itsjunetime closed 1 year ago
Hey Ian, thanks a lot, I'm glad you got it working :smile:
I've only had a quick look and I agree with you that it looks a bit strange that the tab_width
setting spreads all the way into Word
... Perhaps an alternative would be to let the normal logic do its thing and then simply adjust the whitespace width on the words afterwards?
Let me go through and see if I can add some comments.
wrap
,wrap_columns
, andfill_inplace
all have tests, though, and should work just as expected.
Perhaps I'm blind, but I don't see new tests for wrap
? I think it would be useful to have tests which check the new behavior, including a doctest or two which help serve as documentation.
The new tests should probably be at both the wrap
level and at the WordSeparator
level. As far as I understand, we now treat '\t'
as whitespace, so "foo\tbar"
is now two words: it used to be a single word. This should be reflected in new tests.
There are some functions in the library that I couldn't find a way to add support for tab widths within (specifically,
unfill
, since you have no way to pass in options, andrefill
, since you can't pass in the options to unfill so you can't be certain it was actually unfilled correctly).
That is fine, unfill
should not take Options
— it attempt to find the Options
which is compatible with the text given and it will then return these options. I think it's completely fine if unfill
ignores the tab width: I think it would be unnecessarily complex to try and guess the correct tab width.
I guess you can pass in Options
to refill
and that it would then use the new tab width when wrapping the text?
I tried running the wrapping benchmark on your branch, and it now takes ~20% longer to wrap the text:
% cargo criterion fill_optimal_fit_ascii/4800
String lengths/fill_optimal_fit_ascii/4800
time: [257.09 us 260.30 us 264.18 us]
change: [+18.779% +19.435% +20.187%] (p = 0.00 < 0.05)
Performance has regressed.
This is a bit much for such a "small" feature. Do you see similar results on your machine?
I am seeing some implementation flaws in this, and have some thoughts that I'd like to get your opinion on.
So, I initially added the tab_width
parameter into the Word
struct so that it could be used within the break_apart
function (when calling ch_width
). However, now I realize that tab_width
will only really be used inside Word
to calculate the whitespace width (since tabs can only be contained within the whitespace and thus will not affect the behavior within break_apart
at all), and it feels kind of silly to use it for only that one thing. However, whitespace_width
doesn't take any parameters (and can't if we want to keep the API the same), so we have to only use members of the Word
struct to determine whitespace_width
.
I saw what you said about not including a tab_width
parameter on Word
, but instead calculating the whitespace width on the word after it's already been created, and I think I like this solution. My plan is, then, to replace the tab_width
parameter on Word
with a whitespace_width
parameter, which will be calculated within the map
on lines 1097 - 1100 of src/lib.rs. Then, we can prevent some unnecessary cpu usage when whitespace_width()
is called repeatedly (it won't have to be recalculated multiple times, and even though Word
will still contain one parameter whose only function is to help with whitespace_width
, it will be more directly applicable and save cpu usage.
I think I'll start implementing this, push some changes, and let you know when to re-review it :)
I saw what you said about not including a
tab_width
parameter onWord
, but instead calculating the whitespace width on the word after it's already been created, and I think I like this solution. My plan is, then, to replace thetab_width
parameter onWord
with awhitespace_width
parameter, which will be calculated within themap
on lines 1097 - 1100 of src/lib.rs
Yeah, I think that would simplify things. Perhaps it can be simplified even further by doing something like this
for line in text.split('\n') {
let words = options.word_separator.find_words(line).map(|word| {
word.whitespace_width += word.whitespace.count('\t') * options.tab_width;
word
});
let split_words = word_splitters::split_words(words, &options.word_splitter);
Hmm, it seems there isn't a str::count
method like I invented above :smile: Perhaps you would need to do word.whitespace.bytes().iter().filter(|b| == b"\t").count()
or similar.
In any case, I think we can find the width of the whitespace in an efficient way without having to re-measure things more than necessary.
This brings me to another point which we should consider: how does this impact what a "word separator" is? Until now, "words" are anything between one or more space characters. The AsciiSpace
word separator implements this. With your change, the name is at best misleading :smile:
I just added a little example program in #430 which allows us to see in detail what words the library finds. It would be great if you could rebase your branch on top of master
and force push the PR. Then it's easier for both of us to experiment.
Without this PR, \t
is seen as a word:
% cargo run -q --example debug-words --no-default-features $'foo\tbar \t baz'
word_separator = AsciiSpace
text = "foo\tbar \t baz"
words = [
Word {
word: "foo\tbar",
whitespace: " ",
penalty: "",
width: 7,
},
Word {
word: "\t",
whitespace: " ",
penalty: "",
width: 1,
},
Word {
word: "baz",
whitespace: "",
penalty: "",
width: 3,
},
]
After this PR, \t
is included in the whitespace (but it is not yet used to separate words):
% cargo run -q --example debug-words --no-default-features $'foo\tbar \t baz'
word_separator = AsciiSpace
text = "foo\tbar \t baz"
words = [
Word {
word: "foo\tbar",
whitespace: " ",
penalty: "",
width: 6,
tab_width: 0,
},
Word {
word: "",
whitespace: "\t ",
penalty: "",
width: 0,
tab_width: 0,
},
Word {
word: "baz",
whitespace: "",
penalty: "",
width: 3,
tab_width: 0,
},
]
This seems inconsistent and I'm not actually sure what the new "rules" are :smile:
The UnicodeBreakProperties
word separator should probably not change — it promises to implement the Unicode line breaking algorithm, so if it doesn't include \t
as a word separator, then we shouldn't include it either.
You make a good point there (that \t
should still be considered a word by itself when using AsciiSpace
separator), but then we would need to keep tab_width
as a parameter of Word
for circumstances when a tab is contained in the word (so when using AsciiSpace
), so that you can still modify the tab_width
when using AsciiSpace
as a separator. Do we then want to keep tab_width
AND add whitespace_width
as a parameter (for circumstances discussed above)? I feel like that's a kind of clunky.
Should we perhaps make two Word
structs (one for AsciiSpace
and one for UnicodeBreakProperties
) and have them both implement Fragment
? That feels like a better solution than having a single struct (Word
) that behaves differently when different structs create it (since AsciiSpace
would have to create a Word
that trims only spaces, and UnicodeBreakProperties
would create one that trims all unicode whitespace). It would require more code but I feel that it would make more sense.
Thanks for pointing out that words are now split differently, though; I'd like to keep that the same as before. I just think this needs more discussion before I try to rewrite it again.
Hi Ian,
but then we would need to keep
tab_width
as a parameter ofWord
for circumstances when a tab is contained in the word
You mean because we need to refer to tab_width
later? I think we only need to know about the tab width when we call break_words
: this is the function that (optionally) breaks a word longer than the line width into pieces fitting inside a line. So after
let words = options.word_separator.find_words(line);
let split_words = word_splitters::split_words(words, &options.word_splitter);
we have the words broken down into syllables: so a word like up-to-date
becomes up-
, to-
, date
when using the HyphenSplitter
. Each of these "words" is what want to wrap. If break_words
is true
, we run each word through core::break_words
, which simply makes sure that each one fits on a line. This is what breaks language
into langua
and ge
when the line width is 6 columns.
So I think we would need to
split_words
above.core::break_words
know about the tab width so it can generate words with the right width.This way, the handling of the tab_width
is pretty self-contained to wrap
and to break_words
— and if break_words
is false
, there is no overhead at all from having a tab_width
.
Should we perhaps make two
Word
structs (one forAsciiSpace
and one forUnicodeBreakProperties
) and have them both implementFragment
? That feels like a better solution than having a single struct (Word
) that behaves differently when different structs create it (sinceAsciiSpace
would have to create aWord
that trims only spaces, andUnicodeBreakProperties
would create one that trims all unicode whitespace). It would require more code but I feel that it would make more sense.
Having two structs for this is tricky in its own way: Word
is used in the return type for WordSeparator
and WrapAlgorithm
. I did that because those traits are what people can override to customize the wrapping of console text. The traits and the Word
struct are for wrapping text set in a fixed width font shown in a terminal, they're not trying to be super general.
The Fragment
trait tries to capture a generic "piece of text" which can be wrapped by the wrap_first_fit
and wrap_optimal_fit
functions. Looking at this again just now, I realize that Fragment
is actually only used inside the wrap_algorithms
module. Perhaps I should move it there to make this more clear...
Hello! Sorry to bug you @iandwelker, but are you still planning on working on this, or would you like someone else to take it from here? The current lack of tab support is blocking me from fixing https://github.com/helix-editor/helix/issues/3622, so I (or maybe another helix contributor) might possibly be interested in continuing work on this.
Hey @mtoohey31, I think it would be great if you would jump in and work on this!
Looking over https://github.com/helix-editor/helix/issues/3622, I think a first step would be to let unfill
treat a leading \t
character as part of the indentation. That will allow unfill
and refill
kind-of work with tab-indentation:
modified src/lib.rs
@@ -653,7 +653,7 @@ where
pub fn unfill(text: &str) -> (String, Options<'_>) {
let line_ending_pat: &[_] = &['\r', '\n'];
let trimmed = text.trim_end_matches(line_ending_pat);
- let prefix_chars: &[_] = &[' ', '-', '+', '*', '>', '#', '/'];
+ let prefix_chars: &[_] = &[' ', '\t', '-', '+', '*', '>', '#', '/'];
let mut options = Options::new(0);
for (idx, line) in trimmed.lines().enumerate() {
@@ -1814,6 +1814,14 @@ mod tests {
assert_eq!(options.initial_indent, " ");
}
+ #[test]
+ fn unfill_tab_indent() {
+ let (text, options) = unfill("\t\tfoo\n\t\tbar\n\t\tbaz\n");
+ assert_eq!(text, "foo bar baz\n");
+ assert_eq!(options.width, 3);
+ assert_eq!(options.initial_indent, "\t\t");
+ }
+
#[test]
fn unfill_differing_indents() {
let (text, options) = unfill(" foo\n bar\n baz");
The \t
character is still seen as a control character with a zero width with this patch. That is apparently how it's defined in the relevant Unicode standard, as seen in this playground examle.
If you only want to support tab-indentation (to start with), then you could work around this by counting \t
in the indentation and adjusting the width according to the desired tab width.
If you want to wrap at 80 columns, then you would do something like this:
let width = 80;
let tab_width = 4;
let (text, mut options) = unfill(paragraph);
options.width = width - tab_width * options.initial_indent.bytes().filter(|ch| *ch == b'\t').count();
let wrapped = fill(text,
You would need to decide what you want to happen if initial_indent
differs from subsequent_indent
.
A more correct fix would be to make display_width
know about a tab width like the patches by @iandwelker does. That's much more invasive, though, and I'm not sure what the best way to do this is.
Hey! Sorry I haven't been responding or working on this, my need for it has significantly decreased in the past months so it sort of got put on the backburner in my mind. If someone else would like to take a shot at it, feel free to work on top of this PR or make a whole new PR for it; I don't think I'll be putting any time into solving this issue anytime soon, so you won't be stepping on my toes.
No worries @iandwelker, and thanks for the work you've done on this so far! I'll try and make progress on this as soon as I find time.
And thanks @mgeisler, I'll start with unfill
as you've suggested. I would like to allow tab width to be configured too, but as you say it looks like that'll be kinda tricky. I'll try to keep the commits separate in case we end up deciding that some of the changes aren't worth it.
If you only want to support tab-indentation (to start with), then you could work around this by counting
\t
in the indentation and adjusting the width according to the desired tab width.
That sounds like it might be a good compromise; it would handle all the situations that I've found annyoing with helix. I'll probably still see what I can come up with in terms of a more complete solution, but if it doesn't work out, then I'll go with indentation only.
And thanks @mgeisler, I'll start with
unfill
as you've suggested. I would like to allow tab width to be configured too, but as you say it looks like that'll be kinda tricky. I'll try to keep the commits separate in case we end up deciding that some of the changes aren't worth it.
I think it would be nice to make unfill
configurable somehow. We could perhaps create a struct for it and let refill
take the same struct (or it could use some defaults instead). Basically, I don't expect refill
to be able to cover all use-cases so I suggest keeping it simple. When more control is needed, it should be trivial to re-implement refill
and adjust this implementation to do what you want.
So I've spent some time reading through the discussion here, looking at the existing code, and doing some of my own tinkering. I'm less enthusiastic now about the option of only considering tab indentation, because when initial_indent
and subsequent_indent
are different, I don't think there's really any way to do the "right thing" without a more proper fix.
As discussed though, if we want to fix this properly, there's quite a few functions that we'll need to pass the tab width down through. Here are some of the potential options:
tab_width
parameters where necessary. I don't think this is really an option though, because we don't want to break the public API, right?tab_width
parameter of 0 to avoid code duplication. It looks like this is what the existing changes do. This wouldn't break the public API, but it will add a bunch more functions. I worry that this approach might also have an negative performance impact, since there's an extra case to consider when evaluating the width of every character, regardless of whether the user of the crate actually cares about tab width.tab_width
parameters we want without breaking the existing public API, and there won't be any performance degredation for people who don't enable the feature. I like this option the most. One minor downside is that we may end up with quite a few variants of a couple functions whose definitions are already dependent on other features. For example, we could end up with 4 different defitions of ch_width
depending on the settings of the unicode-width
and tab-width
features. We might be able to cut down on code duplication by using cfg_if or something similar though. The only other downside I can think of is that having an extra feature to worry about does make things a little more complex for users of this crate, but I think it's worth the trade-off compared to the other options.When you have time, could you let me know what you think @mgeisler, particularly about option 3?
Hi @mtoohey31, thanks for pushing on this feature!
In general, the public APIs of unfill
and refill
can be broken as needed and we'll then make a 0.16 release. That's not the end of the world, especially when the much more frequently used APIs of wrap
and fill
stay intact.
Thanks for writing up the three options, that makes it easier to discuss. I know the crate has a bunch of Cargo feature flags already, but I would actually prefer to have less of them overall. That is mostly because they become and extra layer to reason about: an if
statement in disguise. So I'm leaning away from option 3, but I might be wrong of course.
I think the changes in this PR are pretty good, but there are some remaining questions:
foo
and foo_with_tab
functions? Let's do the simplest change first, which is to make ch_width
take a tab_width
parameter. Perhaps add a benchmark just for ch_width
so we can understand what the impact is of that change?' '
? That is, are there 3 words in "foo\t\bar\tbaz"?
I think the PR does not separate words on TAB (sounds correct to me given the WordSeparator::AsciiSpace
), but the PR does treat a TAB as trailing whitespace on the Word
. I think that is wrong semantically: if we don't use TAB when finding words, then we should also not treat a TAB as whitespace when breaking lines.Word
need a tab_width
parameter? The PR gives them one so that words can be split further: should we instead pass down the tab_width
to the necessary functions so that they can recompute the word width after splitting. It seems to me that this might be better, but perhaps it's all the same.I would suggest that a good next step could be to clean up the PR: rebase it and remove all unrelated changes. Make it simple and let's see what that costs in terms of performance.
Does that sound reasonable?
It's definitely fair that adding another feature will make things more complicated. Now that I've thought about it again, I agree that benchmarking the impact of the simpler approach should be the next step. Right now I'm just guessing about what the performance might be like, which is never a good idea.
So yes, the next steps you've proposed sound good to me. I'll let you know when I've simplified the changes and have benchmarks for us to take a look at.
Thanks a lot!
As for how expensive this "should", it's hard to say... I have the impression that Textwrap is more than fast enough right now, as in nobody has ever complained about performance.
So I would say that code clarity should be our first priority 🙂 This will be the first time that the runtime Options
play a role all the way into the measurement of text. It will necessarily complicate things to some extent.
Ok, I finally got around to starting on this. I have a rebased version with the tests passing, and some things (though not everything) simplified here, and there's another branch that just contains the benchmark for comparison here.
The differences seem to be within margin of error. At commit 4663a6e on feat/tab-width
I get:
Running ch_width.rs (target/release/deps/ch_width-e036a9a20b4c044e)
ch_width time: [623.72 ns 624.55 ns 625.60 ns]
change: [-0.2085% +0.8424% +2.9066%] (p = 0.33 > 0.05)
No change in performance detected.
Found 17 outliers among 100 measurements (17.00%)
4 (4.00%) high mild
13 (13.00%) high severe
And on test/ch_width-benchmark
I get:
Running ch_width.rs (target/release/deps/ch_width-e036a9a20b4c044e)
ch_width time: [625.78 ns 626.30 ns 626.87 ns]
change: [-2.1724% -0.5367% +0.4230%] (p = 0.64 > 0.05)
No change in performance detected.
Found 16 outliers among 100 measurements (16.00%)
7 (7.00%) high mild
9 (9.00%) high severe
So I think that's good news?
Two test related notes/questions:
ch_width
public temporarily, since the benchmarks are in a separate crate. Do you want me to keep the benchmark around, or can we get rid of it now that we know the results? And if you do want me to keep it, where should I put it? Without making ch_width
public, I think the only place it can go is the main crate, or is there another solution?display_width
? In commit 0550427 I got rid of display_width_with_tab
by adding a tab_width
parameter directly to display_width
, but I'm not sure if that's an option... If it's not, I can drop that commit and we can move forward with two variants for that function. I haven't come up with any other way to do things other than adding a parameter to the original function, or using a _with_tab
variant.tab_width
inside of words. It definitely seems possible to pass tab_width
through the functions instead, but I think we should come to a decision on point two before I make this change, since I suspect that which functions will require tab_width
may depend on our answer there. I am somewhat worried that there will be quite a few functions that will need the added parameter, but we can cross that bridge when we come to it, and decide if it's worth it then.Ok, I finally got around to starting on this.
Yay, thank you for this! I just looked at the changes in your branch and it looks good.
Btw, I've just done some autumn-cleaning and moved the top-level functions out of lib.rs
: #487. I hope that won't cause too much trouble for you!
- I get those "Found n outliers" messages with existing tests too, regardless of whether I've closed other programs. I've also tried on a second machine, and I get similar output. Are those warnings expected?
I've seen them too once in a while! Don't worry about them, Criterion is supposed to handle it for us. In my experience, the benchmark numbers can fluctuate up and down by 5-10%, even without any code changes. I'm getting more stable results on a desktop vs a laptop. I mostly use the numbers to try and see if a change suddenly and unexpectedly makes things much slower (and to verify that things are O(n) when we expect them to be).
- I had to make
ch_width
public temporarily, since the benchmarks are in a separate crate. Do you want me to keep the benchmark around, or can we get rid of it now that we know the results? And if you do want me to keep it, where should I put it? Without makingch_width
public, I think the only place it can go is the main crate, or is there another solution?
Thanks for benchmarking it! You could perhaps change the benchmark to call the already public display_width
function?
- Do we want to avoid breaking the public API of
display_width
?
Yeah, I'm happy to release this as part of 0.17 release.
- I have some thoughts about points two and three in this comment of yours. Regarding the second question, I think that we should be separating words on tabs. They're whitespace, so I would personally expect them to be considered word boundaries, though it doesn't seem like the sort of thing someone will want to do very frequently. Does that sound right to you?
Yeah, that sounds good. I just tested with https://mgeisler.github.io/textwrap/ and this shows that copy-pasting a TAB into the text creates a word boundary when using WordSeparator::UnicodeBreakProperties
.
You could consider making this change as a follow-up: leave word separation as-is for now and only introduce the plumbing needed to pass down the tab_width
to the right places.
- Then regarding point three, I don't think we should be storing
tab_width
inside of words. It definitely seems possible to passtab_width
through the functions instead
Cool. My thinking was simply that the value will be constant for all words and thus it seems unnecessary to store it more than once.
Btw, I've just done some autumn-cleaning and moved the top-level functions out of
lib.rs
: #487. I hope that won't cause too much trouble for you!
No worries, it wasn't that much work!
You could perhaps change the benchmark to call the already public
display_width
function?
Sounds good, I've made that change in a1ff94e30a6556a1e072594a98a888b16438e3ad.
You could consider making this change as a follow-up: leave word separation as-is for now and only introduce the plumbing needed to pass down the
tab_width
to the right places.
That also sounds good. I believe I've covered all the places where tab_width
is needed in c6b90489d490ab678b170bb1433a74ca1982ab1c, but I'm not familiar with everything in the crate, so if you can think of any other others that might need it, let me know.
The other update that I've made is to remove the one remaining TODO comment in 484c29b6b2f2c0490a1bc0b138ec48b1ecef87fc. I'm pretty sure that using spaces is consistent with the behaviour of wrap
, since it removes trailing whitespace. Does that sound right to you?
Also, should I be opening a pull request from my branch now? I don't think there's any way for me to use this existing pull request, right?
Also, should I be opening a pull request from my branch now? I don't think there's any way for me to use this existing pull request, right?
Yeah, I think that is true. You would need write permission on @iandwelker's clone, but it's much easier to create a new PR. That will give us a fresh empty state to discuss from.
Thanks a lot to both @iandwelker for starting this and to you for finishing it!
That also sounds good. I believe I've covered all the places where
tab_width
is needed in c6b9048, but I'm not familiar with everything in the crate, so if you can think of any other others that might need it, let me know.
Thanks, I'll love to have that as a PR. Please squash all changes into a single clean commit on top of the latest master. That way we can talk about the changes as a unit instead of looking at the evolution of the changes.
Let's close this one now that we have #490.
This should add basic support for specifying a tab width within the
Options
struct, as discussed in #419. As it turns out, I had to add atab_width
field within bothOptions
andWord
, though that was not what was planned.This PR should include all extra documentation needed to support this feature, and also includes a few more tests with the areas of this library that I understand well enough to add tests for (and passes all tests as well).
There are some functions in the library that I couldn't find a way to add support for tab widths within (specifically,
unfill
, since you have no way to pass in options, andrefill
, since you can't pass in the options to unfill so you can't be certain it was actually unfilled correctly).wrap
,wrap_columns
, andfill_inplace
all have tests, though, and should work just as expected.I am very open to criticism and suggestions for how differently this should be implemented if anyone feels that this is not the best way. Also please ask any questions you have, I would understand if my reasoning for parts of this is somewhat confusing.