No explicit control over CR LF line ending

koiuo commented 2 years ago

Helix editor recently added a feature to reflow a piece of text; helix uses textwrap for this

There's an issue, where reflow does not work correctly if CRLF sequence (\r\n) is used as a line terminator https://github.com/helix-editor/helix/issues/2645

helix does not support CR ending (pre macosx), but I guess, the issue would manifest with it as well.

As far as I can see, textwrap does not account for different line ending, there's no configuration option with which we could tune textwrap behavior.

Did I miss anything? Is there a way to support CRLF in textwrap?

If not, I'd be willing to contribute a fix. My initial idea is to introduce another option into the Options struct, named line_ending (or line_terminator as a fancier alternative). To keep things tight, I'd express it as enum (we can always create something like Custom should we want to support an arbitrary byte sequence later, but I highly doubt that). How does that sound?

mgeisler commented 2 years ago

Hey @edio, thanks for opening the issue. Super exciting that textwrap is used in the editor!

You're not missing anything, textwrap is hard-coded to split on \n in textwrap::wrap. I see you're using textwrap::refill and that is also hard-coded to use \n.

I've been thinking of \n as the logical way to separate lines and \r\n as more of a fileformat encoding of the logical \n characters. That is, my opinion is that programs should use \n throughout their processing and then encode the string to the target file format (e.g., do a .replace("\n", "\r\n") call at the point where the data leaves the program).

Searching a bit, it seems that this way of thinking comes from how languages like C++ handle text mode files. Rust doesn't have a text mode, so I think my way of thinking is outdated :-D

Now, I see a few different approaches for handling this:

We could either push the problem back to the caller: demand that you use \n delimited string with Textwrap. That matches my way of thinking above. Easiest solution to implement for Textwrap :smile:
Alternatively, we could split on a custom line ending as you suggest.
Finally, we could perhaps also use .lines() internally and ask people to use textwrap::wrap (which return a Vec<&str>) instead of textwrap::fill if they want to use custom line endings.

Out of the options, I think the first two are the best. Since Textwrap tries to be useful for many use-cases, I will be happy to see a custom enum like you suggest. I agree that two enum variants ought to be enough for now. Since we split the input just once, I think the performance impact should be negligible, but please compare

% cargo criterion fill_optimal_fit_ascii/4800

before and after to make sure.

mgeisler commented 2 years ago

Another thing: I think Helix is the first place which uses textwrap::refill for real. So if you run into problems with it, then please let me know. Most of the logic is in textwrap::unfill and it could use some real-world feedback since it's quite fuzzy what it should do on different input.

koiuo commented 2 years ago

@mgeisler , thanks for the response. I do not represent the helix dev team. I just had a couple of free evenings and wanted to find an interesting task to work on.

I considered transforming the input before feeding it to textwrap, but that seemed inefficient (and most of all cumbersome).

.lines() + ~textwrap::fill~textwrap:wrap seems like a nice and clean workaround in terms of code. ~That still leaves us with the default \n in the output, however (so it is still cumbersome, even if reduced by half 😁)~

Perhaps I should summon @kirawi here, who's been handling the issue in the helix repo and, I assume, familiar with the helix codebase (unlike me). @kirawi, please read Martin's comment https://github.com/mgeisler/textwrap/issues/453#issuecomment-1146168714 and share your opinion about the proposed solutions. Thank you!

EDIT: I misread. The proposal was to use lines() + textwrap::wrap. Works better for the client, but introduces a slight behavoir inconsistency for this lib: it handles any line ending in the input, but only outputs \n. Not sure if it's a problem.

mgeisler commented 2 years ago

Works better for the client, but introduces a slight behavoir inconsistency for this lib: it handles any line ending in the input, but only outputs \n. Not sure if it's a problem.

My thinking was that the client (the caller of wrap) would use \n or \r\n as desired to put the lines back together. Perhaps the caller would use neither: if the lines are meant to be displayed in an editor (:wink:), then perhaps the lines are just sent one by one to the display engine.

Thanks for the pull request, I'll go take a look now!

koiuo commented 2 years ago

@mgeisler , I benchmarked wrap with lines() vs split('\n') on 4800 input, and consistently across multiple runs the lines() version is slower. lines() additionally removes \r in the end of the line after the split, so it being slower is expected. My laptop isn't tuned for benchmarking (variable CPU frequency, etc.), but the best case of split('\n') is 5% faster than the best case of lines().

IMO, if there's no explicit control over the line ending, lines() should be the default behavior. I'm not sure 5% performance gain is worth the reduced portability (esp. considering Rust targets Windows too), and preprocessing of input will kill the 5% gain if input has CRLF.

Anyway switching to lines() conflicts with the line_ending option. So either we do lines() and call it a day (this will solve the issue for helix), or we introduce the LineEnding.

If we do lines() now, we can still introduce the LineEnding later (in this sense it is a safe option). We'll just want to have a special LineEnding::Auto value. This would be the default, and a specific value like CRLF would give a slight performance boost.

mgeisler commented 2 years ago

@mgeisler , I benchmarked wrap with lines() vs split('\n') on 4800 input, and consistently across multiple runs the lines() version is slower. lines() additionally removes \r in the end of the line after the split, so it being slower is expected. My laptop isn't tuned for benchmarking (variable CPU frequency, etc.), but the best case of split('\n') is 5% faster than the best case of lines().

IMO, if there's no explicit control over the line ending, lines() should be the default behavior. I'm not sure 5% performance gain is worth the reduced portability (esp. considering Rust targets Windows too), and preprocessing of input will kill the 5% gain if input has CRLF.

Anyway switching to lines() conflicts with the line_ending option. So either we do lines() and call it a day (this will solve the issue for helix), or we introduce the LineEnding.

If we do lines() now, we can still introduce the LineEnding later (in this sense it is a safe option). We'll just want to have a special LineEnding::Auto value. This would be the default, and a specific value like CRLF would give a slight performance boost.

I like what you've done in https://github.com/mgeisler/textwrap/pull/454: let the caller of wrap and friends specify what line ending to use both for splitting the lines and, in the case of fill, to use for joining the generated lines.

Let us continue discussing there!

mgeisler commented 2 years ago

Fixed by #454, thanks!

Arnavion commented 1 year ago

This was a semver breaking change since textwrap::Options is an exhaustive struct with pub fields. Code like:

fn textwrap_options() -> textwrap::Options<'static> {
    textwrap::Options {
        width: textwrap::termwidth(),
        initial_indent: "    ",
        subsequent_indent: "    ",
        break_words: true,
        wrap_algorithm: textwrap::WrapAlgorithm::OptimalFit(Default::default()),
        word_separator: textwrap::WordSeparator::UnicodeBreakProperties,
        word_splitter: textwrap::WordSplitter::NoHyphenation,
    }
}

breaks when updating from v0.15.0 to v0.15.1

mgeisler commented 1 year ago

Hi @Arnavion, oh no, I didn't think of that... Thanks for noticing! I'll yank the release and redo the release on the weekend.

I'm curious through, do you have code like that? Why would you write it like that when Options come with a built-in builder?

Arnavion commented 1 year ago

I'm curious through, do you have code like that?

Yes. It failed to compile yesterday and that's why I started investigating and reached here.

Why would you write it like that when Options come with a built-in builder?

I wanted to initialize all the fields with values of my choice, so the builder would be more verbose than setting the fields directly. Also, having a struct with pub fields is an invitation to let users set those fields manually, ie it's an explicit signal that it's okay to set those fields.

mgeisler commented 1 year ago

an invitation to let users set those fields manually, ie it's an explicit signal that it's okay to set those fields.

Yeah, I get that — I made them public to make it easy to reassign a few fields when needed. I just didn't expect anybody to actually use the struct literal syntax itself.

Thanks for letting me know, I'll fix it as soon as I can.

Arnavion commented 1 year ago

Yeah, in that case you can make it #[non_exhaustive] to prevent this from happening in the future.

mgeisler commented 1 year ago

Thanks @Arnavion, I've yanked 0.15.1 and released 0.16.0 instead.

nox commented 1 year ago

Please unyank 0.15.1, clap 3.2.22 explicitly depends on it.

nox commented 1 year ago

https://github.com/clap-rs/clap/issues/4418

nox commented 1 year ago

0.15.0 should have been rereleased as 0.15.2 to make sure downstream users explicitly depending on 0.15.1 don't get stuck.

Arnavion commented 1 year ago

If any crate depends on ^0.15.1 explicitly, it would normally be doing so because it depends on API newly added in 0.15.1, so it wouldn't work with 0.15.0 released as 0.15.2. So doing what you suggest would be pointless in general.

That said, clap 3 seems to have updated to 0.15.1 without depending on the change in 0.15.1 ( https://github.com/clap-rs/clap/commit/2fd55076e65b14fa92b34161c44d5058a9966f69 ), so it would indeed work with 0.15.0 re-released as 0.15.2. Why it updated its dep for no reason is a mystery...

nox commented 1 year ago

If any crate depends on ^0.15.1 explicitly, it would normally be doing so because it depends on API newly added in 0.15.1, so it wouldn't work with 0.15.0 released as 0.15.2. So doing what you suggest would be pointless in general.

This is wrong. If I add a dependency to my system I always take the biggest version number, because then that enforces I don't rely on a bugfix without doing so explicitly, and it also circumvents issues with crates that don't believe in minor version bumps (e.g. serde). Anyway it's good practice to always rerelease a bigger version number whenever a version is yanked to restore whatever was before.

Arnavion commented 1 year ago

That's fine, but I hope you defensively bump every dep you have in every commit you make, no matter how unrelated it is. After all you might have added foo dep at its latest version last week, but your clean build today might be relying on a new bugfix it released yesterday. Otherwise you're not achieving the safety you desire :)

(The only way this can be solved for sure is with -Z minimal-versions.)

Anyway, this is not the place to argue about it, and I agree that there's no harm in releasing 0.15.2.

mgeisler commented 1 year ago

Hi @Arnavion and @nox Thanks for letting me know about this! I did not anticipate such a deadlock when I yanked the release... Let's discuss further fixes in #484.

mgeisler / textwrap

No explicit control over CR LF line ending #453