mgeisler / textwrap

An efficient and powerful Rust library for word wrapping text.
MIT License
467 stars 45 forks source link

Text is wrapped at small and inconsistent widths when attempting to wrap at a very large width. #416

Closed olson-sean-k closed 2 years ago

olson-sean-k commented 3 years ago

When wrapping text at a very large width (e.g., usize::MAX) via textwrap::wrap, it is instead wrapped at small and inconsistent widths. This code...

fn main() {
    let text = "Hello there! This is some English text. It should not be wrapped given the extents \
                below.";
    for line in textwrap::wrap(text, usize::MAX) {
        println!("{}", line);
    }
}

...prints this:

Hello there! This is some
English text. It should
not be wrapped given the extents below.

Note that the wrapped width is not consistent, as the word "English" could fit on the first line without exceeding the width of the last line. I would not expect the text to be wrapped here at all, since it never reaches a width of usize::MAX. I'm not sure at what large widths this behavior begins to manifest.

olson-sean-k commented 3 years ago

This line in the optimized-fit algorithm is suspicious. It subtracts the target width from the line width where the target width is derived from the width given to textwrap::wrap. IIUC, that's usize::MAX in this case, so underflow occurs. Indeed the bug does not occur when the smawk feature is disabled and the first-fit algorithm is used instead.

I believe that subtraction could underflow for inputs that aren't as pathological as usize::MAX. Perhaps a saturating operation should be used there instead?

mgeisler commented 2 years ago

Hi @olson-sean-k, thanks for the test case and the excellent analysis. I have a test cases for wrapping with the maximum line width, but perhaps I only checked the first-fit algorithm... I'll have to check tomorrow.

As for fixing this, I'm not sure exactly how, unfortunately. The algorithm does not lend itself well to saturated addition/subtraction (unfortunately!) and so I've been fighting with corner cases like this one for a while.

Basically, the original algorithm assumed infinite precision math (I ported it from Python) and this breaks down in Rust.

mgeisler commented 2 years ago

I found the test mentioned above and while it does use the optimal-fit algorithm, it only tested a very short string:

    #[test]
    fn max_width() {
        assert_eq!(wrap("foo bar", usize::max_value()), vec!["foo bar"]);
    }

That is of course rather unfortunate and something I will dig into.

olson-sean-k commented 2 years ago

Thanks for taking a look!

Basically, the original algorithm assumed infinite precision math (I ported it from Python) and this breaks down in Rust.

I'm not sure this would be ideal, but maybe the optimized-fit algorithm could use (or fall back to) something like num_bigint::BigUInt for intermediate computations. I believe num-bigint's integer types behave similarly to Python's integers and will grow dynamically as needed.

That may be overkill though and may decrease performance for common use cases that work properly today. If the bounds are known, maybe the algorithm could accept a fixed-sized width (e.g., u64 instead of usize) and use larger integers for intermediate computations (e.g., u128). u128 may use a software implementation on some targets, but I'm guessing that any performance impact would be smaller than using a big integer like BigUint. Since the APIs are already backed by conversions, it should be possible to keep things ergonomic despite using fixed-sized integers.

mgeisler commented 2 years ago

Yeah, you're right: we could go all-in and use a big-int crate of some sort... but I feel it's overkill as you suggest.

For "normal" terminal widths, the wrapping works fine from what I can tell, so I would rather limit the size of the input widths. I've been playing with that in the past, but I didn't finish it.

mgeisler commented 2 years ago

I've done more testing with a build where line widths are limited to u16 and where the optimal-fit computations are done with u64. The fuzz test seems to no longer crash with that setup!

I will have to clean it up a bit and try to understand the consequences of limiting the line width in this way, especially since I'm also trying to make the library work outside of the console (as the Wasm demo). See also #247 for related discussions.

mgeisler commented 2 years ago

I've been testing this a lot more and I believe the solution in #421 will work well: use f64 internally when computing the wrapping cost. With that change, I can no longer trigger overflows in my tests.