mgeisler / textwrap

An efficient and powerful Rust library for word wrapping text.
MIT License
466 stars 45 forks source link

Allow the use of the library outside of the context of the terminal #126

Closed Moxinilian closed 3 years ago

Moxinilian commented 6 years ago

Hello, this is a feature request.

It could be really useful to be able to use this library outside of the context of the CLI. I'm thinking about Piston text, that provides rendered characters with their widths depending on font-size, TrueType Font, etc... I just browsed the source code, and I believe supporting it would only be a matter of allowing to insert custom sizes for characters, but I am not sure as I obviously do not grasp all of the big picture. It would also require to handle floating point numbers in some way, as maybe the size could not be an integer. If you believe it's a good idea and is easy enough, I could do it myself.

Thanks in advance!

mgeisler commented 6 years ago

Hey @Moxinilian, I think that would be a cool feature to add :-)

You're right that the current code simply measures the displayed width of each character in terms of columns, e.g., asdf takes up 4 columns. If you would instead measure the size in pt or px or anything else, the word breaking code should work pretty much unchanged (there's probably a few places where a space is inserted and where the width of that character is hard-coded to 1).

Switching the calculations to floating point numbers is probably also okay. It would still be exact computations for the case where letters are given a width equal to the number of columns they occupy.

Please give it a try and see how it works. I hope the code can remain simple and understandable :-D

Moxinilian commented 6 years ago

On the matter of the API, I was thinking about making Wrapper take an optional closure taking a character and returning its width. Do you think it is reasonable? I was also thinking that computing the size of characters could potentially be expensive (especially for TrueType Font for example). Do you think it is a relevant issue? If so, do you think caching the size of characters has its place in the library, or should it instead be handled by the closure?

Moxinilian commented 6 years ago

So I made a PR to be able to provide custom width for characters, and I am now considering support for floating points. But considering the nature of the change, it would force it to be breaking and therefore force on people the disadvantages of floating points. I could try to support both but that would obviously have an impact on the simplicity of the code. What do you think would be preferable?

mgeisler commented 4 years ago

Hi @Moxinilian and @robinkrahl, as mentioned in #128, I would still like to see textwrap used outside the context of the terminal. So some way of measuring the displayed width of non-console text seems like a great first step.

As mentioned in https://github.com/mgeisler/textwrap/pull/128#issuecomment-711286080, I could imagine expanding the WrapOptions trait with a width(&str) -> T method, which would measure a single word, or rather a grapheme, since I want to start using graphemes as the unit when breaking apart long words.

The implementation of WrapOptions for usize would then change to something along the lines of

impl WrapOptions<usize> for usize {
    fn width(&self, word: &str) -> usize {
        word.width()
    }
}

Writing this makes me realize that there is already a width() -> usize method on the WrapOptions trait :smile: That's a bit silly, so we would need a better word for one of the two...

The Options struct could be parameterized by T and it could gain a width_provider method which takes a closure which measures the width. This would move us back towards the old API where Wrapper<T> was parameterized on the type of "word splitter" used.

An annoying consequence of that was that it was hard to create a program which would dynamically change concrete type of WordSplitter used — you locked yourself into one kind of WordSplitter when you first created your Wrapper struct. The new Options avoids this by instead holding a Box<dyn WordSplitter>. That works because all word splitters have the same API, so I don't know if the same technique will work smoothly when the closures differ in return type?

mgeisler commented 4 years ago

I'm also currently working on completely rewriting the internals so that the code is much less tied to characters/graphemes. Instead, I'm trying to let the code operate in terms of Line and Word structs:

pub struct Word {
    idx: usize,   // Index into current line
    len: usize,   // Lenght in bytes
    width: usize, // Width in columns, including trailing whitespace
}

#[derive(Debug, Copy, Clone, PartialEq)]
pub struct Line {
    idx: usize,   // Index into text
    len: usize,   // Length in bytes
    width: usize, // Width in columns
}

The two width fields could then be T instead.

robinkrahl commented 4 years ago

Sounds like a good approach!

I’m not quite sure about the parametrization. If you don’t want to have the length type as a type parameter, does that mean you’d perform all internal calculations with a f64? This could be problematic because generally speaking, you can’t convert a usize to a f64. So using a type parameter (maybe with num-traits, if required) might be the easier solution.

That works because all word splitters have the same API, so I don't know if the same technique will work smoothly when the closures differ in return type?

I don’t think that works. But a set_width_provider method could accept a F: Fn(...) -> I, I: Into<f64>, wrap it in a closure that performs the conversion and then store that closure in a Box<dyn Fn(...) -> f64> (or whatever type textwrap uses internally).

Also, I noticed that there is an additional problem besides supporting non-usize widths and custom width calculations: When dealing with rich text, the string width also depends on the formatting (font family, font style, font size, ...). There might be different styles within a paragraph. So there should be an option to wrap arbitrary types that implement a Wrappable (?) trait (or, more precisely, a vector/slice/iterator of such elements). Do you think this could be realized within textwrap?

mgeisler commented 4 years ago

I’m not quite sure about the parametrization. If you don’t want to have the length type as a type parameter, does that mean you’d perform all internal calculations with a f64? This could be problematic because generally speaking, you can’t convert a usize to a f64.

Right, that would be troublesome. I think it's fine to have Options<T>. The minor problem I was talking about is that you wouldn't be able to do something like this:

// Simple ASCII-only width provider, returns u16:
let options = Options::new(80).set_width_provider(|word| word.len() as u16);
...
// Fancy width provider, returns f64:
options.set_width_provider(|word| some_font_lib::measure_text(word).width());

since options would be "locked" to Options<u16> after the first line. A similar thing happened (happens) with the Wrapper<T: WordSplitter> API — which I got rid of in #213. However, I don't think this is a real problem since I expect people to stick to one type for their width measurements.

So let's say it's fine to parameterize WrapOptions. I think it could be an associated type which we can use in a method called, say, measure:

pub trait WrapOptions {
    type Width;

    fn measure<'w>(&self, word: &'w str) -> Self::Width;
}

When dealing with rich text, the string width also depends on the formatting (font family, font style, font size, ...). There might be different styles within a paragraph. So there should be an option to wrap arbitrary types that implement a Wrappable (?) trait (or, more precisely, a vector/slice/iterator of such elements).

Yes, that's a great point! My experimental code is exactly trying to abstract away such things by only manipulating words and lines. Putting it behind a trait sounds like just right thing to do.

So far I'm structuring the code so that a first pass finds the words and a second pass creates lines as necessary. So the "words" could also come from somewhere external. I'll try to put it up as soon as possible.

robinkrahl commented 4 years ago

Sounds good! Let me know if I can help implementing or testing.

since options would be "locked" to Options<u16> after the first line. A similar thing happened (happens) with the Wrapper<T: WordSplitter> API — which I got rid of in #213. However, I don't think this is a real problem since I expect people to stick to one type for their width measurements.

If you want to support this use case, it should be possible to add a method that swaps the width provider and keeps the rest of the options, like in this simplified version:

struct Options<T> {
    width_provider: T,
    other_options: usize,
}

impl<T> Options<T> {
    fn with_width_provider<S>(self, width_provider: S) -> Options<S> {
        Options {
            width_provider,
            other_options: self.other_options,
        }
    }
}

(Unfortunately, Rust does not recognize that width_provider is the only field that depends on T and does not let us use the update syntax.)

robinkrahl commented 4 years ago

As mentioned in #128 (comment), I could imagine expanding the WrapOptions trait with a width(&str) -> T method, which would measure a single word, or rather a grapheme, since I want to start using graphemes as the unit when breaking apart long words.

I think it would be better to measure words instead of graphemes so that kerning can be taken into account.

mgeisler commented 3 years ago

Hi @robinkrahl, I've merged the rewrite into master, please see #221.

The main change is that we're now wrapping abstract Fragments instead of iterating through a string char by char. Right now, each fragment has an usize width, but I want to experiment with making this a type parameter so that one could have fragments with widths measured in f32, u16, and whatnot...

You'll find the new building blocks in src/core.rs. The wrapping itself is in wrap_fragments and it's very simple: it just accumulates fragments until the line is full. While the wrapping itself is using the Fragment trait, the full wrap function uses a Word struct which has more methods — in particular, words know how to split themselves into smaller parts to facilitate hyphenation and to break words longer than the current line width.

Does this seem useful? I would be happy to move more functionality from Word into Fragment in order to make this whole thing more generic and reusable.

robinkrahl commented 3 years ago

Thanks @mgeisler, and sorry that it took me so long to get back to you. The wrap_fragments method looks very useful (and I like the house construction example :)).

I’ve updated genpdf to use textwrap, see src/wrap.rs on the textwrap branch [0], and it’s working really well.

[0] https://git.sr.ht/~ireas/genpdf-rs/tree/textwrap/src/wrap.rs

Currently, I’m multiplying all widths (in mm) by 1000 to work around the usize limitation. Of course it would be great if I could use f64 (or even my Mm type) directly.

I have two suggestions regarding the API design:

What do you think?

mgeisler commented 3 years ago

Hey @robinkrahl,

Wow, it's really cool to see that the simple trait seems to work okay! I'm glad you liked the house construction example :-)

Currently, I’m multiplying all widths (in mm) by 1000 to work around the usize limitation. Of course it would be great if I could use f64 (or even my Mm type) directly.

Yes, I agree — I'll definitely like to lift this limitation. I think textwrap only needs a type which implements Add and PartialOrd. We would also need a zero value for the type, but I think that's about it.

  • Currently, I have to compute the segmentation for all words. It would be great if the Fragment trait had a split method instead, e. g.:

    /// Tries to split this fragment into two parts so that the width of
    /// first fragment is equals to or less than the given width.
    fn split(&self, width: usize) -> (Self, Option<Self>);

Interesting idea... right now, textwrap goes through a few steps:

  1. Find words: splitting on whitespace without regard for the line width.
  2. Split words according to hyphenation rules, without regard for the line width.
  3. Forcibly break words — here the line width is used.
  4. Wrap words by putting them into lines based on the line width and the measured widths of the words.

So step 1 and 2 don't use the line width and they don't even need to compute the widths of the words (fragments) they produce. Step 3 and 4 need to know the widths of the words and they both use the line width.

So when would the split method above be called? The "pipeline" above is not set in stone, of course, it just seemed to be most flexible one I've come up with so far.

The pipeline also lends itself to more advanced wrapping algorithms, such as the one I'm introducing in #234. In that PR, I introduce an algorithm which needs all potential break points up-front — which is not the case for the existing "greedy" algorithm. Ideally, I would be able to support both so that the simple algorithm stays lean and fast, while the advanced algorithm remains feasible.

  • For my use case, I’d rather wrap line by line instead of wrapping all lines at once. The text may not fit on one page (or text area, generally speaking), and I don’t have any information about the dimensions of the next area. Maybe the wrapping could be moved into a separate function that handles a single line?

    /// Wraps the text to the given width and returns the first line
    /// and the remainder.
    fn wrap_fragment_line<T: Fragment>(
       fragments: &[T],
       line_width: usize,
    ) -> (&[T], &[T]);

    This could then be used internally by wrap_fragments.

Okay, and your library could then call wrap_fragment_line repeatedly, each time supplying a different value for line_width?

The current wrap_fragments function takes a callback which maps line numbers to line widths:

pub fn wrap_fragments<T: Fragment, F: Fn(usize) -> usize>(
    fragments: &[T],
    line_widths: F,
) -> Vec<&[T]>

However, I guess using the function would require you to pre-compute the available line width for the upcoming lines — and it sounds like you don't have that information? The thinking was that if you know the distance between subsequent baselines and if you know where, say, a figure is placed, then you could compute that lines 1-4 has width 18 cm whereas lines 5-10 have a width of just 13 cm. It sounds kind of complicated, though!

Could you perhaps wrap the entire paragraph at once using the available information about the line width? You get, say, 10 lines back and you start adding them to the page. When you see that you have no more space, then you note how many words (fragments) you output and you then re-wrap the remaining fragments with the new line width for the second page. Yes, this would re-wrap some lines, but since it takes in the order of 100-200 microseconds to wrap even large paragraphs, this might be feasible?

robinkrahl commented 3 years ago

So when would the split method above be called? The "pipeline" above is not set in stone, of course, it just seemed to be most flexible one I've come up with so far.

I was thinking of step 4, but taking into account that other algorithms need to know all possible line breaks, it’s probably best not to include such a method. The only real drawback is that kerning might be ignored, but that should not make a big difference.

Okay, and your library could then call wrap_fragment_line repeatedly, each time supplying a different value for line_width?

Exactly. There are different reasons why it might be difficult to determine the line width in advance:

So I’m currently doing what you described – count the fragments that I already rendered and re-wrap the remaining fragments once the line width changes. Adding a separate wrap_fragment_line function seemed like an easy way to avoid unnecessary calculations. But I see that this will not be possible if textwrap uses a non-greedy wrapping algorithm.

mgeisler commented 3 years ago

I'll close this for now — as demonstrated by @robinkrahl, it is now somewhat possible to use the wrapping code with Fragments that are not just pure text.

The problems with overflow in the OptimalFit code (#247) is something we can look at separately. We'll also work towards further untangling the code in wrap so that we can extract more reusable parts and make it easy for programs to wrap any kind of Fragment that they like.

mgeisler commented 3 years ago

Please take a look at #310 for an example of how one can use Textwrap to wrap text on a HTML canvas.

mgeisler commented 3 years ago

... it's nothing fancy, to be honest, and it uses the "trick" of multiplying the widths by a constant to map f64 to usize in the fragments.