Closed Moxinilian closed 3 years ago
Hey @Moxinilian, I think that would be a cool feature to add :-)
You're right that the current code simply measures the displayed width of each character in terms of columns, e.g., asdf
takes up 4 columns. If you would instead measure the size in pt or px or anything else, the word breaking code should work pretty much unchanged (there's probably a few places where a space is inserted and where the width of that character is hard-coded to 1).
Switching the calculations to floating point numbers is probably also okay. It would still be exact computations for the case where letters are given a width equal to the number of columns they occupy.
Please give it a try and see how it works. I hope the code can remain simple and understandable :-D
On the matter of the API, I was thinking about making Wrapper take an optional closure taking a character and returning its width. Do you think it is reasonable? I was also thinking that computing the size of characters could potentially be expensive (especially for TrueType Font for example). Do you think it is a relevant issue? If so, do you think caching the size of characters has its place in the library, or should it instead be handled by the closure?
So I made a PR to be able to provide custom width for characters, and I am now considering support for floating points. But considering the nature of the change, it would force it to be breaking and therefore force on people the disadvantages of floating points. I could try to support both but that would obviously have an impact on the simplicity of the code. What do you think would be preferable?
Hi @Moxinilian and @robinkrahl, as mentioned in #128, I would still like to see textwrap used outside the context of the terminal. So some way of measuring the displayed width of non-console text seems like a great first step.
As mentioned in https://github.com/mgeisler/textwrap/pull/128#issuecomment-711286080, I could imagine expanding the WrapOptions
trait with a width(&str) -> T
method, which would measure a single word, or rather a grapheme, since I want to start using graphemes as the unit when breaking apart long words.
The implementation of WrapOptions
for usize
would then change to something along the lines of
impl WrapOptions<usize> for usize {
fn width(&self, word: &str) -> usize {
word.width()
}
}
Writing this makes me realize that there is already a width() -> usize
method on the WrapOptions
trait :smile: That's a bit silly, so we would need a better word for one of the two...
The Options
struct could be parameterized by T
and it could gain a width_provider
method which takes a closure which measures the width. This would move us back towards the old API where Wrapper<T>
was parameterized on the type of "word splitter" used.
An annoying consequence of that was that it was hard to create a program which would dynamically change concrete type of WordSplitter
used — you locked yourself into one kind of WordSplitter
when you first created your Wrapper
struct. The new Options
avoids this by instead holding a Box<dyn WordSplitter>
. That works because all word splitters have the same API, so I don't know if the same technique will work smoothly when the closures differ in return type?
I'm also currently working on completely rewriting the internals so that the code is much less tied to characters/graphemes. Instead, I'm trying to let the code operate in terms of Line
and Word
structs:
pub struct Word {
idx: usize, // Index into current line
len: usize, // Lenght in bytes
width: usize, // Width in columns, including trailing whitespace
}
#[derive(Debug, Copy, Clone, PartialEq)]
pub struct Line {
idx: usize, // Index into text
len: usize, // Length in bytes
width: usize, // Width in columns
}
The two width
fields could then be T
instead.
Sounds like a good approach!
I’m not quite sure about the parametrization. If you don’t want to have the length type as a type parameter, does that mean you’d perform all internal calculations with a f64
? This could be problematic because generally speaking, you can’t convert a usize
to a f64
. So using a type parameter (maybe with num-traits
, if required) might be the easier solution.
That works because all word splitters have the same API, so I don't know if the same technique will work smoothly when the closures differ in return type?
I don’t think that works. But a set_width_provider
method could accept a F: Fn(...) -> I, I: Into<f64>
, wrap it in a closure that performs the conversion and then store that closure in a Box<dyn Fn(...) -> f64>
(or whatever type textwrap
uses internally).
Also, I noticed that there is an additional problem besides supporting non-usize widths and custom width calculations: When dealing with rich text, the string width also depends on the formatting (font family, font style, font size, ...). There might be different styles within a paragraph. So there should be an option to wrap arbitrary types that implement a Wrappable
(?) trait (or, more precisely, a vector/slice/iterator of such elements). Do you think this could be realized within textwrap
?
I’m not quite sure about the parametrization. If you don’t want to have the length type as a type parameter, does that mean you’d perform all internal calculations with a
f64
? This could be problematic because generally speaking, you can’t convert ausize
to af64
.
Right, that would be troublesome. I think it's fine to have Options<T>
. The minor problem I was talking about is that you wouldn't be able to do something like this:
// Simple ASCII-only width provider, returns u16:
let options = Options::new(80).set_width_provider(|word| word.len() as u16);
...
// Fancy width provider, returns f64:
options.set_width_provider(|word| some_font_lib::measure_text(word).width());
since options
would be "locked" to Options<u16>
after the first line. A similar thing happened (happens) with the Wrapper<T: WordSplitter>
API — which I got rid of in #213. However, I don't think this is a real problem since I expect people to stick to one type for their width measurements.
So let's say it's fine to parameterize WrapOptions
. I think it could be an associated type which we can use in a method called, say, measure
:
pub trait WrapOptions {
type Width;
fn measure<'w>(&self, word: &'w str) -> Self::Width;
}
When dealing with rich text, the string width also depends on the formatting (font family, font style, font size, ...). There might be different styles within a paragraph. So there should be an option to wrap arbitrary types that implement a
Wrappable
(?) trait (or, more precisely, a vector/slice/iterator of such elements).
Yes, that's a great point! My experimental code is exactly trying to abstract away such things by only manipulating words and lines. Putting it behind a trait sounds like just right thing to do.
So far I'm structuring the code so that a first pass finds the words and a second pass creates lines as necessary. So the "words" could also come from somewhere external. I'll try to put it up as soon as possible.
Sounds good! Let me know if I can help implementing or testing.
since
options
would be "locked" toOptions<u16>
after the first line. A similar thing happened (happens) with theWrapper<T: WordSplitter>
API — which I got rid of in #213. However, I don't think this is a real problem since I expect people to stick to one type for their width measurements.
If you want to support this use case, it should be possible to add a method that swaps the width provider and keeps the rest of the options, like in this simplified version:
struct Options<T> {
width_provider: T,
other_options: usize,
}
impl<T> Options<T> {
fn with_width_provider<S>(self, width_provider: S) -> Options<S> {
Options {
width_provider,
other_options: self.other_options,
}
}
}
(Unfortunately, Rust does not recognize that width_provider
is the only field that depends on T
and does not let us use the update syntax.)
As mentioned in #128 (comment), I could imagine expanding the
WrapOptions
trait with awidth(&str) -> T
method, which would measure a single word, or rather a grapheme, since I want to start using graphemes as the unit when breaking apart long words.
I think it would be better to measure words instead of graphemes so that kerning can be taken into account.
Hi @robinkrahl, I've merged the rewrite into master
, please see #221.
The main change is that we're now wrapping abstract Fragment
s instead of iterating through a string char
by char
. Right now, each fragment has an usize
width, but I want to experiment with making this a type parameter so that one could have fragments with widths measured in f32
, u16,
and whatnot...
You'll find the new building blocks in src/core.rs
. The wrapping itself is in wrap_fragments
and it's very simple: it just accumulates fragments until the line is full. While the wrapping itself is using the Fragment
trait, the full wrap
function uses a Word
struct which has more methods — in particular, words know how to split themselves into smaller parts to facilitate hyphenation and to break words longer than the current line width.
Does this seem useful? I would be happy to move more functionality from Word
into Fragment
in order to make this whole thing more generic and reusable.
Thanks @mgeisler, and sorry that it took me so long to get back to you. The wrap_fragments method looks very useful (and I like the house construction example :)).
I’ve updated genpdf to use textwrap, see src/wrap.rs on the textwrap branch [0], and it’s working really well.
[0] https://git.sr.ht/~ireas/genpdf-rs/tree/textwrap/src/wrap.rs
Currently, I’m multiplying all widths (in mm) by 1000 to work around the usize limitation. Of course it would be great if I could use f64 (or even my Mm type) directly.
I have two suggestions regarding the API design:
Currently, I have to compute the segmentation for all words. It would be great if the Fragment trait had a split method instead, e. g.:
/// Tries to split this fragment into two parts so that the width of
/// first fragment is equals to or less than the given width.
fn split(&self, width: usize) -> (Self, Option
This would even improve precision if kerning occurs at a segment border.
For my use case, I’d rather wrap line by line instead of wrapping all lines at once. The text may not fit on one page (or text area, generally speaking), and I don’t have any information about the dimensions of the next area. Maybe the wrapping could be moved into a separate function that handles a single line?
/// Wraps the text to the given width and returns the first line
/// and the remainder.
fn wrap_fragment_line
This could then be used internally by wrap_fragments.
What do you think?
Hey @robinkrahl,
Wow, it's really cool to see that the simple trait seems to work okay! I'm glad you liked the house construction example :-)
Currently, I’m multiplying all widths (in mm) by 1000 to work around the usize limitation. Of course it would be great if I could use f64 (or even my Mm type) directly.
Yes, I agree — I'll definitely like to lift this limitation. I think textwrap only needs a type which implements Add
and PartialOrd
. We would also need a zero value for the type, but I think that's about it.
Currently, I have to compute the segmentation for all words. It would be great if the Fragment trait had a
split
method instead, e. g.:/// Tries to split this fragment into two parts so that the width of /// first fragment is equals to or less than the given width. fn split(&self, width: usize) -> (Self, Option<Self>);
Interesting idea... right now, textwrap goes through a few steps:
So step 1 and 2 don't use the line width and they don't even need to compute the widths of the words (fragments) they produce. Step 3 and 4 need to know the widths of the words and they both use the line width.
So when would the split
method above be called? The "pipeline" above is not set in stone, of course, it just seemed to be most flexible one I've come up with so far.
The pipeline also lends itself to more advanced wrapping algorithms, such as the one I'm introducing in #234. In that PR, I introduce an algorithm which needs all potential break points up-front — which is not the case for the existing "greedy" algorithm. Ideally, I would be able to support both so that the simple algorithm stays lean and fast, while the advanced algorithm remains feasible.
For my use case, I’d rather wrap line by line instead of wrapping all lines at once. The text may not fit on one page (or text area, generally speaking), and I don’t have any information about the dimensions of the next area. Maybe the wrapping could be moved into a separate function that handles a single line?
/// Wraps the text to the given width and returns the first line /// and the remainder. fn wrap_fragment_line<T: Fragment>( fragments: &[T], line_width: usize, ) -> (&[T], &[T]);
This could then be used internally by
wrap_fragments
.
Okay, and your library could then call wrap_fragment_line
repeatedly, each time supplying a different value for line_width
?
The current wrap_fragments
function takes a callback which maps line numbers to line widths:
pub fn wrap_fragments<T: Fragment, F: Fn(usize) -> usize>(
fragments: &[T],
line_widths: F,
) -> Vec<&[T]>
However, I guess using the function would require you to pre-compute the available line width for the upcoming lines — and it sounds like you don't have that information? The thinking was that if you know the distance between subsequent baselines and if you know where, say, a figure is placed, then you could compute that lines 1-4 has width 18 cm whereas lines 5-10 have a width of just 13 cm. It sounds kind of complicated, though!
Could you perhaps wrap the entire paragraph at once using the available information about the line width? You get, say, 10 lines back and you start adding them to the page. When you see that you have no more space, then you note how many words (fragments) you output and you then re-wrap the remaining fragments with the new line width for the second page. Yes, this would re-wrap some lines, but since it takes in the order of 100-200 microseconds to wrap even large paragraphs, this might be feasible?
So when would the
split
method above be called? The "pipeline" above is not set in stone, of course, it just seemed to be most flexible one I've come up with so far.
I was thinking of step 4, but taking into account that other algorithms need to know all possible line breaks, it’s probably best not to include such a method. The only real drawback is that kerning might be ignored, but that should not make a big difference.
Okay, and your library could then call
wrap_fragment_line
repeatedly, each time supplying a different value forline_width
?
Exactly. There are different reasons why it might be difficult to determine the line width in advance:
So I’m currently doing what you described – count the fragments that I already rendered and re-wrap the remaining fragments once the line width changes. Adding a separate wrap_fragment_line function seemed like an easy way to avoid unnecessary calculations. But I see that this will not be possible if textwrap uses a non-greedy wrapping algorithm.
I'll close this for now — as demonstrated by @robinkrahl, it is now somewhat possible to use the wrapping code with Fragment
s that are not just pure text.
The problems with overflow in the OptimalFit
code (#247) is something we can look at separately. We'll also work towards further untangling the code in wrap
so that we can extract more reusable parts and make it easy for programs to wrap any kind of Fragment
that they like.
Please take a look at #310 for an example of how one can use Textwrap to wrap text on a HTML canvas.
... it's nothing fancy, to be honest, and it uses the "trick" of multiplying the widths by a constant to map f64
to usize
in the fragments.
Hello, this is a feature request.
It could be really useful to be able to use this library outside of the context of the CLI. I'm thinking about Piston text, that provides rendered characters with their widths depending on font-size, TrueType Font, etc... I just browsed the source code, and I believe supporting it would only be a matter of allowing to insert custom sizes for characters, but I am not sure as I obviously do not grasp all of the big picture. It would also require to handle floating point numbers in some way, as maybe the size could not be an integer. If you believe it's a good idea and is easy enough, I could do it myself.
Thanks in advance!