python-pillow / Pillow

Python Imaging Library (Fork)
https://python-pillow.org

Automatic text wrapping/text box filling #6201

Open reticivis-net opened 2 years ago

reticivis-net commented 2 years ago

many people have written scripts to do this and it's relatively easy with the library's getsize func but I feel like it really should be a built-in feature.

i think there should be a draw_text_box or similar function which has these properties on top of the existing draw_multiline_text:

I'm not very experienced with the library internals or C in general so for now I won't make a pull, I just want to throw the idea out there to the devs

reticivis-net commented 2 years ago

also options to align text left, right, center, or justify would be useful as well. This script implements that

and top/middle/bottom text alignment. another example script

nulano commented 2 years ago

Sounds like a reasonable enhancement request. Note that there might be issues with mixed LTR/RTL text which will need extra tests.

A few notes:

with the library's getsize func

Please use the getlength function instead. The API of the getsize function is fundamentally broken (especially with non-English text) and shouldn't be used for text layout purposes.

I think there should be an option to only break at word boundaries unless the word exceeds the max width, like the CSS word-break property

Word breaking is quite a difficult task, I'd suggest to constrain this to spaces to start with.
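A minimal sketch of space-only wrapping, assuming a greedy strategy and a caller-supplied `measure` callable that returns the width of a string. The name `wrap_at_spaces` and its parameters are hypothetical and not part of Pillow. A word that is wider than the limit on its own is split between characters.

```python
from typing import Callable, List


def wrap_at_spaces(text: str, max_width: float,
                   measure: Callable[[str], float]) -> List[str]:
    """Greedily wrap `text` at spaces so each line measures <= max_width.

    A single word wider than max_width is split between characters.
    """
    lines: List[str] = []
    current = ""
    for word in text.split(" "):
        candidate = word if not current else current + " " + word
        if measure(candidate) <= max_width:
            current = candidate
            continue
        # The word does not fit on the current line: close that line first.
        if current:
            lines.append(current)
            current = ""
        # The word alone may still be too wide, so add it character by
        # character and break whenever the limit would be exceeded.
        for ch in word:
            if current and measure(current + ch) > max_width:
                lines.append(current)
                current = ch
            else:
                current += ch
    if current:
        lines.append(current)
    return lines
```

With Pillow, `measure` could be the `getlength` method of a font object, for example `wrap_at_spaces(text, 300, font.getlength)`, and the resulting lines joined with "\n" before drawing.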

when reaching the max height, the font size is gradually reduced until the text fits inside

Not possible within the current API, a font is created at a given size and cannot be easily changed.

also options to align text left, right, center

Already possible using the align parameter, only justify is not yet supported. It would also require extra work in the new proposed function, so I see that as a separate request.

and top/middle/bottom text alignment

Already possible, use the anchor parameter with multiline text.

reticivis-net commented 2 years ago

Please use the getlength function instead. The API of the getsize function is fundamentally broken (especially with non-English text) and shouldn't be used for text layout purposes.

oh, good to know. probably should add that to the docs?

Word breaking is quite a difficult task, it might be better to constrain this to spaces to start with.

right, yeah, I forgot that CJK and other languages don't have definite characters at word boundaries. I just meant breaking at whitespace, which as I understand it shouldn't be too hard and might actually be faster than breaking at individual characters. I'd also add support for zero-width spaces, so that an external library or a native speaker can mark word boundaries for Pillow, assuming proper word breaking is a non-trivial task.

Not possible within the current API, a font is created at a given size and cannot be easily changed.

ah so that's why the script I linked loads from file every change. is there an existing way to regenerate fonts without reloading from file or would that need to be an entire API change?

thanks for the quick and detailed response!

nulano commented 2 years ago

oh, good to know. probably should add that to the docs?

I think it will be deprecated soon, it's just a matter of working out the replacement (font.getsize_multiline doesn't have a clear replacement, that might be made easier by cleaning up the parameters as suggested in https://github.com/python-pillow/Pillow/pull/6195#discussion_r847410876). Discussed in #5816.

i'd add support for zero-width spaces

That part was just a suggestion to avoid overcomplicating things. Sure, zero-width spaces can probably be supported.

is there an existing way to regenerate fonts without reloading from file or would that need to be an entire API change?

I think you can load a font file in Python and then pass the bytes as input to ImageFont.truetype.
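A rough sketch of that idea, assuming the cached bytes are wrapped in `io.BytesIO` so that `ImageFont.truetype` receives a file-like object. The helper names `font_factory` and `largest_fitting_size`, and the `fits` callback, are hypothetical. The font file is read from disk once, and a new font object is built from memory for each candidate size.

```python
import io
from typing import Callable, Optional


def font_factory(path: str) -> Callable[[int], object]:
    """Read the font file once and return a function that builds it at any size."""
    with open(path, "rb") as f:
        data = f.read()

    def make(size: int):
        # Imported lazily so the size search below has no Pillow dependency.
        from PIL import ImageFont
        # ImageFont.truetype accepts a file-like object; wrap the cached bytes.
        return ImageFont.truetype(io.BytesIO(data), size)

    return make


def largest_fitting_size(start: int, minimum: int,
                         fits: Callable[[int], bool]) -> Optional[int]:
    """Step the size down from `start` to `minimum`; return the first size that fits."""
    for size in range(start, minimum - 1, -1):
        if fits(size):
            return size
    return None
```

Here `fits` would be a caller-supplied check, for example one that builds the font at the given size, wraps the text, and compares the result against the box height. A binary search over sizes would need fewer font constructions, at the cost of assuming that fit is monotonic in size.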

reticivis-net commented 2 years ago

I see. Thank you!

atomicparade commented 2 years ago

I've made a first attempt at implementing this, using a greedy algorithm:

https://github.com/atomicparade/pil_autowrap/blob/main/pil_autowrap/pil_autowrap.py#L73-L220

Example output here.

Issues:

Current blind spots and possible improvements:

nulano commented 2 years ago

I don't know how appropriate the results are for Arabic and Hebrew. Chinese, Japanese, and Korean text is not broken up properly.

If you are referring to this:

Certain characters in those languages should not come at the end of a line, certain characters should not come at the start of a line, and some characters should never be split up across two lines. For example, periods and closing parentheses are not allowed to start a line

then I would not worry about it. Similar rules exist in some European languages and even MS Word doesn't really help there.


I made the assumption that the line height is equal to the font size; however, looking at some of the generated images for Arabic and Hebrew, this doesn't appear to be the case. Maybe FreeTypeFont.getbbox would be more appropriate than FreeTypeFont.getlength?

The text height is calculated here: https://github.com/python-pillow/Pillow/blob/134023796e935ef79d5feb6879e9270327cfb8a2/src/PIL/ImageDraw.py#L514-L516 where spacing is a parameter defaulting to 4. This is not really accurate for some fonts, but it is used for historical reasons.

Do not use getbbox. That returns the height of the rendered text (which could be different for each line) and width of the rendered text (again, can be different with e.g. slanted text). It is not appropriate for text layout. Fonts generally don't exceed the line height and layout width they report, or only do so by a small amount when appropriate for stylistic reasons. (The height calculated above is not the actual line height reported by the font, but should be close enough in most cases).
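For layout purposes, the height of a text block can be estimated from a fixed per-line advance instead of per-line bounding boxes. The sketch below assumes each line advances by one reference height plus `spacing`, with no spacing after the last line; this is an assumption about the shape of the calculation, not a copy of the linked lines. `block_height` and `reference_height` are hypothetical names.

```python
def block_height(line_count: int, reference_height: float,
                 spacing: float = 4) -> float:
    """Estimate the height of `line_count` lines using a fixed line advance."""
    if line_count <= 0:
        return 0
    # Every line advances by the same amount, regardless of its glyphs.
    line_advance = reference_height + spacing
    # No extra spacing is added below the final line.
    return line_count * line_advance - spacing
```

Because the advance is the same for every line, the estimate does not change when one line happens to contain taller or shorter glyphs, which is the property a layout calculation needs.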

reticivis-net commented 2 years ago

I feel that getting it working with “easier” languages first (ones that use white space or other characters to break words) would be the best thing to do right now as CJK word-breaking seems like a non-trivial task that could be hacked in by adding zero-width spaces. Is there an existing library that can determine word boundaries that could be included by PIL as an extra?

nulano commented 2 years ago

I may have misunderstood the Wikipedia article. The Unicode Line Breaking Algorithm is more helpful.

I think that it is probably sufficient to implement the non-tailorable part of the algorithm (see start of Table 1), which is just that line break characters are a mandatory break and spaces/zero-width spaces are an optional break. According to LineData.txt, this means it is sufficient to consider replacing the SPACE (U+0020) and ZERO WIDTH SPACE (U+200B) characters with "\n". The rest of the Unicode Line Breaking Algorithm would probably be best left to another library (e.g. by inserting zero-width spaces).
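A small sketch of that reduced rule set, assuming "\n" is the only mandatory break character handled (the full algorithm recognises several more) and that SPACE and ZERO WIDTH SPACE are the only optional break points. The function name `break_opportunities` is hypothetical.

```python
from typing import List, Tuple

SPACE = "\u0020"
ZWSP = "\u200b"  # ZERO WIDTH SPACE


def break_opportunities(text: str) -> List[Tuple[int, str]]:
    """Return (index, kind) pairs, where index is the position of the break character."""
    result: List[Tuple[int, str]] = []
    for i, ch in enumerate(text):
        if ch == "\n":
            # An explicit newline always ends the line.
            result.append((i, "mandatory"))
        elif ch in (SPACE, ZWSP):
            # The line may be broken here by replacing the character with "\n".
            result.append((i, "optional"))
    return result
```

A wrapping routine would then pick, among the optional positions, the last one that still keeps the line within the maximum width. An external tool that knows where words end in a given language could insert zero-width spaces beforehand without changing this function.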

atomicparade commented 2 years ago

Is there an existing library that can determine word boundaries that could be included by PIL as an extra?

After a brief search, I couldn't find one that is freely available.

The Unicode Line Breaking Algorithm is more helpful.

I'm going to give this a shot! I'll start with Table 1 and leave all of the other character classes as break-allowed for now, though I think I'd like to try to implement the others as well.

reticivis-net commented 2 years ago

If you're going to implement the entire Unicode Line Breaking Algorithm, I recommend making it its own library. If it's really complex or requires a table of characters or something, it could be specified as a PIL extra so that it doesn't bloat PIL.

nulano commented 2 years ago

If you want to implement the full algorithm, it might make sense to add it to Raqm (which Pillow uses internally), or make it a separate library that Raqm can use. See https://github.com/HOST-Oman/libraqm/issues/50

requires a table of characters or something

The LineData.txt from Unicode I linked above is the official list of Unicode character line-breaking classes.

reticivis-net commented 2 years ago

I wasn’t familiar enough with PIL’s internals to suggest that but that is a good idea

nulano commented 2 years ago

Is there an existing library that can determine word boundaries that could be included by PIL as an extra?

After a brief search, I couldn't find one that is freely available.

The Raqm issue mentions https://github.com/adah1972/libunibreak. I haven't looked at it too closely, but it seems to be a C library implementing the Unicode algorithm that returns a list of valid break positions.

atomicparade commented 2 years ago

I am exploring adding a font_wraptext function (to start; would probably be nice to have a function to automatically determine an appropriate font [size] as well) to src/_imagingft.c and adding unibreak as a feature that depends on libunibreak being installed.

It does look like libunibreak maintains internal state (linebreak.c -> set_linebreaks_utf8), so I am not sure whether this will work well with multithreading. (Never mind! It doesn’t.)

Edit: Somehow I completely missed the part about adding this as a feature to libraqm itself. Hmm...

DinoSourcesRex commented 1 year ago

Any news on this? Would love this feature.

aclark4life commented 1 year ago

Looks like a stalled attempt at greatness, maybe someone can pick up the effort: https://github.com/atomicparade/pil_autowrap

quantumpotato commented 8 months ago

Bump for interest