Non-breaking space support in `multi_cell()` and `write_html()`

rysson commented 1 year ago

Error details

Incorrect support for non-breaking space (NBSP, hard-space, 0x00a0).

Non-breaking space prevents line breaking and it works fine.

Non-breaking space is not fixed-width at all. I know some old systems still use fixed-width NBSP but all modern browsers not.

NBSP (as Unicode character or   entity) should be variable width space (SpaceHint?) in the FPDF.write_html().

The same in the pdf.multi_cell(). It could be a compatibility problem if users use it as fixed-width space. Maybe extra option (fixed_width_nbsp=True)?

20230703-152836-821x60

Minimal code

from fpdf import FPDF

pdf = FPDF(orientation='P', format='A4')
pdf.add_page()
pdf.set_font('helvetica', size=10)
pdf.multi_cell(0, 5, '1\u00a02 aaaaaaaaaaaaaaaaaaaaaqweqwedqweqwe aaaaaaaaaaaaaaaaaaaaaqweqwedqweqweqweqweqweqweqweqwqweqw.</p>')

from fpdf import FPDF

pdf = FPDF(orientation='P', format='A4')
pdf.add_page()
pdf.set_font('helvetica', size=10)
pdf.write_html('<p align="justify">1&nbsp;2 aaaaaaaaaaaaaaaaaaaaaqweqwedqweqwe aaaaaaaaaaaaaaaaaaaaaqweqwedqweqweqweqweqweqweqweqwqweqw.</p>')

Environment Please provide the following information:

Operating System: Linux Debian Bookworm
Python version: 3.9.17, 3.11.4
fpdf2 version used: 2.7.4 (pip)

Also check on GitHub master branch.

rysson commented 1 year ago

Some hot-fix for both methods.
Without any extra arguments (to keep backward compatybility), I've found your awesome library today, I'm not able to make it in the correct way.

File line_break.py

HARD_SPACE = "\u00a0"

class Fragment:
    ...
    def add_character(...):
        ...
        if character == SPACE:
            ...
        elif character == HARD_SPACE:
            character = SPACE
            self.fragments.append(Fragment("", graphics_state, k, url))
            # changing active fragment is not necessary I guess,
            # space can be at end of the last fragment or first in the new fragment as well
            # active_fragment = self.fragments[-1]   # TODO: delete me
            self.number_of_spaces += 1
        elif character == SOFT_HYPHEN and not self.print_sh:
            ...

Lucas-C commented 1 year ago

@gmischler : as you have previously worked on this subject, could you have a look this please? 😊

rysson commented 1 year ago

Thanks.

I have some commit to test, see https://github.com/rysson/fpdf2/commit/db0625a6963faff9376ad29f11b54eabddb83a20

I added nbsp argument (default False) to FPDF class.

For string like 1␣2 aaaaa␣3 x␣aaaaa...

with nbsp=False (default):

20230703-165100-812x49

and with nbsp=True:

20230703-165122-814x41

I wanted to create PR, but git hooks failed (see #835).

gmischler commented 1 year ago

@rysson, it took me quite a while to figure out that you're actually talking about whitespace getting stretched when justifying text. I can conclude that from a single line of code, but explicitly explaining that at the start might have been helpful...

This appears to be a "feature" of how PDF viewers are applying the ## Tw declaration in the file only to normal space characters, and not other whitespace with builtin fonts. I haven't checked what the specification says about it, but it may well instruct them to do so.

If we didn't rely on Tw for builtin fonts, we could simply include the NBSP in TextLine.number_of_spaces and in the split pattern for word separation. But as it is, replacing it with a normal space looks like the simplest way to go. However, note that the relevant method is CurrentLine.add_character(), and not in Fragment(). Also, you must not touch self.fragments at that point. You're only substituting a character, not changing anything about the graphics context.

NBSP = "\u00a0"

class CurrentLine:
    ...
    def add_character(...):
        ...
        if character == SPACE:
            ...
        elif character == NBSP:
            # PDF viewers ignore NBSP for word spacing with "Tw".
            character = SPACE
            self.number_of_spaces += 1
        elif character == SOFT_HYPHEN and not self.print_sh:
            ...

Lucas-C commented 1 year ago

Thank you @gmischler for investigating this

rysson commented 1 year ago

@rysson, it took me quite a while to figure out that you're actually talking about whitespace getting stretched when justifying text [...]

It's why I told about fixed-width spaces. Sure, I should write it's important in justify, sorry.

This appears to be a "feature" of how PDF [...]

I'm totally newbie in PDF, I'm sorry.

If we didn't rely on Tw for builtin fonts, we could simply include the NBSP in TextLine.number_of_spaces and in the split pattern for word separation. But as it is, replacing it with a normal space looks like the simplest way to go. However, note that the relevant method is CurrentLine.add_character(), and not in Fragment(). Also, you must not touch self.fragments at that point. You're only substituting a character, not changing anything about the graphics context.

Great! Simpler and working :-) I found the awesome project yesterday. I was just guessing.

So, what next?

gmischler commented 1 year ago

So, what next?

just update your PR accordingly.

With regards to that, I recommend to stick with standard terminology (eg. NON_BREAK_SPACE or NBSP instead of HARD_SPACE). I also think we don't need an option to activate/deactivate this. Other than not triggering line breaks, a NBSP should indeed always behave exactly as a normal space. I can't think of a valid use case for disabling that. There are seperate fixed-width space-type characters available for when a different behaviour is desired. Without the option handling your PR will become a lot simpler, without cluttering the API. You should just add some test cases to verify that it really does what it is supposed to do and won't break with any other future modifications.

Lucas-C commented 1 year ago

I wanted to create PR, but git hooks failed (see https://github.com/PyFPDF/fpdf2/issues/835).

FYI @rysson, this should now be fixed 😊 And thank you for this PR!

rysson commented 1 year ago

Thanks, than I create new PR based on @gmischler notes.

py-pdf / fpdf2

Non-breaking space support in `multi_cell()` and `write_html()` #834