py-pdf / fpdf2

Simple PDF generation for Python
https://py-pdf.github.io/fpdf2/
GNU Lesser General Public License v3.0

Proposal for Generic Robust Substitution Mechanism #1202

Open gmischler opened 2 weeks ago

gmischler commented 2 weeks ago

Background: The processing model of fpdf2, which is pretty much "write any user input to the output stream immediately", makes it difficult and often nearly impossible to dynamically adapt to the characteristics (e.g. size) of the input data in many situations.

We currently have one formal substitution mechanism, which allows using "{nb}" to insert the total number of pages before it is known. This approach is inherently problematic, first because it may conflict with a possible intention to render the same character sequence on the page, and also because it conflicts with text shaping.

There's another possible use case for late value substitution: For #678 and #1154, a solution might be to wrap the page content in a transformation (move or rotation), where the actual parameters of that transformation only become known once the page is complete.

Other use cases for similar substitutions may come up with time.

Solution: The robustness could most easily be improved by replacing the explicit string "{nb}" in the output stream with a sequence containing noncharacters. These are special Unicode code points for private and strictly internal use, which means they should never be shared or transferred between different software packages. This makes them safe for use as conflict-free substitution markers.

Note that by the Unicode standard we should not accept such markers from client software, as the noncharacters are strictly for internal use. So for more generic user interaction we need to define a hierarchy of token classes that allows specifying the type and size of the values to substitute.
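
For reference, the set of noncharacters is small and fixed: U+FDD0 to U+FDEF, plus the last two code points of each of the 17 planes. A minimal sketch of how input could be checked for them (the function names are made up for illustration and are not part of fpdf2):

```python
def is_noncharacter(code_point: int) -> bool:
    """Return True if code_point is one of the 66 Unicode noncharacters."""
    # Contiguous block of 32 noncharacters, U+FDD0..U+FDEF.
    if 0xFDD0 <= code_point <= 0xFDEF:
        return True
    # The last two code points of every plane: U+FFFE, U+FFFF, U+1FFFE, ...
    return 0 <= code_point <= 0x10FFFF and (code_point & 0xFFFE) == 0xFFFE


def contains_noncharacter(text: str) -> bool:
    """True if any character of text is a noncharacter."""
    return any(is_noncharacter(ord(ch)) for ch in text)
```

A check like `contains_noncharacter()` could be used to reject user-supplied text that already contains marker characters.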

    SubstitutionToken
        IntSubToken
            TotalPagesToken (predefined singleton)
        FloatSubToken
        TextSubToken
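
A rough sketch of what this hierarchy might look like in code. None of these classes exist in fpdf2 yet; the attribute and method names (`key`, `width`, `set_value()`, `convert()`) are assumptions made for illustration only:

```python
import itertools


class SubstitutionToken:
    """Base class: holds a unique key and a value that is set later."""

    _keys = itertools.count(1)  # shared source of automatically generated keys

    def __init__(self, width=None):
        self.key = next(self._keys)
        self.width = width  # space to reserve when rendered as text (optional)
        self._value = None

    def convert(self, value):
        raise NotImplementedError

    def set_value(self, value):
        self._value = self.convert(value)

    @property
    def value(self):
        # A token that was used but never given a value is treated as an error.
        if self._value is None:
            raise ValueError(f"substitution token {self.key} has no value")
        return self._value


class IntSubToken(SubstitutionToken):
    def convert(self, value):
        return int(value)


class FloatSubToken(SubstitutionToken):
    def convert(self, value):
        return float(value)


class TextSubToken(SubstitutionToken):
    def convert(self, value):
        return str(value)


# Predefined singleton; the library would set its value at output time.
TotalPagesToken = IntSubToken()
```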

When used for rendering as text, those tokens get converted into a special type of Fragment() (they could even be derived from Fragment() themselves). Their important properties are a unique key (automatically generated) to distinguish between intended substitution targets, and a width (likely in pica), which will be used by _render_styled_text_line() to determine where to continue with the following text. Those substitution Fragment()s will be ignored by text shaping. This will make some substitution results slightly less pretty, and tokens can't be substituted by text using a complex script, but that should be an acceptable limitation. Most likely the substitution Fragment()s should also be considered "unbreakable". Since we don't know their actual content yet when doing the line wrapping, there's really no other way to handle them at that point.

They get written into the output stream in a form like this, for example:

    marker_pattern = f"\uFDD0{key:d}\uFDD1"
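
To illustrate the late replacement step, here is a minimal sketch that builds markers of this shape and swaps them for their final values. It works on a plain `str` for simplicity, while the real output stream consists of encoded bytes; `make_marker()` and `substitute_markers()` are hypothetical helpers, not existing fpdf2 functions:

```python
import re

MARKER_START = "\uFDD0"  # noncharacters as used in the pattern above
MARKER_END = "\uFDD1"
MARKER_RE = re.compile(f"{MARKER_START}(\\d+){MARKER_END}")


def make_marker(key: int) -> str:
    """Build the placeholder that is written instead of the unknown value."""
    return f"{MARKER_START}{key:d}{MARKER_END}"


def substitute_markers(stream: str, values: dict) -> str:
    """Replace every marker with the value stored under its integer key."""

    def lookup(match):
        key = int(match.group(1))
        if key not in values:
            raise KeyError(f"no value set for substitution key {key}")
        return str(values[key])

    return MARKER_RE.sub(lookup, stream)
```

Since the markers can't collide with user text, a literal "{nb}" elsewhere in the stream is left alone.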

FPDF.write() and the text regions could be combined with special methods like .insert_total_pages(width="3em").

For backwards compatibility with the (then deprecated) use of "{nb}", the text parsing routines could replace that string with an appropriate subtype of Fragment(), to be written as a marker as shown.
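
A minimal sketch of that parsing step, assuming the text is split into plain parts and "total pages" placeholders before any fragments are created. The function name and the tuple representation are invented for illustration; real code would produce Fragment() instances instead:

```python
def split_on_alias(text: str, alias: str = "{nb}") -> list:
    """Split text into ("text", str) parts and ("total_pages", None) placeholders."""
    parts = []
    for index, chunk in enumerate(text.split(alias)):
        if index > 0:
            # Every boundary between two chunks is one occurrence of the alias.
            parts.append(("total_pages", None))
        if chunk:
            parts.append(("text", chunk))
    return parts
```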

In all text input methods, maybe we can allow the user to use their own "{format}" keys in the text, with our methods accepting a dict of key/token pairs, which they will then convert internally into the appropriate marker strings.

    my_text = "page {current_pageno} of {my_total_pages}"
    pageno_token = IntSubToken(width="3em")
    pdf.cell(text=my_text, substitute=dict(current_pageno=pageno_token, my_total_pages=TotalPagesToken))
    # add some other stuff
    pageno_token.set_value(42)
    # TotalPagesToken may get automatically updated
    pdf.output("substitution_demo.pdf")
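
Internally, the conversion from "{format}" keys to marker strings could be as simple as the following sketch. It assumes that each token exposes an integer key (passed in directly here), and that names without a token should be left untouched. `insert_markers()` is a hypothetical helper, not part of fpdf2:

```python
class _KeepUnknown(dict):
    """Leave "{name}" as it is when no token was supplied for that name."""

    def __missing__(self, name):
        return "{" + name + "}"


def insert_markers(text: str, substitute: dict) -> str:
    """Replace "{name}" by a marker; substitute maps names to integer token keys."""
    markers = _KeepUnknown(
        (name, f"\uFDD0{key:d}\uFDD1") for name, key in substitute.items()
    )
    # format_map() looks up each field name in the mapping given.
    return text.format_map(markers)
```

Relying on `str.format_map()` keeps the sketch short, but it also means that positional fields like "{}" are rejected, so a real implementation would probably need its own parser.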

Many other FPDF methods may accept substitution tokens in place of explicit values. E.g. a transformation may accept an instance of a float-type token in place of an actual float. Before writing the file, the user must then update their copy of the token with the correct value, which will cause its marker to be replaced in the output. Forgetting to set the value of a used token is an error.

    y_move_token = FloatSubToken()
    with pdf.move(x=0, y=y_move_token): # analogous to skew(), rotation(), etc.
        ...  # create content
    remaining_y = pdf.eph - pdf.y
    y_move_token.set_value(remaining_y)
    pdf.pages[pdf.page].set_dimensions(pdf.w_pt, (pdf.y + pdf.t_margin)*pdf.k)
    pdf.output("substitution_demo.pdf")

Sorry I couldn't come up with a simpler solution, but there are many different constraints in the different phases of processing, all of which need to be addressed.

Any better ideas? 💡 Any takers?

andersonhc commented 2 weeks ago

I had already started working on refactoring the alias code to fix #1090. I just submitted the PR. I believe it is one step closer to your vision for the substitution mechanism.