jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

For text extraction, add fractional versions of `x/y_tolerance` arguments #987

Open jsvine opened 10 months ago

jsvine commented 10 months ago

Currently, x_tolerance and y_tolerance are treated as numeric constants. But, as @Sarke points out in https://github.com/jsvine/pdfplumber/issues/606#issuecomment-1703456276, it could be useful to provide a "fractional" version of these arguments:

[...] ideally we would be able to set the tolerance as a fraction of the font-size, since both the words spacing and the line spacing usually change proportionally with the change in font-size.

Implementing this correctly might be tricky, as x/y_tolerance are passed across a few methods, but it should be doable. Some other things to sort out:

Any other questions or complications I may be overlooking?

afriedman412 commented 8 months ago

is anyone working on this? I'd like to take a crack at it if not!

jsvine commented 8 months ago

@afriedman412 I'm not aware of anyone actively working on this, thanks for checking — and thanks for offering! Would be wonderful if you took a crack at it.

afriedman412 commented 8 months ago

great -- do you have a pdf with extra-tight letters I can use for testing?

(the pdf in the original issue worked fine with x_tolerance=1 sooo....)

jsvine commented 8 months ago

@afriedman412 How about something like this?: issue-987-test.pdf

import pdfplumber

pdf = pdfplumber.open("issue-987-test.pdf")
page = pdf.pages[0]

for x in [ 0, 3, 10 ]:
    print(f"--- x_tolerance = {x} ---")
    print(page.extract_text(x_tolerance=x))
    print("")

... outputs this:

--- x_tolerance = 0 ---
Big Te xt
Small Te xt

--- x_tolerance = 3 ---
Big Te xt
Small Text

--- x_tolerance = 10 ---
Big Text
SmallText

afriedman412 commented 8 months ago

sorry im confused -- what do we want it to output?

jsvine commented 8 months ago

Ah, my apologies for not being more explicit. Ideally, the proportional tolerance feature would make it possible to get this back:

Big Text
Small Text

The examples above show (or try to show) that non-proportional tolerances either under-condense the big text or over-condense the small text.
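To make that trade-off concrete, here is a minimal sketch that uses synthetic character dicts rather than pdfplumber itself. The `make_line` and `group_words` helpers, the coordinates, and the 0.15 ratio are all illustrative assumptions; they are not pdfplumber's API, and the numbers are not measured from the test PDF. The gaps were simply chosen so that the fixed tolerances behave the same way as in the output above.

```python
def make_line(pieces, size):
    """Build synthetic chars from (text, gap_before) pieces.

    Every char is size / 2 wide; chars inside a piece touch each other.
    """
    chars, x = [], 0.0
    for text, gap_before in pieces:
        x += gap_before
        for ch in text:
            chars.append({"text": ch, "x0": x, "x1": x + size / 2, "size": size})
            x += size / 2
    return chars


def group_words(chars, tolerance_for):
    """Join chars left to right, starting a new word whenever the gap to
    the previous char exceeds tolerance_for(previous_char)."""
    words, current = [], chars[0]["text"]
    for prev, char in zip(chars, chars[1:]):
        if char["x0"] - prev["x1"] > tolerance_for(prev):
            words.append(current)
            current = char["text"]
        else:
            current += char["text"]
    words.append(current)
    return " ".join(words)


# Hypothetical layout: the word gap and the stray gap inside "Text"
# both grow with the font size.
big = make_line([("Big", 0), ("Te", 16), ("xt", 4)], size=40)
small = make_line([("Small", 0), ("Te", 4), ("xt", 1)], size=10)

for tol in (0, 3, 10):
    print(f"fixed {tol}:", group_words(big, lambda c: tol), "/",
          group_words(small, lambda c: tol))

# A tolerance proportional to the previous char's size handles both lines.
ratio = 0.15
print("ratio:", group_words(big, lambda c: ratio * c["size"]), "/",
      group_words(small, lambda c: ratio * c["size"]))
```

In this toy setup no single fixed value recovers both lines, while one ratio does, because the gaps scale with the text.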

afriedman412 commented 8 months ago

is there a less dumb way to get text size than int(text['bottom'] - text['top'])?

jsvine commented 8 months ago

Are you looking at char objects, or something else (which I might infer from the variable being named text)? If char objects:

I'd have to check more carefully, but I believe those two values should typically be the same.

afriedman412 commented 8 months ago

Honestly I'm lazy and couldn't find an easy way to extract char objects from text, so now it works with anything with a top and bottom param. The int isn't necessary -- I guess I assumed font sizes are always whole numbers?

def get_char_tolerance(t, x_tolerance):
    """Scales x_tolerance to font size (height of text)"""
    if "bottom" in t.keys() and "top" in t.keys():
        return int(t['bottom'] - t['top'])*x_tolerance
    else:
        raise KeyError("Couldn't get height of text for x_tolerance scaling.")

Anyways, fractional x_tolerance mostly works now! The default value is 0.15, because "issue-987-test.pdf" rendered both text sizes correctly between 0.06 and 0.2, and it passed the unit test, and I like round numbers.
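Since the `int()` isn't necessary, here is a hedged sketch of what dropping it changes. Both function names are hypothetical and the input dicts are made up. The only assumption carried over from the function above is that the object has `top` and `bottom` keys.

```python
def truncated_x_tolerance(t, ratio):
    """Same arithmetic as the function above: height is truncated to an int."""
    return int(t["bottom"] - t["top"]) * ratio


def scaled_x_tolerance(t, ratio):
    """Variant without truncation, so fractional heights are kept."""
    try:
        height = t["bottom"] - t["top"]
    except KeyError as exc:
        raise KeyError("Couldn't get height of text for x_tolerance scaling.") from exc
    return height * ratio


# Made-up objects: a 9.5pt-tall char and a very small 0.75pt-tall one.
regular = {"top": 100.0, "bottom": 109.5}
tiny = {"top": 10.0, "bottom": 10.75}

for obj in (regular, tiny):
    print(truncated_x_tolerance(obj, 0.15), scaled_x_tolerance(obj, 0.15))
```

The difference is small for ordinary sizes, but for any height below 1 the truncated version collapses the tolerance to zero, which is probably not what a ratio is meant to do.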

Some questions:

jsvine commented 8 months ago

Thanks, @afriedman412! I'll address your specific questions below, but first this seems like a good opportunity for me to sketch out a bit more about how I see this working:

As a matter of actual implementation, things get tricky, as these tolerances are used in several parts of pdfplumber:

... and possibly a few other places, not to mention where these utility functions are integrated into the Page.extract_... methods. This is not to discourage you from implementing this, but rather just a note that there are some tricky bits.

Now on to the specific questions:

Where are the table-extraction values set?

I think here you're asking about the default parameter values for the table-extraction methods? If so, you can find the table-specific ones at the top of this file: https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/table.py

... while the general text extraction defaults are set at the top of this file: https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/utils/text.py

Would it make sense to define all the text defaults in one place? (DEFAULT_X_TOLERANCE is currently just at the top of text.py.)

I'd prefer to keep as-is.

How flexible does this param need to be? I was thinking about scaling it so the default value is 1, but I'm not sure what the range would be

See above, which I believe should answer that question, but let me know if not.

I also kind of feel like making the number a decimal implies scaling, whereas a whole number implies total units

I think we should give users the option to specify these ratios as any number, to give them as much precision as they'd like.

That said, do we want to leave the option in for explicit tolerance?

See above; I think the explicit tolerance should remain the default, both for backward-compatibility's sake and for predictability's sake (the explicit value does not depend on character order, which I believe the ratio version will).

Happy to clarify any of the above and to answer any follow-up Qs! And thanks again!
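One way to read the "explicit tolerance stays the default" point is sketched below. This is only an illustration of the parameter precedence, not the library's implementation: the `effective_x_tolerance` helper is hypothetical, the value 3 for `DEFAULT_X_TOLERANCE` is an assumption here, and the sketch assumes each char dict carries a `size` key.

```python
DEFAULT_X_TOLERANCE = 3  # assumed value, standing in for the constant in text.py


def effective_x_tolerance(char, x_tolerance=DEFAULT_X_TOLERANCE,
                          x_tolerance_ratio=None):
    """Return the tolerance to use for the gap after `char`.

    The explicit tolerance applies unless the caller opts in to a ratio,
    so existing calls keep their current behaviour.
    """
    if x_tolerance_ratio is None:
        return x_tolerance
    # Any number is accepted as a ratio; it is scaled by the char's size.
    return x_tolerance_ratio * char["size"]


char = {"text": "T", "size": 40}  # made-up char
print(effective_x_tolerance(char))                          # explicit default
print(effective_x_tolerance(char, x_tolerance=1))           # explicit override
print(effective_x_tolerance(char, x_tolerance_ratio=0.25))  # opt-in ratio
```

Keeping the ratio as a separate, optional argument avoids guessing from the number itself whether a caller meant "units" or "fraction of font size".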

afriedman412 commented 8 months ago

thanks for all this

my big question is do we want to make calculating tolerances dynamic?

like right now my approach is basically just using the size of first character available to calculate the tolerance. if we are iterating through the lines on a page with cluster_objects we could theoretically adjust ratios on the fly...

jsvine commented 8 months ago

my big question is do we want to make calculating tolerances dynamic? like right now my approach is basically just using the size of first character available to calculate the tolerance.

Good question. At the very least, we want the tolerances to be dynamic between lines.

I think the answer is less clear within lines. The argument in favor of calculating tolerances on a per-character (i.e., within line) basis would be more flexibility for lines containing text of different sizes. Not necessarily the most common occurrence, but a possibility. The argument against would be greater complexity and a (small, probably negligible) performance hit. I'd say let's start experimenting / prototyping without that, and then see how much of a hassle it'd be to add it.
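A rough sketch of the two options on a single mixed-size line, using made-up char dicts. The `split_words` helper and its `per_char` flag are hypothetical names for illustration, and the per-character variant assumes the tolerance is taken from the previous char's size, which is only one of several possible choices.

```python
def split_words(chars, ratio, per_char=False):
    """Split one line of chars into words.

    per_char=False: one tolerance for the whole line, from the first char.
    per_char=True:  tolerance recomputed from the char before each gap.
    """
    line_tolerance = ratio * chars[0]["size"]
    words, current = [], chars[0]["text"]
    for prev, char in zip(chars, chars[1:]):
        tolerance = ratio * prev["size"] if per_char else line_tolerance
        if char["x0"] - prev["x1"] > tolerance:
            words.append(current)
            current = char["text"]
        else:
            current += char["text"]
    words.append(current)
    return words


# Made-up line: "Hi" at size 40, then "a" and "b" at size 10, 4 units apart.
line = [
    {"text": "H", "x0": 0, "x1": 20, "size": 40},
    {"text": "i", "x0": 20, "x1": 40, "size": 40},
    {"text": "a", "x0": 56, "x1": 61, "size": 10},
    {"text": "b", "x0": 65, "x1": 70, "size": 10},
]

print(split_words(line, ratio=0.15))                 # ['Hi', 'ab']
print(split_words(line, ratio=0.15, per_char=True))  # ['Hi', 'a', 'b']
```

With a per-line tolerance the large first character sets a tolerance of 6, which swallows the 4-unit gap between the small words. The per-character version keeps them apart, at the cost of one extra multiplication per gap and a result that depends on which neighbouring char is consulted.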

afriedman412 commented 8 months ago

are we sure we need y_tolerance_ratio? off top it feels like line spacing is much less dependent on font size...

I'm going to implement x_tolerance first and we can go from there

jsvine commented 8 months ago

I think that's a reasonable (and smartly constrained) place to start. I think y_tolerance_ratio would still be helpful, but x_tolerance_ratio would certainly still be useful on its own.