jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.02k stars 619 forks source link

update form-parsing example in README #958

Closed jeremybmerrill closed 11 months ago

jeremybmerrill commented 11 months ago

This form-parsing example handles form fields recursively contained within other form fields, removes the incorrect-assumption that field-names are unique and includes the alternate field name in output (which is often a very useful guide to what's in a field).

the prior form-parsing example used the field-name ("T" entry) as the key in the form_data dict, implicitly assuming that the field name is globally unique within a document. That's not a correct assumption; nested field names are often simply a numeric index like 1 or 0. (Nested fields were previously ignored.)

The prior example also entirely ignored the TU entry alternate field name.

jsvine commented 11 months ago

Thanks! A few questions:

jeremybmerrill commented 11 months ago

Given that this does seem to handle nested fields, should we cut this line from right above the code, or is it still true in some way?: "You may have to modify this script to handle cases like nested fields (see page 676 of the specification)."

Yes!

Given your (helpful) familiarity with forms, do you think it'd be possible/wise/unwise to decode the bytes-strings in the results to standard unicode strings?

Let's not overstate my familiarity with this domain :)

I have code to do this. Happy to contribute it, it just seemed a bit out-of-domain. Is there built-in code to pdfplumber for decoding bytestring in the general case? (I just have some munged-together garbage; unclear what's general and what's specific to my specific set of PDFs).

Would you be open to adding a bit of documentation re. what some sample output looks like and what each item in a result represents?

sure!

jsvine commented 11 months ago

Thanks for these updates! One bit of follow-up:

Is there built-in code to pdfplumber for decoding bytestring in the general case?

There is, although I don't know whether it'd work equally well with form data. I'm not 100% confident it would work equally well for form data, but want to give it a shot?

If this feels out of scope or doesn't work, I'm happy just to merge as-is.

jsvine commented 11 months ago

Thanks for that commit! I think the example output might now need tweaking, though?

codecov[bot] commented 11 months ago

Codecov Report

Merging #958 (a97ff07) into stable (7c2d46b) will not change coverage. The diff coverage is n/a.

:exclamation: Current head a97ff07 differs from pull request most recent head 441acdc. Consider uploading reports for the commit 441acdc to get more accurate results

@@            Coverage Diff            @@
##            stable      #958   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           18        18           
  Lines         1588      1588           
=========================================
  Hits          1588      1588           
jeremybmerrill commented 11 months ago

adjusted to use resolve_and_decode (which simplifies by project code substantially too, so ty! :) )

it mostly worked for my project as a drop-in replacement for my custom cleaners. It failed when some form field names include other garbage, e.g. 'DATE Of ACCIDENT\x10\x10\x10\x10\x10\x10\x10\x10\x10\x10\x10\x10\x10\x10\x10\x10/Section 2. Accident information vehicle one enter DATE Of ACCIDENT\x0e\x0e\x0e\x0e\x0e\x0e\x0e\x0e\x0e\x0e\x0e\x0e\x0e\x0e',

Is that common, in your experience? would it be worth removing that \x10 and \x0e crap in resolve_and_decode?

(The only PDF forms I've ever parsed is this one specific set of PDF forms, so I'm deep but not broad)

jeremybmerrill commented 11 months ago

@jsvine this is what I'm using to remove the garbage:

def remove_control_characters(s):
    if s is None: return s
    return "".join(ch for ch in s if unicodedata.category(ch)[0]!="C")

curious if that would better belong in resolve_and_decode, or at least in the utils file (rather than living just in the README)

jsvine commented 11 months ago

Thanks! Couple of follow-ups below.

Is that common, in your experience? would it be worth removing that \x10 and \x0e crap in resolve_and_decode?

Hmmm! I haven't come across that much, if at all. Maybe something specific-ish to your set of forms? For that reason, at least for now, I'd lean toward leaving that cruft-reduction out of the README example, though I'm persuadable otherwise. What do you think?

Other than that, just one more bit I noticed, after testing the new README code out on a couple of PDFs (1, 2):

# I.e., adding resolve to imports
from pdfplumber.utils.pdfinternals import resolve, resolve_and_decode

# This line seems to have been accidentally cut, 
# though now realizing it needed to be tweaked slightly, 
# to resolve(...) instead of [...].resolve()
fields = resolve(pdf.doc.catalog["AcroForm"])["Fields"]
jeremybmerrill commented 11 months ago

Done!

Let's leave out the control-character removal stuff for now, probably idiosyncratic.

jsvine commented 11 months ago

Great, thanks, and merging!

jeremybmerrill commented 11 months ago

Thank you for building and maintaining this tool!