pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

[extracting words from table] #914

Open BaillySylvain opened 1 year ago

BaillySylvain commented 1 year ago

Issue: Words Extracted Too Closely Together in Tables with PDFMiner

Problem Description

When extracting text with PDFMiner from a PDF document that contains tables, the words inside the tables are extracted incorrectly: they run together with no spaces between them, which makes the extracted text unusable and prevents us from processing the information in these tables.

Steps to Reproduce

  1. Call extract_text_to_fp on the given PDF: document.pdf
  2. Observe that the extracted words run together in the XML output (screenshot attached to the original issue)
  3. The run-together words correspond to a table on the page in which the references are indeed printed very close to each other (screenshot attached to the original issue)

Expected Behavior

We expect PDFMiner to correctly extract text inside tables, preserving the spacing between words, so that the extracted text is usable.

This issue significantly impacts our ability to correctly extract and use data contained within PDF document tables. Any assistance in resolving this problem would be greatly appreciated.

I thank you for the time spent studying this case and hope that a solution can be put in place :)

NickFabry commented 1 year ago

A few suggestions -

1 - Don't use an output type of XML; the comments in high_level.py specifically say only 'text' works properly.

2 - If you are extracting text using extract_text_to_fp as a library call, the exact call would help diagnose what is occurring. In particular, which LAParams are you passing? Typically, you tweak the LAParams to suit the nature of your document, in your case the narrow spacing. The parameters you can specify in an LAParams object are:

class LAParams:
    """Parameters for layout analysis

    :param line_overlap: If two characters have more overlap than this they
        are considered to be on the same line. The overlap is specified
        relative to the minimum height of both characters.
    :param char_margin: If two characters are closer together than this
        margin they are considered part of the same line. The margin is
        specified relative to the width of the character.
    :param word_margin: If two characters on the same line are further apart
        than this margin then they are considered to be two separate words, and
        an intermediate space will be added for readability. The margin is
        specified relative to the width of the character.
    :param line_margin: If two lines are close together they are
        considered to be part of the same paragraph. The margin is
        specified relative to the height of a line.
    :param boxes_flow: Specifies how much a horizontal and vertical position
        of a text matters when determining the order of text boxes. The value
        should be within the range of -1.0 (only horizontal position
        matters) to +1.0 (only vertical position matters). You can also pass
        `None` to disable advanced layout analysis, and instead return text
        based on the position of the bottom left corner of the text box.
    :param detect_vertical: If vertical text should be considered during
        layout analysis
    :param all_texts: If layout analysis should be performed on text in
        figures.
    """

3 - If you are uncomfortable with the above, pdfminer comes with a CLI tool, pdf2txt.py, which you can call on the file whose text you wish to extract, with such parameters as --word-margin, which allows you to define when pdfminer will interpret a physical space between characters as whitespace chars.
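To make the effect of word_margin concrete, here is a toy model of the rule as the docstring above words it. It is not pdfminer's actual code: the function name join_chars, the (text, x0, x1) tuples, and the coordinates are all made up for illustration, and the real layout analysis works on richer character objects.

```python
def join_chars(chars, word_margin=0.1):
    """Join (text, x0, x1) character boxes from one line into a string.

    Simplified, hypothetical model: a space is inserted when the gap to the
    previous character is larger than word_margin times the width of the
    current character.
    """
    pieces = []
    prev_x1 = None
    for text, x0, x1 in chars:
        width = x1 - x0
        if prev_x1 is not None and x0 - prev_x1 > word_margin * width:
            pieces.append(" ")
        pieces.append(text)
        prev_x1 = x1
    return "".join(pieces)


# Two table cells, "A1" and "B2", printed only 0.5 units apart.
# Every character is 6 units wide.
cells = [("A", 0.0, 6.0), ("1", 6.0, 12.0), ("B", 12.5, 18.5), ("2", 18.5, 24.5)]

# Threshold is 0.1 * 6 = 0.6, larger than the 0.5 gap: the cells are fused.
print(join_chars(cells, word_margin=0.1))   # A1B2

# Threshold is 0.05 * 6 = 0.3, smaller than the 0.5 gap: a space is inserted.
print(join_chars(cells, word_margin=0.05))  # A1 B2
```

In this model the default of 0.1 is too coarse for cells that sit less than a tenth of a character width apart, which is the kind of situation where lowering --word-margin (or LAParams.word_margin) is worth trying.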

In general, when dealing with idiosyncratic documents, you simply have to try different input parameters until you find the combination that lets you extract the text from your particular document in a sensible way.
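That trial and error can be scripted. The following is a minimal sketch, not an official recipe: the helper names make_pdfminer_extractor and sweep_margins are hypothetical, and ranking settings by word count is only a rough heuristic I am assuming here (more tokens usually means fewer words fused together). It uses extract_text_to_fp with output_type="text" and an LAParams object built from the margins being tried.

```python
import io
from itertools import product


def make_pdfminer_extractor(pdf_path):
    """Build an extract(word_margin, char_margin) callable for one PDF."""
    # Imported lazily so the sweep logic below also works with any other
    # extractor, even where pdfminer.six is not installed.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    def extract(word_margin, char_margin):
        laparams = LAParams(word_margin=word_margin, char_margin=char_margin)
        buffer = io.StringIO()
        with open(pdf_path, "rb") as pdf_file:
            extract_text_to_fp(
                pdf_file, buffer, laparams=laparams, output_type="text"
            )
        return buffer.getvalue()

    return extract


def sweep_margins(extract, word_margins, char_margins):
    """Try every margin combination and count the words each one yields.

    `extract` is any callable taking (word_margin, char_margin) and
    returning the extracted text as a string.
    """
    results = []
    for word_margin, char_margin in product(word_margins, char_margins):
        text = extract(word_margin, char_margin)
        results.append((word_margin, char_margin, len(text.split())))
    # Heuristic (assumption): settings that yield more tokens come first.
    return sorted(results, key=lambda row: row[2], reverse=True)


# Example use, assuming a file named document.pdf exists:
#   extractor = make_pdfminer_extractor("document.pdf")
#   for row in sweep_margins(extractor, [0.02, 0.05, 0.1], [1.0, 2.0]):
#       print(row)
```

The word count only narrows the search. A very small word_margin can also split single words into separate letters, so the top candidates still need to be checked by eye against the table in the PDF.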