Open BaillySylvain opened 1 year ago
A few suggestions -
1 - Don't use an output type of XML; the comments in high_level.py specifically say only 'text' works properly.
2 - If you are extracting text by calling extract_text_to_fp as a library function, the exact call would help diagnose what is happening. In particular, which LAParams are you passing? Typically you would tune the LAParams to the nature of your document, in your case the narrow spacing. The parameters you can set on an LAParams object are:
class LAParams:
    """Parameters for layout analysis

    :param line_overlap: If two characters have more overlap than this they
        are considered to be on the same line. The overlap is specified
        relative to the minimum height of both characters.
    :param char_margin: If two characters are closer together than this
        margin they are considered part of the same line. The margin is
        specified relative to the width of the character.
    :param word_margin: If two characters on the same line are further apart
        than this margin then they are considered to be two separate words, and
        an intermediate space will be added for readability. The margin is
        specified relative to the width of the character.
    :param line_margin: If two lines are close together they are
        considered to be part of the same paragraph. The margin is
        specified relative to the height of a line.
    :param boxes_flow: Specifies how much a horizontal and vertical position
        of a text matters when determining the order of text boxes. The value
        should be within the range of -1.0 (only horizontal position
        matters) to +1.0 (only vertical position matters). You can also pass
        `None` to disable advanced layout analysis, and instead return text
        based on the position of the bottom left corner of the text box.
    :param detect_vertical: If vertical text should be considered during
        layout analysis
    :param all_texts: If layout analysis should be performed on text in
        figures.
    """
3 - If you are uncomfortable with the above, pdfminer comes with a CLI tool, pdf2txt.py, which you can run on the file whose text you wish to extract. It accepts parameters such as --word-margin, which controls when pdfminer treats a physical gap between characters as whitespace.
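For reference, suggestions 2 and 3 might look roughly like the sketch below. It assumes pdfminer.six is installed. The function names, the file name tables.pdf, and the margin values are placeholders for illustration, not recommended settings for your document.

```python
import io


def extract_with_laparams(pdf_path, word_margin=0.1, char_margin=2.0):
    """Extract text via extract_text_to_fp with explicit LAParams.

    The default values mirror pdfminer's own LAParams defaults.
    """
    # Imported lazily so this module loads even where pdfminer.six is absent.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    laparams = LAParams(word_margin=word_margin, char_margin=char_margin)
    out = io.StringIO()
    with open(pdf_path, "rb") as pdf_file:
        # output_type="text" is the mode high_level.py says works properly.
        extract_text_to_fp(pdf_file, out, laparams=laparams, output_type="text")
    return out.getvalue()


def pdf2txt_command(pdf_path, word_margin=0.1, char_margin=2.0):
    """Build a roughly equivalent pdf2txt.py invocation as an argument list."""
    return [
        "pdf2txt.py",
        "--word-margin", str(word_margin),
        "--char-margin", str(char_margin),
        pdf_path,
    ]
```

Calling extract_with_laparams("tables.pdf", word_margin=0.05) would run the library route, and the list returned by pdf2txt_command can be passed to subprocess.run for the CLI route. A smaller word_margin makes pdfminer insert spaces at narrower gaps, which is the direction to try when words run together.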
In general, with idiosyncratic documents you simply have to try different input parameters until you find the combination that extracts the text from your particular document in a sensible way.
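One way to make that trial and error less tedious is to loop over a small grid of values and compare the output by eye. This is only a sketch, assuming pdfminer.six is installed. The helper names and the candidate values are made up for illustration, and you would adjust the grid to your document.

```python
from itertools import product


def candidate_params(word_margins=(0.02, 0.05, 0.1), char_margins=(0.5, 1.0, 2.0)):
    """Yield LAParams keyword dicts for every combination of the given margins."""
    for word_margin, char_margin in product(word_margins, char_margins):
        yield {"word_margin": word_margin, "char_margin": char_margin}


def sweep(pdf_path, preview_chars=200):
    """Extract the PDF once per combination and return (params, preview) pairs."""
    # Imported lazily so this module loads even where pdfminer.six is absent.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    results = []
    for params in candidate_params():
        text = extract_text(pdf_path, laparams=LAParams(**params))
        # Keep only the start of the text so the results are quick to scan.
        results.append((params, text[:preview_chars]))
    return results
```

Printing each pair returned by sweep("tables.pdf") shows how the spacing inside the table changes from one combination to the next.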
Issue: Words Extracted Too Closely Together in Tables with PDFMiner
Problem Description
When extracting text with PDFMiner from a PDF document that contains tables, the words inside the tables are extracted incorrectly: they run together with little or no spacing, which makes the extracted text unusable. This prevents us from properly processing the information in these tables.
Steps to Reproduce
Expected Behavior
We expect PDFMiner to correctly extract text inside tables, preserving the spacing between words, so that the extracted text is usable.
This issue significantly impacts our ability to correctly extract and use data contained within PDF document tables. Any assistance in resolving this problem would be greatly appreciated.
I thank you for the time spent studying this case and hope that a solution can be put in place :)