Open gwc4github opened 3 years ago
I just ran into this issue as well, so posting a solution for posterity :)
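The text within a Form XObject is nested in an LTFigure. In order to extract the text from Form XObjects, you need to do the following:

1. Add all_texts=True to your LAParams, e.g. laparams = LAParams(all_texts=True)
2. Iterate over the children of each LTFigure to find text. See the snippet below:

    from pdfminer.layout import LAParams, LTFigure, LTTextBox, LTTextLine
    from pdfminer.pdfpage import PDFPage

    # document, interpreter and device are assumed to be set up already,
    # with device being a PDFPageAggregator created with the laparams above.
    extracted_text = ""
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
        layout = device.get_result()
        for lt_obj in layout:
            if isinstance(lt_obj, (LTTextBox, LTTextLine)):
                extracted_text += lt_obj.get_text()
            elif isinstance(lt_obj, LTFigure):
                # Text from a Form XObject sits one level down, inside the figure.
                for lt_obj_inner in lt_obj:
                    if isinstance(lt_obj_inner, (LTTextBox, LTTextLine)):
                        extracted_text += lt_obj_inner.get_text()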
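
The snippet above only looks one level into each LTFigure. If a form nests figures inside figures, a recursive walk may be more robust. The following is a minimal sketch, not part of pdfminer: iter_text is a hypothetical helper, and it assumes duck typing is good enough, i.e. that text-bearing layout objects expose get_text() while pure containers such as LTPage and LTFigure are iterable but have no get_text().

```python
from typing import Any, Iterator


def iter_text(lt_obj: Any) -> Iterator[str]:
    """Yield text from a layout object and everything nested inside it.

    Hypothetical helper: relies on duck typing instead of isinstance checks.
    """
    get_text = getattr(lt_obj, "get_text", None)
    if callable(get_text):
        # Text-bearing object (e.g. LTTextBox, LTTextLine): take its text and
        # do not descend, so the lines inside a box are not counted twice.
        yield get_text()
    elif hasattr(lt_obj, "__iter__"):
        # Container without text of its own (e.g. LTPage, LTFigure): recurse
        # into its children, however deeply they are nested.
        for child in lt_obj:
            yield from iter_text(child)
    # Anything else (lines, rectangles, images) carries no text and is skipped.
```

With this helper, the two nested loops in the snippet could be reduced to extracted_text += "".join(iter_text(layout)). Note that if all_texts=True is not set, the walk would likely fall back to yielding individual characters from inside figures, without any line or box grouping.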
Bug report
A description of the bug
When I convert my PDF to text, a lot of the content is completely missing from the output. All of the missing text seems to be in the form cells, and it all seems to be numeric (wages, SSN, income tax, etc.). Note that this is not an AcroForm, according to the test 'AcroForm' not in res. The form is a "2018 W-2 and EARNINGS SUMMARY". Unfortunately I cannot share the file, but I am hoping someone can give me some suggestions anyway.
Steps to reproduce the bug
To reproduce the error, run the following code after changing the path to where you saved the file: