maxpmaxp / pdfreader

Python API for PDF documents
MIT License

Overwrite tables in SimplePDFViewer #40

Closed cristianfunes79 closed 4 years ago

cristianfunes79 commented 4 years ago

When parsing a pdf page with 2 tables on it, using SimplePDFViewer.canvas.strings overwrites the content of the first table and just shows the content of the second one.

maxpmaxp commented 4 years ago

@cristianfunes79 Could you attach a sample PDF file?

cristianfunes79 commented 4 years ago

This is the pdf I'm trying to read

Quectel_EC25_EC21_AT_Commands_Manual_V1.0.pdf

In this case I'm reading pages between 14 and 17. This is the output file after reading those pages.

output.txt

This is the script I'm using for that:

command_pdf_generator.txt

cristianfunes79 commented 4 years ago

@maxpmaxp If you could please give me a clue, I can take a look at the source code of SimplePDFViewer to check what's happening.

maxpmaxp commented 4 years ago

@cristianfunes79 There is nothing wrong with pdfreader. I see strings from both tables in your output. The thing is that PDF is not a markup-style language like HTML. A page's content stream actually contains instructions on how to display/print the content. This means that 2 words standing next to each other on a page may come in different places in the page's content stream, or even be in different content streams.

If your goal is to extract the data, I'd suggest parsing canvas.text_content, which you can treat as a kind of markdown (sample for p14 attached).

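from pdfreader import SimplePDFViewer

filename = "Quectel_EC25_EC21_AT_Commands_Manual_V1.0.pdf"

# Keep the file open while the viewer reads from it
with open(filename, "rb") as fd:
    viewer = SimplePDFViewer(fd)
    viewer.navigate(14)  # go to page 14
    viewer.render()      # process the page's content stream
    content = viewer.canvas.text_content  # text commands together with their strings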

All strings come in brackets, followed by a T-command (a text-showing operator such as Tj or TJ). Don't be surprised that words may come as individual characters. Regular expressions will work for picking the strings out.
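As a rough illustration of the regular-expression approach, here is a minimal sketch. The sample fragment, the extract_strings helper and the patterns are made up for this comment and are not part of pdfreader; the real text_content of your page will look different. It only handles literal strings followed by Tj and arrays followed by TJ, and only unescapes parentheses and backslashes (no hex strings, octal escapes, or the ' and " operators).

```python
import re

# Hypothetical fragment in the style of canvas.text_content (not real output).
sample = r"""
BT
/F1 9 Tf
(AT+CFUN) Tj
[(S)-2(e)1(t)] TJ
( phone functionality) Tj
(escaped \(paren\)) Tj
ET
"""

# A literal string: "(" ... ")", allowing backslash-escaped characters inside.
LITERAL = r"\((?:\\.|[^\\()])*\)"
# Either one literal string followed by Tj, or an array [...] followed by TJ.
SHOW_TEXT = re.compile(rf"({LITERAL})\s*Tj|\[((?:{LITERAL}|[^\]()])*)\]\s*TJ")
STRING = re.compile(LITERAL)


def unescape(literal):
    """Drop the outer parentheses and undo \\(, \\) and \\\\ escapes only."""
    return re.sub(r"\\([()\\])", r"\1", literal[1:-1])


def extract_strings(text_content):
    """Return the shown strings in content-stream order (hypothetical helper)."""
    result = []
    for match in SHOW_TEXT.finditer(text_content):
        if match.group(1) is not None:
            # Single string shown with Tj
            result.append(unescape(match.group(1)))
        else:
            # TJ array: join the string pieces, ignoring the kerning numbers
            pieces = STRING.findall(match.group(2))
            result.append("".join(unescape(p) for p in pieces))
    return result


print(extract_strings(sample))
```

Joining the pieces of a TJ array is what puts words back together when they arrive as individual characters. The result is still in content-stream order, which is not necessarily the visual order on the page.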

A detailed description of PDF text is in the PDF specification (see sec. 9).

Have a look at the example in the pdfreader documentation.