Closed cristianfunes79 closed 4 years ago
@cristianfunes79 Could you attach a pdf file sample?
This is the pdf I'm trying to read
Quectel_EC25_EC21_AT_Commands_Manual_V1.0.pdf
In this case I'm reading pages between 14 and 17. This is the output file after reading those pages.
This is the script I'm using for that:
@maxpmaxp If you could give me a clue, I can take a look at the source code of SimplePDFViewer to check what's happening.
@cristianfunes79 There is nothing wrong with pdfreader. I see strings from both tables in your output. The thing is that PDF is not a markup-style language like HTML. A page's content stream actually contains instructions on how to display/print content. This means that 2 words standing next to each other on a page may come in different places in the page's content stream, and may even be in different content streams.
If your goal is to extract the data, I'd suggest parsing canvas.text_content, which you can treat as a kind of markdown (sample for p14 attached).
from pdfreader import SimplePDFViewer

filename = "Quectel_EC25_EC21_AT_Commands_Manual_V1.0.pdf"

# Keep the file open while the viewer is in use
with open(filename, "rb") as fd:
    viewer = SimplePDFViewer(fd)
    viewer.navigate(14)  # go to page 14
    viewer.render()      # render the page onto viewer.canvas
    content = viewer.canvas.text_content
All strings come in brackets, followed by a T-command. Don't be surprised that words may come as individual characters. Regular expressions will work.
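To illustrate the regular-expression approach, here is a minimal sketch. It assumes the text content looks like ordinary PDF text operators, i.e. `(string) Tj` and `[(a) -10 (b)] TJ`; the `extract_strings` helper and the sample string are made up for illustration and are not part of pdfreader, so the real output for your page may need a different pattern.

```python
import re

# One PDF literal string: "(" ... ")", where a backslash escapes the next character.
# Assumption: strings contain no unescaped nested parentheses.
_LITERAL = r"\((?:\\.|[^\\()])*\)"

# A text-showing operation: an array of literals and kerning numbers followed
# by TJ, or a single literal followed by Tj.
_SHOW = re.compile(r"(\[(?:%s|[^\]()])*\]\s*TJ|%s\s*Tj)" % (_LITERAL, _LITERAL))
_PIECE = re.compile(_LITERAL)
# Only escaped parentheses and backslashes are unescaped; other escape
# sequences (octal codes, \n, ...) are left as they are.
_ESCAPE = re.compile(r"\\([()\\])")


def extract_strings(text_content):
    """Return one joined string per Tj/TJ operation (hypothetical helper)."""
    chunks = []
    for op in _SHOW.findall(text_content):
        pieces = [_ESCAPE.sub(r"\1", p[1:-1]) for p in _PIECE.findall(op)]
        chunks.append("".join(pieces))
    return chunks


# Made-up sample in the style of a page's text content, not taken from the PDF above.
sample = (
    "BT\n/F1 12 Tf\n"
    "[(A)-427(T)-30( Com)5(mand)] TJ\n"
    "0 -14 Td\n"
    "(Syntax \\(test\\)) Tj\nET"
)
print(extract_strings(sample))
```

In practice you would pass `viewer.canvas.text_content` instead of `sample`. Joining the pieces of each TJ array is what puts words back together when they arrive as individual characters.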
A detailed description of PDF text is in the PDF specification (see sec. 9).
Have a look at the example in the pdfreader documentation.
When parsing a pdf page with 2 tables on it, using SimplePDFViewer.canvas.strings overwrites the content of the first table and just shows the content of the second one.