izderadicka / pdfparser

Python binding to libpoppler with focus on text extraction

Provide access to raw content stream #1

Closed friedelwolff closed 5 years ago

friedelwolff commented 7 years ago

This provides a WordList in the raw order (content stream order) of the document.

izderadicka commented 7 years ago

Could you please add a test case for your changes?

friedelwolff commented 7 years ago

I extended dump_file.py to be able to dump the file in raw order. I also added a test document that will clearly show the difference between the two approaches.
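To make the difference between the two orderings concrete, here is a minimal sketch in plain Python. It does not use pdfparser at all: the `Word` tuple, the coordinates, and the `reading_order` helper are hypothetical and exist only for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for an extracted word: its text plus the
# top-left position on the page (y grows downwards).
Word = namedtuple("Word", ["text", "x", "y"])

# Words in the order they occur in the content stream (raw order).
# A PDF producer may draw text in any order, e.g. the footer first.
raw_order = [
    Word("footer", 50, 700),
    Word("world", 120, 100),
    Word("Hello", 50, 100),
]


def reading_order(words):
    """Naive layout order: top-to-bottom, then left-to-right."""
    return sorted(words, key=lambda w: (w.y, w.x))


print([w.text for w in raw_order])
print([w.text for w in reading_order(raw_order)])
```

The raw list keeps whatever order the producer wrote, while the sorted list approximates what a reader sees on the page. A real layout analysis is considerably more involved than this two-key sort.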

I realise now I changed the signature for the class Document. Should we rather update it so that it doesn't require an update to call sites?
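One common way to avoid touching call sites is to add the new option as a keyword argument with a default. The sketch below only illustrates that idea: this `Document` class is a hypothetical stand-in, and the `raw_order` parameter name is an assumption, not the actual pdfparser signature.

```python
class Document:
    """Hypothetical stand-in, not the real pdfparser class."""

    def __init__(self, fname, *, raw_order=False):
        # The new option is keyword-only and defaults to the old
        # behaviour, so existing calls keep working unchanged.
        self.fname = fname
        self.raw_order = raw_order


old_style = Document("test.pdf")                  # existing call site
new_style = Document("test.pdf", raw_order=True)  # opts in to raw order
```

With this shape, only callers that want the new behaviour need to change.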

izderadicka commented 7 years ago

Hi,

so if I understand correctly, what you really need is a 'raw stream of characters' as they appear on the page. In that case, I think a more appropriate name for the class is needed.

I.

On 20/01/17 14:00, friedelwolff wrote:

y the current "word" information from the stream, so it isn't really a WordList (as the name suggests). Ma