Provide access to raw content stream

friedelwolff commented 7 years ago

This provides a WordList in the raw order (content stream order) of the document.

izderadicka commented 7 years ago

Could you please add a test case for your changes?

friedelwolff commented 7 years ago

I extended dump_file.py to be able to dump the file in raw order. I also added a test document that will clearly show the difference between the two approaches.

I realise now I changed the signature for the class Document. Should we rather update it so that it doesn't require an update to call sites?
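One way to avoid touching call sites would be to add the new option as a keyword-only argument with a default that keeps the old behaviour. A minimal sketch of the idea follows; `DocumentSketch`, its parameter list and the `raw_order` flag are hypothetical stand-ins for illustration and are not the actual pdfparser signature.

```python
class DocumentSketch:
    """Hypothetical stand-in for Document; not the real pdfparser class."""

    def __init__(self, fname, phys_layout=False, fixed_pitch=0, *, raw_order=False):
        # Existing positional parameters keep their order and defaults,
        # so calls written before the change still work unmodified.
        self.fname = fname
        self.phys_layout = phys_layout
        self.fixed_pitch = fixed_pitch
        # New option (assumed name): keyword-only and off by default,
        # so it cannot be set by accident through a positional argument.
        self.raw_order = raw_order


# An old-style call is unaffected by the new parameter.
old_style = DocumentSketch("test.pdf", True)

# New callers opt in explicitly.
new_style = DocumentSketch("test.pdf", raw_order=True)
```

With this shape, only code that wants the raw order has to change.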

izderadicka commented 7 years ago

Hi,

so if I understand correctly, what you really need is a 'raw stream of characters' as they appear on the page. Then I think a different name for the class would be more appropriate.

RawChars? RawText? Whatever sounds more correct. Since in this case the text is not reconstructed into any logical structures, I think the most logical choice would be an API consistent with Line, e.g. text, char_boxes and char_fonts properties.
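To make the suggestion concrete, here is a rough sketch of what such a class could look like. It is plain Python written only for illustration: the `RawChar` helper, the bounding-box tuple layout and the constructor are assumptions, and only the three property names (`text`, `char_boxes`, `char_fonts`) come from the proposal above.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Assumed layout for a bounding box: (x1, y1, x2, y2).
BBox = Tuple[float, float, float, float]


@dataclass(frozen=True)
class RawChar:
    """Hypothetical record for one character taken from the content stream."""
    char: str
    bbox: BBox
    font: str


class RawText:
    """Characters in content stream order, with no layout reconstruction."""

    def __init__(self, chars: Iterable[RawChar]):
        # Keep the characters exactly in the order they were supplied.
        self._chars: List[RawChar] = list(chars)

    @property
    def text(self) -> str:
        # Concatenate the characters without inserting spaces or newlines.
        return "".join(c.char for c in self._chars)

    @property
    def char_boxes(self) -> List[BBox]:
        # One bounding box per character, parallel to `text`.
        return [c.bbox for c in self._chars]

    @property
    def char_fonts(self) -> List[str]:
        # One font name per character, parallel to `text`.
        return [c.font for c in self._chars]


raw = RawText([
    RawChar("H", (0.0, 0.0, 5.0, 10.0), "Helvetica"),
    RawChar("i", (5.0, 0.0, 7.0, 10.0), "Helvetica"),
])
```

Because the three properties are parallel sequences, code that already walks a Line character by character could handle this object in the same way.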

I.

On 20/01/17 14:00, friedelwolff wrote:

y the current "word" information from the stream, so it isn't really a WordList (as the name suggest). Ma

izderadicka / pdfparser

Provide access to raw content stream #1