jrmuizel / pdf-extract

A rust library for extracting content from pdfs

Word spacing is not applied correctly #35

Closed badicsalex closed 2 years ago

badicsalex commented 2 years ago

In the show_text code, word spacing is not applied correctly. https://github.com/jrmuizel/pdf-extract/blob/38b1f156857f4d57ddd9d89c194111e23f5cb90e/src/lib.rs#L1158

If you take a look at pdfminer's implementation, word spacing is added on top of character spacing: https://github.com/pdfminer/pdfminer.six/blob/43c8fc8557528463c99598049b7005ae96ab8084/pdfminer/pdfdevice.py#L171 https://github.com/pdfminer/pdfminer.six/blob/43c8fc8557528463c99598049b7005ae96ab8084/pdfminer/pdfdevice.py#L183

The Rust implementation isn't exact in other ways either, because character spacing is added in a somewhat finicky way, but patching the code to use ts.word_spacing + ts.character_spacing was good enough for me.
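For reference, the PDF specification gives the horizontal displacement after a glyph as tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th, where Tw only applies to the single-byte character code 32. A minimal sketch of that rule follows. It is not pdf-extract's actual code: the `TextState` struct, the `glyph_advance` function, and every field other than `character_spacing` and `word_spacing` are hypothetical names chosen for illustration.

```rust
/// Hypothetical, simplified text state. Only `character_spacing` and
/// `word_spacing` are names taken from the issue; the rest are assumptions.
#[derive(Debug, Clone, Copy)]
pub struct TextState {
    pub font_size: f64,          // Tfs
    pub character_spacing: f64,  // Tc, unscaled text space units
    pub word_spacing: f64,       // Tw, unscaled text space units
    pub horizontal_scaling: f64, // Th as a fraction (1.0 means 100%)
}

/// Horizontal advance after showing one glyph:
/// tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th
///
/// `glyph_width` is w0 in text space units (the font's width / 1000).
/// `tj_adjust` is the TJ array adjustment in thousandths (0.0 for plain Tj).
/// `is_single_byte_space` must be true only for the single-byte code 32.
pub fn glyph_advance(
    ts: &TextState,
    glyph_width: f64,
    tj_adjust: f64,
    is_single_byte_space: bool,
) -> f64 {
    // Word spacing is added on top of character spacing, never instead of it.
    let word_spacing = if is_single_byte_space {
        ts.word_spacing
    } else {
        0.0
    };
    ((glyph_width - tj_adjust / 1000.0) * ts.font_size
        + ts.character_spacing
        + word_spacing)
        * ts.horizontal_scaling
}
```

With a 10 pt font, w0 = 0.5, Tc = 0.25 and Tw = 1.0, a space advances by 6.25 units and any other glyph by 5.25, so the space gets both spacings.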

badicsalex commented 2 years ago

The mismatch between pdf-extract's coordinates and the actual PDF coordinates is very apparent in https://github.com/badicsalex/hun_law_rs/blob/master/tests/cheap/data/2010_181_part.pdf at "Németországi Szövetségi Köztársaság,": there is a gap between the "v" and "e" because multiple spaces were used and the error accumulated.
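To illustrate how the error accumulates, here is a rough sketch that compares the two ways of spacing a line. It assumes the unpatched behaviour is "word spacing alone for spaces", which is my reading of the code rather than something verified here. `line_width` and its parameters are hypothetical, every glyph gets the same already-scaled width, and a space is detected with `c == ' '` instead of the single-byte code 32 check a real PDF interpreter needs.

```rust
/// Total advance of `text` under a simplified model where every glyph has
/// the same width (already multiplied by the font size).
///
/// When `add_char_spacing_to_spaces` is false, a space gets word spacing
/// only (the assumed unpatched behaviour). When true, it gets
/// word spacing + character spacing (the patched behaviour).
pub fn line_width(
    text: &str,
    glyph_width: f64,
    char_spacing: f64,
    word_spacing: f64,
    add_char_spacing_to_spaces: bool,
) -> f64 {
    text.chars()
        .map(|c| {
            let spacing = if c == ' ' {
                if add_char_spacing_to_spaces {
                    word_spacing + char_spacing
                } else {
                    word_spacing
                }
            } else {
                char_spacing
            };
            glyph_width + spacing
        })
        .sum()
}
```

In this model the two results differ by exactly one character spacing per space, so a line with four spaces and a character spacing of 0.25 ends up 1.0 unit short. Text positioned after that line segment is then drawn relative to the wrong x coordinate, which shows up as a visible gap or overlap.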