Closed badicsalex closed 2 years ago
The mismatch between pdf-extract's coordinates and the actual PDF coordinates is very apparent in https://github.com/badicsalex/hun_law_rs/blob/master/tests/cheap/data/2010_181_part.pdf at "Németországi Szövetségi Köztársaság,": there is a gap between the "v" and "e" because multiple spaces were used and the error accumulated.
In the `show_text` code, word spacing is not applied correctly: https://github.com/jrmuizel/pdf-extract/blob/38b1f156857f4d57ddd9d89c194111e23f5cb90e/src/lib.rs#L1158

If you take a look at pdfminer's implementation, word spacing is added on top of character spacing:
https://github.com/pdfminer/pdfminer.six/blob/43c8fc8557528463c99598049b7005ae96ab8084/pdfminer/pdfdevice.py#L171
https://github.com/pdfminer/pdfminer.six/blob/43c8fc8557528463c99598049b7005ae96ab8084/pdfminer/pdfdevice.py#L183
The Rust implementation isn't exact in other ways either, because char spacing is added in a somewhat finicky way, but patching the code to use `ts.word_spacing + ts.character_spacing` was good enough for me.
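
For illustration, here is a minimal sketch of the behaviour described above: for a space glyph, word spacing is added on top of character spacing rather than replacing it. This is not pdf-extract's actual code. The `TextState` struct and the `glyph_advance` helper are hypothetical; only the field names `word_spacing` and `character_spacing` are taken from the patch, and the formula assumes the PDF text-space advance `(w0 * Tfs + Tc + Tw) * Th` with no `TJ` adjustment.

```rust
/// Hypothetical text state for this sketch (not pdf-extract's real struct).
struct TextState {
    font_size: f64,          // Tfs
    character_spacing: f64,  // Tc, applied after every glyph
    word_spacing: f64,       // Tw, applied only after a space glyph
    horizontal_scaling: f64, // Th, where 1.0 means 100%
}

/// Horizontal advance for a single glyph, assuming the advance formula
/// (w0 * Tfs + Tc + Tw) * Th, where Tw is only included for a space.
/// `glyph_width` is w0, the glyph width already divided by 1000.
fn glyph_advance(ts: &TextState, glyph_width: f64, is_space: bool) -> f64 {
    // Character spacing always applies.
    let mut spacing = ts.character_spacing;
    if is_space {
        // Word spacing is added on top of character spacing, not instead of it.
        spacing += ts.word_spacing;
    }
    (glyph_width * ts.font_size + spacing) * ts.horizontal_scaling
}
```

If the character spacing is dropped for spaces, every space is placed slightly short, and the error adds up over a line with several spaces, which would be consistent with the gap seen in the sample PDF.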