jrmuizel / pdf-extract

A Rust library for extracting content from PDFs

missing char 48 in map #56

Open gravit22 opened 1 year ago

gravit22 commented 1 year ago

The crate panics when I try to extract text from a PDF:

thread 'main' panicked at 'missing char 48 in map {40: "R", 7: "i", 31: "v", 59: "-", 57: "]", 18: "C", 26: "P", 28: "H", 4: "w", 37: "f", 5: "e", 51: "A", 50: "q", 43: "x", 46: "”", 25: "k", 27: ".", 60: "O", 34: "/", 52: "(", 17: "h", 11: "p", 30: "B", 10: " ", 2: "l", 24: "’", 14: "s", 35: "S", 3: "o", 29: "!", 45: "“", 44: "W", 41: "V", 15: "d", 19: "m", 47: "→", 13: "t", 20: "b", 53: ")", 16: "u", 12: "a", 58: "G", 9: "g", 38: "z", 55: ";", 56: "[", 1: "F", 39: "E", 42: "D", 49: "‘", 54: "J", 36: "I", 6: "r", 8: "n", 21: "c", 23: "y", 32: "j", 33: ",", 22: "T"}', /home/mykhailo/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.4/src/lib.rs:733:27

jrmuizel commented 1 year ago

Can you share the pdf?

gravit22 commented 1 year ago

> Can you share the pdf?

Sure. newithkuil_affixes.pdf

jrmuizel commented 1 year ago

The problematic area is the text "[e.g., canyon → rift valley]". The "ft" ligature is causing trouble: its character code (48) has no entry in the map. Most other PDF readers seem to convert this glyph to "0", i.e. "ri0 valley". It'd be nice to do better.

gravit22 commented 1 year ago

Thank you for figuring it out. How can it be fixed?

jrmuizel commented 1 year ago

I'd like to better understand how the other programs are coming up with '0' for that glyph.
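
One observation that may be relevant: 48 is the code point of the ASCII digit '0' (U+0030). A possible explanation, which is an assumption and not confirmed anywhere in this thread, is that other readers fall back to treating an unmapped character code as a Unicode code point. The sketch below shows such a fallback in place of a panic. The function `decode_code` and the `HashMap<u32, String>` layout are hypothetical illustrations, not the pdf-extract API or its internal data structures.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of pdf-extract): look up a character code
/// in a code-to-text map and fall back to a best guess instead of panicking.
fn decode_code(map: &HashMap<u32, String>, code: u32) -> String {
    if let Some(text) = map.get(&code) {
        return text.clone();
    }
    // Assumed fallback: interpret the raw code as a Unicode scalar value.
    // For code 48 this yields '0' (U+0030).
    match char::from_u32(code) {
        Some(c) => c.to_string(),
        // Not a valid scalar value: emit the Unicode replacement character.
        None => '\u{FFFD}'.to_string(),
    }
}

fn main() {
    // A small subset of the map from the panic message above.
    let mut map = HashMap::new();
    map.insert(6, "r".to_string());
    map.insert(7, "i".to_string());

    // "r", "i", then the unmapped code 48 reported in the panic.
    let codes = [6u32, 7, 48];
    let text: String = codes.iter().map(|&c| decode_code(&map, c)).collect();
    println!("{}", text); // prints "ri0"
}
```

This fallback only avoids the panic and reproduces the "ri0" output. It does not recover "ft", which would need some other source of information about the glyph.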