jrmuizel / pdf-extract

A Rust library for extracting content from PDFs
396 stars, 78 forks

Unsafe get and Missing char #60

Open 0xMimir opened 1 year ago

0xMimir commented 1 year ago

When running examples/extract.rs on blockchain_for_deep_learning.pdf I get the following error:

thread 'main' panicked at 'no entry found for key', src/lib.rs:466:58
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

The error happens at the following line (src/lib.rs, line 466):

dlog!("{} {}", code, unicode_map[&(code as u32)]);

When replaced with

dlog!("{} {}", code, unicode_map.get(&(code as u32)));

the error message changes to

thread 'main' panicked at 'missing char 2 in map {130: " ", 128: "•"} for <</BaseFont /QAJSTB+AdvPSSym/Encoding 1219 0 R/FirstChar 2/FontDescriptor 1221 0 R/LastChar 130/Subtype /Type1/ToUnicode 1202 0 R/Type /Font/Widths [791 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 458 0 0]>>', src/lib.rs:873:21
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
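
For context, the first panic is the standard behavior of indexing a `HashMap` with a key that is not present, while `.get()` returns an `Option` and does not panic. Below is a minimal sketch of the difference, using the two entries printed in the panic message above. The helper `lookup_or_replacement` is hypothetical and not part of pdf-extract, and substituting U+FFFD for an unmapped code is only an assumption about one possible fallback, not what the library does.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of pdf-extract): return the Unicode text
/// mapped to `code`, or U+FFFD (replacement character) if there is no entry.
fn lookup_or_replacement(unicode_map: &HashMap<u32, String>, code: u32) -> &str {
    unicode_map
        .get(&code)
        .map(String::as_str)
        .unwrap_or("\u{FFFD}")
}

fn main() {
    // The same two entries that appear in the map from the panic message.
    let mut unicode_map: HashMap<u32, String> = HashMap::new();
    unicode_map.insert(130, " ".to_string());
    unicode_map.insert(128, "•".to_string());

    // Indexing with a missing key, e.g. `unicode_map[&2]`, would panic with
    // "no entry found for key". `.get()` returns `None` instead.
    assert!(unicode_map.get(&2).is_none());

    // Code 128 is mapped; code 2 is not and falls back to U+FFFD.
    println!("{:?}", lookup_or_replacement(&unicode_map, 128));
    println!("{:?}", lookup_or_replacement(&unicode_map, 2));
}
```

Whether a silent fallback is preferable to a hard error depends on the caller; it keeps extraction going but can hide a font whose ToUnicode map is incomplete.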

0xMimir commented 1 year ago

Here is the PDF causing the issue: blockchain_for_deep_learning.pdf

piotroxp commented 1 year ago

I am running into the same issue for many of my PDFs from scientific publishers.

Regardless of font, for me it's an issue with UTF-8 encoding. At least from a first attempt at a solution, it seems all PDFs need to be converted to UTF-8 on load.

I may be guessing here, as I am also learning Rust at the same time as building a tool for myself.

anagrius commented 1 year ago

Same issue