jrmuizel / pdf-extract

A rust library for extracting content from pdfs

Sanity Check - Unicode Mismatch #61

Open piotroxp opened 1 year ago

piotroxp commented 1 year ago

I have created a PDF search application that scours your folders for documents and lets you find keywords in them.

At first, I was not using this crate, but at some point it turned out that my app was not finding the right wording in the PDFs. https://github.com/piotroxp/pdfscan

I am learning Rust while solving a real-life need of mine: going through terabytes of scientific PDF articles and finding keywords in them.

Since I want to build a warp drive xD and have a very admirable cache of papers, you can understand that it's critical for me to read all files regardless of encoding.

Today I have spent about 4 hours looking at this error:

Unicode mismatch

For some PDF docs, it works. For others, mainly those downloaded from popular scientific publishers, I am hit with that log.

My repo is attached just so you can understand what I want to achieve.

Where is the issue? I am new to Rust. I'm pretty sure that Rust, being a systems programming language, has PDF libraries that work regardless of encoding. I may be wrong about that.

How can I fix my code? Ideally, I would like to read in the raw bytes and only then transform that representation to UTF-8. Right now, I am unable to search through scientific papers.

I created this ticket just because I find it amusing and mentally challenging to understand what I am doing wrong. Unless you are doing something wrong, which is also a learning experience.

jrmuizel commented 1 year ago

Can you share a PDF that this happens with?

piotroxp commented 1 year ago

Sending over some example errors I'm running into along with the files @jrmuizel

IMX8QMAEC-1815333.pdf

Thread started to search in: /nvme2/test/unicode
/nvme2/test/unicode/IMX8QMAEC-1815333.pdf
Unicode mismatch
[the "Unicode mismatch" line repeats; 171 occurrences in total]
thread '<unnamed>' panicked at 'missing char 49 in map {32: "X", 12: "V", 1: "1", 29: "P", 21: "C", 26: "L", 36: "Y", 34: "G", 3: "3", 7: "7", 15: "M", 23: "E", 14: "_", 17: "N", 2: "2", 5: "5", 24: "T", 6: "6", 19: "D", 8: "8", 11: "A", 18: "U", 37: "Z", 25: "B", 33: "O", 4: "4", 28: "K", 10: "0", 35: "J", 16: "I", 9: "9", 20: "H", 13: "S", 31: "Q", 30: "W", 27: "F", 22: "R"} for <</BaseFont /QCGLDF+ArialMT/Encoding /WinAnsiEncoding/FirstChar 32/FontDescriptor 831 0 R/LastChar 95/Subtype /TrueType/ToUnicode 833 0 R/Type /Font/Widths [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 556 556 556 556 556 556 556 556 556 556 0 0 0 0 0 0 0 666 666 722 722 666 610 777 722 277 500 666 556 833 722 777 666 777 722 666 610 722 666 943 666 666 610 0 0 0 0 556]>>', /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:746:27
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Search results have been zipped to: search_results_20230623211811.zip

Backtrace for above:

thread '<unnamed>' panicked at 'missing char 49 in map {30: "W", 6: "6", 27: "F", 32: "X", 7: "7", 26: "L", 3: "3", 29: "P", 25: "B", 20: "H", 35: "J", 11: "A", 16: "I", 18: "U", 21: "C", 24: "T", 2: "2", 36: "Y", 19: "D", 22: "R", 34: "G", 37: "Z", 23: "E", 12: "V", 10: "0", 33: "O", 17: "N", 8: "8", 31: "Q", 1: "1", 15: "M", 13: "S", 28: "K", 5: "5", 9: "9", 14: "_", 4: "4"} for <</BaseFont /QCGLDF+ArialMT/Encoding /WinAnsiEncoding/FirstChar 32/FontDescriptor 831 0 R/LastChar 95/Subtype /TrueType/ToUnicode 833 0 R/Type /Font/Widths [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 556 556 556 556 556 556 556 556 556 556 0 0 0 0 0 0 0 666 666 722 722 666 610 777 722 277 500 666 556 833 722 777 666 777 722 666 610 722 666 943 666 666 610 0 0 0 0 556]>>', /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:746:27
stack backtrace:
   0: rust_begin_unwind
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/panicking.rs:64:14
   2: <pdf_extract::PdfSimpleFont as pdf_extract::PdfFont>::decode_char
             at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:746:27
   3: <dyn pdf_extract::PdfFont>::decode::{{closure}}
             at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:714:54
   4: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &mut F>::call_once
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/ops/function.rs:629:13
   5: core::option::Option<T>::map
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/option.rs:925:29
   6: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::next
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/iter/adapters/map.rs:103:9
   7: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter_nested::SpecFromIterNested<T,I>>::from_iter
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/alloc/src/vec/spec_from_iter_nested.rs:26:32
   8: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/alloc/src/vec/spec_from_iter.rs:33:9
   9: <alloc::vec::Vec<T> as core::iter::traits::collect::FromIterator<T>>::from_iter
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/alloc/src/vec/mod.rs:2748:9
  10: core::iter::traits::iterator::Iterator::collect
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/iter/traits/iterator.rs:1836:9
  11: <dyn pdf_extract::PdfFont>::decode
             at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:714:23
  12: pdf_extract::show_text
             at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:1144:19
  13: pdf_extract::Processor::process_stream
             at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:1534:29
  14: pdf_extract::Processor::process_stream
             at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:1710:21
  15: pdf_extract::output_doc
             at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:2118:9
  16: pdf_extract::extract_text_from_mem
             at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:2076:9
  17: pdf_search::search_phrase_in_pdf
             at ./src/main.rs:24:25
  18: pdf_search::search_pdf_files
             at ./src/main.rs:49:50
  19: pdf_search::main::{{closure}}
             at ./src/main.rs:156:13
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Search results have been zipped to: search_results_20230623212236.zip

Bloom.pdf

Thread started to search in: /nvme2/test/threaderror/
/nvme2/test/threaderror/Bloom.pdf
thread '<unnamed>' panicked at 'no widths', /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:570:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Search results have been zipped to: search_results_20230623211904.zip

Backtrace:

Thread started to search in: /nvme2/test/threaderror/
/nvme2/test/threaderror/Bloom.pdf
thread '<unnamed>' panicked at 'no widths', /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:570:13
stack backtrace:
   0: std::panicking::begin_panic
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:607:12
   1: pdf_extract::PdfSimpleFont::new
             at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:570:13
   2: pdf_extract::make_font
             at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:327:17
   3: pdf_extract::Processor::process_stream::{{closure}}
             at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:1554:84
   4: std::collections::hash::map::Entry<K,V>::or_insert_with
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/collections/hash/map.rs:2559:43
   5: pdf_extract::Processor::process_stream
             at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:1554:32
   6: pdf_extract::output_doc
             at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:2118:9
   7: pdf_extract::extract_text_from_mem
             at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:2076:9
   8: pdf_search::search_phrase_in_pdf
             at ./src/main.rs:24:25
   9: pdf_search::search_pdf_files
             at ./src/main.rs:49:50
  10: pdf_search::main::{{closure}}
             at ./src/main.rs:156:13
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Search results have been zipped to: search_results_20230623212335.zip

qt7vq3z6v1_noSplash_11cf93c4e513781acd1abae3cbe4e90d.pdf

/nvme2/test/other1/qt7vq3z6v1_noSplash_11cf93c4e513781acd1abae3cbe4e90d.pdf
thread '<unnamed>' panicked at 'missing char 104 in map {217: "Ψ", 56: "\u{f8f1}", 50: "\u{f8ee}", 55: "\u{f8fa}", 208: "Γ", 218: "Ω", 66: "\u{f8ec}", 159: "√", 211: "Λ", 64: "\u{f8ed}", 213: "Π", 60: "\u{f8f2}", 210: "Θ", 62: "\u{f8f4}", 214: "Σ", 57: "\u{f8fc}", 160: " ", 212: "Ξ", 216: "Φ", 54: "\u{f8ef}", 58: "\u{f8f3}", 209: "∆", 61: "\u{f8fd}", 48: "\u{f8eb}", 215: "Υ", 59: "\u{f8fe}", 51: "\u{f8f9}", 52: "\u{f8f0}", 67: "\u{f8f7}", 63: "\u{f8e6}", 65: "\u{f8f8}", 49: "\u{f8f6}", 53: "\u{f8fb}"} for <</BaseFont /BJDRNW+CMEX10/FirstChar 0/FontDescriptor 1542 0 R/LastChar 125/Subtype /Type1/ToUnicode 332 0 R/Type /Font/Widths 1523 0 R>>', /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:746:27
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Search results have been zipped to: search_results_20230623212029.zip

Backtrace

/nvme2/test/other1/qt7vq3z6v1_noSplash_11cf93c4e513781acd1abae3cbe4e90d.pdf
thread '<unnamed>' panicked at 'missing char 104 in map {211: "Λ", 48: "\u{f8eb}", 64: "\u{f8ed}", 56: "\u{f8f1}", 214: "Σ", 212: "Ξ", 60: "\u{f8f2}", 53: "\u{f8fb}", 49: "\u{f8f6}", 209: "∆", 215: "Υ", 208: "Γ", 62: "\u{f8f4}", 59: "\u{f8fe}", 66: "\u{f8ec}", 52: "\u{f8f0}", 50: "\u{f8ee}", 217: "Ψ", 54: "\u{f8ef}", 159: "√", 210: "Θ", 65: "\u{f8f8}", 55: "\u{f8fa}", 216: "Φ", 57: "\u{f8fc}", 61: "\u{f8fd}", 160: " ", 58: "\u{f8f3}", 63: "\u{f8e6}", 218: "Ω", 213: "Π", 67: "\u{f8f7}", 51: "\u{f8f9}"} for <</BaseFont /BJDRNW+CMEX10/FirstChar 0/FontDescriptor 1542 0 R/LastChar 125/Subtype /Type1/ToUnicode 332 0 R/Type /Font/Widths 1523 0 R>>', /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:746:27
stack backtrace:
   0: rust_begin_unwind
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/panicking.rs:64:14
   2: <pdf_extract::PdfSimpleFont as pdf_extract::PdfFont>::decode_char
             at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:746:27
   3: <dyn pdf_extract::PdfFont>::decode::{{closure}}
             at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:714:54
   4: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &mut F>::call_once
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/ops/function.rs:629:13
   5: core::option::Option<T>::map
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/option.rs:925:29
   6: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::next
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/iter/adapters/map.rs:103:9
   7: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter_nested::SpecFromIterNested<T,I>>::from_iter
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/alloc/src/vec/spec_from_iter_nested.rs:26:32
   8: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/alloc/src/vec/spec_from_iter.rs:33:9
   9: <alloc::vec::Vec<T> as core::iter::traits::collect::FromIterator<T>>::from_iter
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/alloc/src/vec/mod.rs:2748:9
  10: core::iter::traits::iterator::Iterator::collect
             at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/iter/traits/iterator.rs:1836:9
  11: <dyn pdf_extract::PdfFont>::decode
             at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:714:23
  12: pdf_extract::show_text
             at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:1144:19
  13: pdf_extract::Processor::process_stream
             at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:1504:41
  14: pdf_extract::output_doc
             at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:2118:9
  15: pdf_extract::extract_text_from_mem
             at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:2076:9
  16: pdf_search::search_phrase_in_pdf
             at ./src/main.rs:24:25
  17: pdf_search::search_pdf_files
             at ./src/main.rs:49:50
  18: pdf_search::main::{{closure}}
             at ./src/main.rs:156:13
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Search results have been zipped to: search_results_20230623212426.zip

sagarp-patel commented 1 year ago

Hi everyone, I am getting a Unicode mismatch error too. It results in the entire program quitting via a panic. Is it possible to figure out beforehand whether a PDF is suitable for the crate? Or, instead of panicking, could the crate return an error so that we can keep processing the rest of the files? Here is what I got for my error:

Unicode mismatch
Unicode mismatch
Unicode mismatch
Unicode mismatch
thread '' panicked at 'unexpected smask type 251 0 R',\pdf-extract-0.6.5\src\lib.rs:1209:24
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

piotroxp commented 1 year ago

@sagarp-patel sharing the PDF file would probably help as well

sagarp-patel commented 1 year ago

@piotroxp I think I found a solution that works for now. You can handle the panics using std::panic::catch_unwind. This will also keep the data from what has already been processed in the PDF so far. So just do:

match std::panic::catch_unwind(move || {
    // Your PDF processing code that panics goes here
}) {
    Ok(data) => { /* handle whatever data was processed */ }
    Err(err) => { /* handle the error */ }
}
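
A slightly fuller sketch of the same idea, using only the standard library. The names `extract_or_error` and `fake_extract` are hypothetical and exist only for illustration; `fake_extract` stands in for a real call such as `extract_text_from_mem` (the function visible in the backtraces above) so that the example runs without a PDF. It converts a panic into an `Err`, so a loop over many files can carry on. This assumes the program is built with the default `panic = "unwind"` strategy.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Runs `extract` and converts a panic into an `Err` carrying the panic message.
fn extract_or_error<F>(extract: F) -> Result<String, String>
where
    F: FnOnce() -> String,
{
    panic::catch_unwind(AssertUnwindSafe(extract)).map_err(|payload| {
        // Panic payloads are usually a `&str` or a `String`.
        if let Some(msg) = payload.downcast_ref::<&str>() {
            msg.to_string()
        } else if let Some(msg) = payload.downcast_ref::<String>() {
            msg.clone()
        } else {
            "unknown panic".to_string()
        }
    })
}

/// Hypothetical stand-in for the real extraction call; panics on empty input.
fn fake_extract(bytes: &[u8]) -> String {
    if bytes.is_empty() {
        panic!("no widths");
    }
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    let files: Vec<(&str, Vec<u8>)> = vec![
        ("good.pdf", b"hello".to_vec()),
        ("bad.pdf", Vec::new()),
        ("also_good.pdf", b"world".to_vec()),
    ];
    // A panic in one file is reported and the loop moves on to the next.
    for (name, bytes) in &files {
        match extract_or_error(|| fake_extract(bytes)) {
            Ok(text) => println!("{name}: extracted {} bytes", text.len()),
            Err(msg) => eprintln!("{name}: skipped, extraction panicked: {msg}"),
        }
    }
}
```

The panic message is still printed to stderr by the default panic hook; only the unwinding is stopped, so the remaining files are still processed.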

saravmajestic commented 1 year ago

I get a similar error with this PDF: https://www3.weforum.org/docs/WEF_Future_of_Jobs_2020.pdf

Unicode mismatch true fi "fi" Ok("fi") [64257]
Unicode mismatch true fl "fl" Ok("fl") [64258]
Unicode mismatch true fi "fi" Ok("fi") [64257]
Unicode mismatch true fi "fi" Ok("fi") [64257]
thread 'main' panicked at 'color_space [67, 83, 48] "DeviceN" [/DeviceN, [/Black], /DeviceCMYK, 8390 0 R, 8393 0 R]', /workspace/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pdf-extract-0.7.2/src/lib.rs:1428:25
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I tried std::panic::catch_unwind as @sagarp-patel mentioned, but I am still getting the error.


```rust
let text = match std::panic::catch_unwind(move || extract_text_from_mem(&pdf_bytes)) {
    Ok(text) => text,
    Err(err) => {
        eprintln!("Error extracting text: {:?}", err);
        return;
    }
};
```
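
One possible explanation, offered as a guess rather than a diagnosis: `catch_unwind` stops the unwinding, but the default panic hook still prints the `thread ... panicked at ...` line to stderr before the closure returns `Err`, so the output can look as if nothing was caught. Separately, `catch_unwind` cannot catch anything when the binary is compiled with `panic = "abort"`. The sketch below uses only the standard library; `quiet_catch` is a hypothetical helper, and `fake_extract` is a stand-in for `extract_text_from_mem` so the example runs without a PDF. It temporarily installs a silent panic hook to show the difference.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical helper: runs `f` and returns `None` if it panicked.
/// The panic hook is swapped for a silent one while `f` runs, so the
/// usual "thread ... panicked at ..." line is not printed.
/// Note: the hook is process-wide, so other threads are affected too.
fn quiet_catch<T>(f: impl FnOnce() -> T) -> Option<T> {
    let previous = panic::take_hook();
    panic::set_hook(Box::new(|_info| {}));
    let result = panic::catch_unwind(AssertUnwindSafe(f));
    panic::set_hook(previous);
    result.ok()
}

/// Hypothetical stand-in for the real extraction call; panics on empty input.
fn fake_extract(bytes: &[u8]) -> String {
    if bytes.is_empty() {
        panic!("color_space not handled");
    }
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    let pdf_bytes: Vec<u8> = Vec::new();
    match quiet_catch(|| fake_extract(&pdf_bytes)) {
        Some(text) => println!("extracted {} bytes", text.len()),
        None => eprintln!("extraction panicked; skipping this file"),
    }
}
```

If the process still exits after the panic message, it is worth checking whether the Cargo profile sets `panic = "abort"`, since no hook or `catch_unwind` can recover in that mode.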