Open piotroxp opened 1 year ago
Can you share a PDF that this happens with?
Sending over some example errors I'm running into along with the files @jrmuizel
Thread started to search in: /nvme2/test/unicode
/nvme2/test/unicode/IMX8QMAEC-1815333.pdf
Unicode mismatch (this line repeated 171 times)
thread '<unnamed>' panicked at 'missing char 49 in map {32: "X", 12: "V", 1: "1", 29: "P", 21: "C", 26: "L", 36: "Y", 34: "G", 3: "3", 7: "7", 15: "M", 23: "E", 14: "_", 17: "N", 2: "2", 5: "5", 24: "T", 6: "6", 19: "D", 8: "8", 11: "A", 18: "U", 37: "Z", 25: "B", 33: "O", 4: "4", 28: "K", 10: "0", 35: "J", 16: "I", 9: "9", 20: "H", 13: "S", 31: "Q", 30: "W", 27: "F", 22: "R"} for <</BaseFont /QCGLDF+ArialMT/Encoding /WinAnsiEncoding/FirstChar 32/FontDescriptor 831 0 R/LastChar 95/Subtype /TrueType/ToUnicode 833 0 R/Type /Font/Widths [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 556 556 556 556 556 556 556 556 556 556 0 0 0 0 0 0 0 666 666 722 722 666 610 777 722 277 500 666 556 833 722 777 666 777 722 666 610 722 666 943 666 666 610 0 0 0 0 556]>>', /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:746:27
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Search results have been zipped to: search_results_20230623211811.zip
Backtrace for above:
thread '<unnamed>' panicked at 'missing char 49 in map {30: "W", 6: "6", 27: "F", 32: "X", 7: "7", 26: "L", 3: "3", 29: "P", 25: "B", 20: "H", 35: "J", 11: "A", 16: "I", 18: "U", 21: "C", 24: "T", 2: "2", 36: "Y", 19: "D", 22: "R", 34: "G", 37: "Z", 23: "E", 12: "V", 10: "0", 33: "O", 17: "N", 8: "8", 31: "Q", 1: "1", 15: "M", 13: "S", 28: "K", 5: "5", 9: "9", 14: "_", 4: "4"} for <</BaseFont /QCGLDF+ArialMT/Encoding /WinAnsiEncoding/FirstChar 32/FontDescriptor 831 0 R/LastChar 95/Subtype /TrueType/ToUnicode 833 0 R/Type /Font/Widths [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 556 556 556 556 556 556 556 556 556 556 0 0 0 0 0 0 0 666 666 722 722 666 610 777 722 277 500 666 556 833 722 777 666 777 722 666 610 722 666 943 666 666 610 0 0 0 0 556]>>', /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:746:27
stack backtrace:
0: rust_begin_unwind
at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:575:5
1: core::panicking::panic_fmt
at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/panicking.rs:64:14
2: <pdf_extract::PdfSimpleFont as pdf_extract::PdfFont>::decode_char
at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:746:27
3: <dyn pdf_extract::PdfFont>::decode::{{closure}}
at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:714:54
4: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &mut F>::call_once
at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/ops/function.rs:629:13
5: core::option::Option<T>::map
at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/option.rs:925:29
6: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::next
at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/iter/adapters/map.rs:103:9
7: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter_nested::SpecFromIterNested<T,I>>::from_iter
at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/alloc/src/vec/spec_from_iter_nested.rs:26:32
8: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/alloc/src/vec/spec_from_iter.rs:33:9
9: <alloc::vec::Vec<T> as core::iter::traits::collect::FromIterator<T>>::from_iter
at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/alloc/src/vec/mod.rs:2748:9
10: core::iter::traits::iterator::Iterator::collect
at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/iter/traits/iterator.rs:1836:9
11: <dyn pdf_extract::PdfFont>::decode
at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:714:23
12: pdf_extract::show_text
at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:1144:19
13: pdf_extract::Processor::process_stream
at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:1534:29
14: pdf_extract::Processor::process_stream
at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:1710:21
15: pdf_extract::output_doc
at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:2118:9
16: pdf_extract::extract_text_from_mem
at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:2076:9
17: pdf_search::search_phrase_in_pdf
at ./src/main.rs:24:25
18: pdf_search::search_pdf_files
at ./src/main.rs:49:50
19: pdf_search::main::{{closure}}
at ./src/main.rs:156:13
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Search results have been zipped to: search_results_20230623212236.zip
Thread started to search in: /nvme2/test/threaderror/
/nvme2/test/threaderror/Bloom.pdf
thread '<unnamed>' panicked at 'no widths', /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:570:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Search results have been zipped to: search_results_20230623211904.zip
Backtrace:
Thread started to search in: /nvme2/test/threaderror/
/nvme2/test/threaderror/Bloom.pdf
thread '<unnamed>' panicked at 'no widths', /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:570:13
stack backtrace:
0: std::panicking::begin_panic
at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:607:12
1: pdf_extract::PdfSimpleFont::new
at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:570:13
2: pdf_extract::make_font
at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:327:17
3: pdf_extract::Processor::process_stream::{{closure}}
at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:1554:84
4: std::collections::hash::map::Entry<K,V>::or_insert_with
at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/collections/hash/map.rs:2559:43
5: pdf_extract::Processor::process_stream
at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:1554:32
6: pdf_extract::output_doc
at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:2118:9
7: pdf_extract::extract_text_from_mem
at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:2076:9
8: pdf_search::search_phrase_in_pdf
at ./src/main.rs:24:25
9: pdf_search::search_pdf_files
at ./src/main.rs:49:50
10: pdf_search::main::{{closure}}
at ./src/main.rs:156:13
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Search results have been zipped to: search_results_20230623212335.zip
qt7vq3z6v1_noSplash_11cf93c4e513781acd1abae3cbe4e90d.pdf
/nvme2/test/other1/qt7vq3z6v1_noSplash_11cf93c4e513781acd1abae3cbe4e90d.pdf
thread '<unnamed>' panicked at 'missing char 104 in map {217: "Ψ", 56: "\u{f8f1}", 50: "\u{f8ee}", 55: "\u{f8fa}", 208: "Γ", 218: "Ω", 66: "\u{f8ec}", 159: "√", 211: "Λ", 64: "\u{f8ed}", 213: "Π", 60: "\u{f8f2}", 210: "Θ", 62: "\u{f8f4}", 214: "Σ", 57: "\u{f8fc}", 160: " ", 212: "Ξ", 216: "Φ", 54: "\u{f8ef}", 58: "\u{f8f3}", 209: "∆", 61: "\u{f8fd}", 48: "\u{f8eb}", 215: "Υ", 59: "\u{f8fe}", 51: "\u{f8f9}", 52: "\u{f8f0}", 67: "\u{f8f7}", 63: "\u{f8e6}", 65: "\u{f8f8}", 49: "\u{f8f6}", 53: "\u{f8fb}"} for <</BaseFont /BJDRNW+CMEX10/FirstChar 0/FontDescriptor 1542 0 R/LastChar 125/Subtype /Type1/ToUnicode 332 0 R/Type /Font/Widths 1523 0 R>>', /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:746:27
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Search results have been zipped to: search_results_20230623212029.zip
Backtrace
/nvme2/test/other1/qt7vq3z6v1_noSplash_11cf93c4e513781acd1abae3cbe4e90d.pdf
thread '<unnamed>' panicked at 'missing char 104 in map {211: "Λ", 48: "\u{f8eb}", 64: "\u{f8ed}", 56: "\u{f8f1}", 214: "Σ", 212: "Ξ", 60: "\u{f8f2}", 53: "\u{f8fb}", 49: "\u{f8f6}", 209: "∆", 215: "Υ", 208: "Γ", 62: "\u{f8f4}", 59: "\u{f8fe}", 66: "\u{f8ec}", 52: "\u{f8f0}", 50: "\u{f8ee}", 217: "Ψ", 54: "\u{f8ef}", 159: "√", 210: "Θ", 65: "\u{f8f8}", 55: "\u{f8fa}", 216: "Φ", 57: "\u{f8fc}", 61: "\u{f8fd}", 160: " ", 58: "\u{f8f3}", 63: "\u{f8e6}", 218: "Ω", 213: "Π", 67: "\u{f8f7}", 51: "\u{f8f9}"} for <</BaseFont /BJDRNW+CMEX10/FirstChar 0/FontDescriptor 1542 0 R/LastChar 125/Subtype /Type1/ToUnicode 332 0 R/Type /Font/Widths 1523 0 R>>', /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:746:27
stack backtrace:
0: rust_begin_unwind
at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:575:5
1: core::panicking::panic_fmt
at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/panicking.rs:64:14
2: <pdf_extract::PdfSimpleFont as pdf_extract::PdfFont>::decode_char
at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:746:27
3: <dyn pdf_extract::PdfFont>::decode::{{closure}}
at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:714:54
4: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &mut F>::call_once
at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/ops/function.rs:629:13
5: core::option::Option<T>::map
at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/option.rs:925:29
6: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::next
at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/iter/adapters/map.rs:103:9
7: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter_nested::SpecFromIterNested<T,I>>::from_iter
at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/alloc/src/vec/spec_from_iter_nested.rs:26:32
8: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter
at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/alloc/src/vec/spec_from_iter.rs:33:9
9: <alloc::vec::Vec<T> as core::iter::traits::collect::FromIterator<T>>::from_iter
at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/alloc/src/vec/mod.rs:2748:9
10: core::iter::traits::iterator::Iterator::collect
at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/iter/traits/iterator.rs:1836:9
11: <dyn pdf_extract::PdfFont>::decode
at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:714:23
12: pdf_extract::show_text
at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:1144:19
13: pdf_extract::Processor::process_stream
at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:1504:41
14: pdf_extract::output_doc
at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:2118:9
15: pdf_extract::extract_text_from_mem
at /home/piotro/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.5/src/lib.rs:2076:9
16: pdf_search::search_phrase_in_pdf
at ./src/main.rs:24:25
17: pdf_search::search_pdf_files
at ./src/main.rs:49:50
18: pdf_search::main::{{closure}}
at ./src/main.rs:156:13
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Search results have been zipped to: search_results_20230623212426.zip
Hi everyone, I am getting a Unicode mismatch error too. It makes the entire program quit via a panic. Is it possible to figure out beforehand whether a PDF is suitable for the crate, or to return an error instead of panicking, so that we can keep processing the rest of the files?
Here is what I got for my error
Unicode mismatch
Unicode mismatch
Unicode mismatch
Unicode mismatch
thread 'RUST_BACKTRACE=1
environment variable to display a backtrace
@sagarp-patel probably the pdf file would be good as well
@piotroxp I think I found a solution that works for now. You can handle the panics using std::panic::catch_unwind. This also lets you keep whatever data was already processed before the panic. So just do:
match std::panic::catch_unwind(move || {
    // Your PDF processing code that panics goes here
}) {
    Ok(data) => { /* handle whatever data was processed */ }
    Err(err) => { /* handle the error */ }
}
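A slightly fuller sketch of the same idea, applied per file so that one bad PDF does not stop the whole run. `extract_text` is a hypothetical stand-in for the real extraction call (for example `pdf_extract::extract_text_from_mem`); it panics on purpose for one input so the example runs without any PDF files. This only works when the binary is built with the default `panic = "unwind"` setting.

```rust
use std::panic;

// Hypothetical stand-in for the real extraction call; it panics for
// one input to imitate a PDF that the library cannot decode.
fn extract_text(bytes: &[u8]) -> String {
    if bytes.starts_with(b"bad") {
        panic!("missing char 49 in map");
    }
    String::from_utf8_lossy(bytes).into_owned()
}

// Run the extraction for every input, turning a panic into an Err
// so the remaining inputs are still processed.
fn extract_all(inputs: &[(&str, Vec<u8>)]) -> Vec<(String, Result<String, String>)> {
    inputs
        .iter()
        .map(|(name, bytes)| {
            let result = panic::catch_unwind(|| extract_text(bytes))
                .map_err(|_| format!("{name}: extraction panicked"));
            (name.to_string(), result)
        })
        .collect()
}

fn main() {
    let inputs = vec![
        ("a.pdf", b"hello".to_vec()),
        ("b.pdf", b"bad font".to_vec()),
        ("c.pdf", b"world".to_vec()),
    ];
    for (name, result) in extract_all(&inputs) {
        match result {
            Ok(text) => println!("{name}: {} bytes of text", text.len()),
            Err(err) => eprintln!("skipped {err}"),
        }
    }
}
```

Collecting a `Result` per file keeps the successful extractions and records which files were skipped, which is closer to "return an error instead of panicking" than a bare `catch_unwind` around the whole run.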
I have a similar error using this PDF: https://www3.weforum.org/docs/WEF_Future_of_Jobs_2020.pdf
Unicode mismatch true fi "fi" Ok("fi") [64257]
Unicode mismatch true fl "fl" Ok("fl") [64258]
Unicode mismatch true fi "fi" Ok("fi") [64257]
Unicode mismatch true fi "fi" Ok("fi") [64257]
thread 'main' panicked at 'color_space [67, 83, 48] "DeviceN" [/DeviceN, [/Black], /DeviceCMYK, 8390 0 R, 8393 0 R]', /workspace/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pdf-extract-0.7.2/src/lib.rs:1428:25
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
I tried std::panic::catch_unwind, as @sagarp-patel mentioned, but I am still getting the error.
```
let text = match std::panic::catch_unwind(move || extract_text_from_mem(&pdf_bytes)) {
    Ok(text) => text,
    Err(err) => {
        eprintln!("Error extracting text: {:?}", err);
        return;
    }
};
```
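One possible explanation, offered as an assumption rather than a diagnosis: `catch_unwind` does not silence the panic. The default panic hook still prints the `thread ... panicked at ...` line to stderr before unwinding starts, so the log can look the same even though the panic was caught and the program kept running. It also catches nothing if the crate is compiled with `panic = "abort"`. The sketch below uses a hypothetical `risky_extract` in place of `extract_text_from_mem`, temporarily replaces the hook to suppress that output, and recovers the panic message from the payload.

```rust
use std::any::Any;
use std::panic;

// Hypothetical stand-in for the real call; always panics, like the log above.
fn risky_extract(_bytes: &[u8]) -> String {
    panic!("color_space DeviceN");
}

// Best-effort conversion of a panic payload into a readable message.
fn payload_message(payload: Box<dyn Any + Send>) -> String {
    if let Some(s) = payload.downcast_ref::<&str>() {
        s.to_string()
    } else if let Some(s) = payload.downcast_ref::<String>() {
        s.clone()
    } else {
        "unknown panic payload".to_string()
    }
}

fn extract_quietly(bytes: &[u8]) -> Result<String, String> {
    // The hook is process-wide, so this also mutes panics on other
    // threads until the previous hook is restored.
    let previous = panic::take_hook();
    panic::set_hook(Box::new(|_| {}));
    let outcome = panic::catch_unwind(|| risky_extract(bytes));
    panic::set_hook(previous);
    outcome.map_err(payload_message)
}

fn main() {
    match extract_quietly(b"%PDF-") {
        Ok(text) => println!("{text}"),
        Err(msg) => eprintln!("Error extracting text: {msg}"),
    }
}
```

Printing the payload with `{:?}` directly, as in the snippet above, only shows `Any { .. }`; downcasting to `&str` or `String` is what recovers the actual message.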
I have created a PDF search application that scours your folders in search of documents and allows you to find keywords in the document.
At first, I was not using this crate, but at some point it turned out that my app was not finding the right wording in the PDFs. https://github.com/piotroxp/pdfscan
I am learning Rust while solving a real-life need: going over terabytes of scientific PDF articles and finding keywords in them.
Since I want to build a warp drive xD and have a very admirable cache of papers, you can understand that it's critical for me to read all files regardless of encoding.
Today marks about 4 hours spent on looking at this error:
For some PDF docs, it works. For others, mainly those downloaded from popular scientific publishers, I am hit with that log.
My repo is attached just so you can understand what I want to achieve.
Where is the issue? I am new to Rust. I'm pretty sure that Rust, being a systems programming language, does supply PDF libs that work regardless of encoding. I could be wrong about that.
How can I fix my code? Ideally, I would like to read in the raw bytes, and only then transform that representation to UTF-8. Right now, I am unable to search through sci papers.
This ticket is created just because I find it amusing and mentally challenging to understand what I am doing wrong. Unless you are doing something wrong, which is also a learning experience.
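On the idea of reading the raw bytes first and converting them to UTF-8 afterwards: that part is easy in Rust, but it generally will not make a PDF searchable. Page text usually sits in compressed content streams and is written as font-specific character codes, which is why an extractor needs the font's encoding or ToUnicode map (the "missing char ... in map" panics in this thread appear to come from that lookup). A minimal sketch of the raw approach, with `lossy_text` and `bytes_contain` as hypothetical helper names:

```rust
// Lossily decode raw bytes as UTF-8; invalid sequences become U+FFFD.
fn lossy_text(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

// Hypothetical helper: naive keyword search over the raw bytes.
fn bytes_contain(bytes: &[u8], keyword: &str) -> bool {
    lossy_text(bytes).contains(keyword)
}

fn main() {
    // Plain, uncompressed text in the file can be found this way...
    let plain = b"BT (warp drive) Tj ET";
    println!("{}", bytes_contain(plain, "warp drive"));

    // ...but bytes that are not valid UTF-8 (for example the start of
    // a compressed stream) only decode to replacement characters.
    let binary = [0x78, 0x9c, 0xff, 0xfe];
    println!("{:?}", lossy_text(&binary));
}
```

So a raw-bytes search can serve as a fallback for the rare uncompressed file, but it is not a substitute for a real text extractor.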