Open Grant-Brinkman opened 2 years ago
Can you attach or link to an example pdf?
Attached are two PDFs, where test.pdf is one generated by Word and version1_3.pdf is the same PDF but converted to version 1.3.
When I read these two PDFs with this Rust library, it does not panic the way it did on the documents I tested earlier. The problem may be specific to those documents, which I cannot share because they contain sensitive information.
It seems that the issue is not strictly the PDF version. The version 1.2 file did lose some data during extraction, but that may be a fluke or a separate issue.
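To check whether the version really correlates with the failures, the header version can be read straight from the first bytes of each file. Below is a minimal sketch, assuming the file starts with the usual `%PDF-M.N` header; the function name `pdf_header_version` is made up for illustration and is not part of pdf-extract. Note that from PDF 1.4 onward the document catalog may carry a `/Version` entry that overrides the header, which this sketch does not look at.

```rust
/// Parse the version out of a `%PDF-M.N` header at the start of a file.
/// Returns `None` if the bytes do not begin with a well-formed header.
/// (Hypothetical helper, not an API of pdf-extract.)
fn pdf_header_version(bytes: &[u8]) -> Option<(u32, u32)> {
    let prefix: &[u8] = b"%PDF-";
    if !bytes.starts_with(prefix) {
        return None;
    }
    let rest = &bytes[prefix.len()..];

    // The version token runs until the first byte that is neither a digit nor a dot.
    let end = rest
        .iter()
        .position(|b| !(b.is_ascii_digit() || *b == b'.'))
        .unwrap_or(rest.len());
    let token = std::str::from_utf8(&rest[..end]).ok()?;

    // Split "1.3" into major and minor parts and parse each as a number.
    let (major, minor) = token.split_once('.')?;
    Some((major.parse().ok()?, minor.parse().ok()?))
}
```

Running this over the failing and non-failing documents would show whether the panics cluster on a particular header version or are spread across versions.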
Can you get a stack from a debug build?
Here is the error that it runs into with RUST_BACKTRACE=1 set.
thread 'main' panicked at 'FirstChar', /Users/admin/Git Repos/rust-based-elasticsearch-pdf-indexer/pdf-extract-custom/src/lib.rs:202:30
stack backtrace:
0: rust_begin_unwind
at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:498:5
1: core::panicking::panic_fmt
at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/panicking.rs:116:14
2: core::panicking::panic_display
at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/panicking.rs:72:5
3: core::panicking::panic_str
at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/panicking.rs:56:5
4: core::option::expect_failed
at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/option.rs:1817:5
5: core::option::Option<T>::expect
at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/option.rs:692:21
6: <T as pdf_extract_custom::FromOptObj>::from_opt_obj
at /Users/admin/Git Repos/rust-based-elasticsearch-pdf-indexer/pdf-extract-custom/src/lib.rs:202:26
7: pdf_extract_custom::get
at /Users/admin/Git Repos/rust-based-elasticsearch-pdf-indexer/pdf-extract-custom/src/lib.rs:276:5
8: pdf_extract_custom::PdfSimpleFont::new
at /Users/admin/Git Repos/rust-based-elasticsearch-pdf-indexer/pdf-extract-custom/src/lib.rs:550:35
9: pdf_extract_custom::make_font
at /Users/admin/Git Repos/rust-based-elasticsearch-pdf-indexer/pdf-extract-custom/src/lib.rs:324:17
10: pdf_extract_custom::Processor::process_stream::{{closure}}
at /Users/admin/Git Repos/rust-based-elasticsearch-pdf-indexer/pdf-extract-custom/src/lib.rs:1515:84
11: std::collections::hash::map::Entry<K,V>::or_insert_with
at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/collections/hash/map.rs:2372:43
12: pdf_extract_custom::Processor::process_stream
at /Users/admin/Git Repos/rust-based-elasticsearch-pdf-indexer/pdf-extract-custom/src/lib.rs:1515:32
13: pdf_extract_custom::output_doc
at /Users/admin/Git Repos/rust-based-elasticsearch-pdf-indexer/pdf-extract-custom/src/lib.rs:2068:9
14: pdf_extract_custom::extract_text
at /Users/admin/Git Repos/rust-based-elasticsearch-pdf-indexer/pdf-extract-custom/src/lib.rs:2029:9
15: pdf_indexer::main
at ./src/main.rs:122:24
16: core::ops::function::FnOnce::call_once
at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/ops/function.rs:227:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
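Frames 5 to 8 show `Option::expect` failing inside `from_opt_obj`, called from `PdfSimpleFont::new`, with the message `FirstChar`. That is consistent with a font dictionary that has no `FirstChar` entry. Older PDFs often omit `FirstChar`, `LastChar` and `Widths` for the standard 14 fonts, which earlier versions of the PDF specification permitted. The sketch below shows the general idea of treating the entry as optional instead of calling `expect`. It uses a hypothetical `FontDict` map and `CharRange` enum as stand-ins, not the crate's actual types or its real fix.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a PDF font dictionary: entry name -> integer value.
type FontDict = HashMap<String, i64>;

/// Result of looking up the character range of a simple font.
#[derive(Debug, PartialEq)]
enum CharRange {
    /// Both /FirstChar and /LastChar were present.
    Explicit { first: i64, last: i64 },
    /// At least one entry was absent; the caller should fall back to
    /// built-in metrics (for example, those of a standard 14 font).
    Missing,
}

/// Look up the character range without panicking when entries are absent.
fn char_range(font: &FontDict) -> CharRange {
    match (font.get("FirstChar"), font.get("LastChar")) {
        (Some(&first), Some(&last)) => CharRange::Explicit { first, last },
        _ => CharRange::Missing,
    }
}
```

With this shape, a font that lacks the entries yields `CharRange::Missing` and the caller decides how to recover, rather than the whole extraction aborting.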
I created a fork of the project that includes the custom changes I have been using, in case you want to look at the exact version of pdf-extract that I am using.
Forked Repo Here
I am using the most recent version of this crate to extract text from old PDF documents. When processing documents with PDF version 1.3, it consistently panics with the following error:
Not sure if the issue is actually due to the PDF version, but it seems to be a consistent factor across the PDFs that are causing this panic. For anything version 1.4 or newer it seems to have fewer or different issues.
Here is the backtrace as well: