jrmuizel / pdf-extract

A rust library for extracting content from pdfs

Panic at FirstChar #36

Open Grant-Brinkman opened 2 years ago

Grant-Brinkman commented 2 years ago

I am using the most recent version of this crate to extract text from old PDF documents. When processing documents with PDF version 1.3, it consistently panics with the following message:

thread 'main' panicked at 'FirstChar', pdf-extract/src/lib.rs:201:30

I'm not sure the PDF version is actually the cause, but it is the one factor common to all the PDFs that trigger this panic. Documents at version 1.4 or newer seem to have fewer, or different, issues.

Here is the backtrace as well:

   0:        0x1030e5c11 - std::backtrace_rs::backtrace::libunwind::trace::h0b624e35bf84187c
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:        0x1030e5c11 - std::backtrace_rs::backtrace::trace_unsynchronized::h435d9bd636904605
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:        0x1030e5c11 - std::sys_common::backtrace::_print_fmt::h3ca407d645e7e73d
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/sys_common/backtrace.rs:67:5
   3:        0x1030e5c11 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h4f26ffad025fdbe8
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/sys_common/backtrace.rs:46:22
   4:        0x10310586b - core::fmt::write::h0a9937d83d3944c1
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/fmt/mod.rs:1168:17
   5:        0x1030e2a68 - std::io::Write::write_fmt::hfaf2e2e92eda8127
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/io/mod.rs:1660:15
   6:        0x1030e7e87 - std::sys_common::backtrace::_print::h11335bd900abe1ce
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/sys_common/backtrace.rs:49:5
   7:        0x1030e7e87 - std::sys_common::backtrace::print::hdf5291c87f745042
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/sys_common/backtrace.rs:36:9
   8:        0x1030e7e87 - std::panicking::default_hook::{{closure}}::hc11e9b8d348e68b0
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:211:50
   9:        0x1030e7a95 - std::panicking::default_hook::h1d26ec4d0d63be04
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:228:9
  10:        0x1030e8510 - std::panicking::rust_panic_with_hook::hef4f5e524db188b3
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:606:17
  11:        0x1030e823e - std::panicking::begin_panic_handler::{{closure}}::h6e8805ea2351af89
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:502:13
  12:        0x1030e6087 - std::sys_common::backtrace::__rust_end_short_backtrace::hd383ade987b76f63
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/sys_common/backtrace.rs:139:18
  13:        0x1030e7f2a - rust_begin_unwind
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:498:5
  14:        0x1031152cf - core::panicking::panic_fmt::hb58956db718d5b79
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/panicking.rs:116:14
  15:        0x103103c8b - core::panicking::panic_display::hbc9d28d62fda8ebd
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/panicking.rs:72:5
  16:        0x103103c3c - core::panicking::panic_str::h157a3bd169616ebc
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/panicking.rs:56:5
  17:        0x1031151d9 - core::option::expect_failed::h453cfa4fcdc0da1c
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/option.rs:1817:5
  18:        0x102fd79fc - pdf_extract_custom::get::h0049d79db907b50d
  19:        0x102fd9c9c - pdf_extract_custom::make_font::h069cf264ff12e0dd
  20:        0x102fe17f4 - std::collections::hash::map::Entry<K,V>::or_insert_with::h0c1bd9bdfb41b914
  21:        0x102fde982 - pdf_extract_custom::Processor::process_stream::h57aad9bde7a6ebbf
  22:        0x102fe066b - pdf_extract_custom::output_doc::h7674a2a38c26fe1f
  23:        0x102fd43b5 - pdf_extract_custom::extract_text::h893fb460cfdd03c8
  24:        0x102fcd0df - pdf_indexer::main::hb45e438a30676c8a
  25:        0x102fcd796 - std::sys_common::backtrace::__rust_begin_short_backtrace::hf8b6885c183ef9a5
  26:        0x102fcd78c - std::rt::lang_start::{{closure}}::h839ae8a5a873071a
  27:        0x1030e531e - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h1d1e9294d7151cb0
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/ops/function.rs:259:13
  28:        0x1030e531e - std::panicking::try::do_call::h315943602cc1e70c
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:406:40
  29:        0x1030e531e - std::panicking::try::h5be753f80fffd492
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:370:19
  30:        0x1030e531e - std::panic::catch_unwind::h9fdcb02c74b07e26
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panic.rs:133:14
  31:        0x1030e531e - std::rt::lang_start_internal::{{closure}}::h1558447834abc29f
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/rt.rs:128:48
  32:        0x1030e531e - std::panicking::try::do_call::h5721bf6e49d6926d
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:406:40
  33:        0x1030e531e - std::panicking::try::hee7cffb35a5e550d
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:370:19
  34:        0x1030e531e - std::panic::catch_unwind::hf45e91e6006ab16e
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panic.rs:133:14
  35:        0x1030e531e - std::rt::lang_start_internal::h64086fc6655bfbe8
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/rt.rs:128:20
  36:        0x102fcd759 - _main

jrmuizel commented 2 years ago

Can you attach or link to an example pdf?

Grant-Brinkman commented 2 years ago

test.pdf version1_3.pdf

Attached are two PDFs: test.pdf was generated by Word, and version1_3.pdf is the same document converted to PDF version 1.3.

Reading these two PDFs with this library does not trigger the panic that my earlier test documents did. The problem may be specific to those documents, which I am unable to share because they contain sensitive information.

So the issue does not seem to be strictly the PDF version. (Extraction from a version 1.2 file did miss some data, but that may be a fluke or a separate issue.)

jrmuizel commented 2 years ago

Can you get a stack from a debug build?

Grant-Brinkman commented 2 years ago

Here is the error that it runs into with RUST_BACKTRACE=1 set.

thread 'main' panicked at 'FirstChar', /Users/admin/Git Repos/rust-based-elasticsearch-pdf-indexer/pdf-extract-custom/src/lib.rs:202:30
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/panicking.rs:498:5
   1: core::panicking::panic_fmt
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/panicking.rs:116:14
   2: core::panicking::panic_display
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/panicking.rs:72:5
   3: core::panicking::panic_str
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/panicking.rs:56:5
   4: core::option::expect_failed
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/option.rs:1817:5
   5: core::option::Option<T>::expect
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/option.rs:692:21
   6: <T as pdf_extract_custom::FromOptObj>::from_opt_obj
             at /Users/admin/Git Repos/rust-based-elasticsearch-pdf-indexer/pdf-extract-custom/src/lib.rs:202:26
   7: pdf_extract_custom::get
             at /Users/admin/Git Repos/rust-based-elasticsearch-pdf-indexer/pdf-extract-custom/src/lib.rs:276:5
   8: pdf_extract_custom::PdfSimpleFont::new
             at /Users/admin/Git Repos/rust-based-elasticsearch-pdf-indexer/pdf-extract-custom/src/lib.rs:550:35
   9: pdf_extract_custom::make_font
             at /Users/admin/Git Repos/rust-based-elasticsearch-pdf-indexer/pdf-extract-custom/src/lib.rs:324:17
  10: pdf_extract_custom::Processor::process_stream::{{closure}}
             at /Users/admin/Git Repos/rust-based-elasticsearch-pdf-indexer/pdf-extract-custom/src/lib.rs:1515:84
  11: std::collections::hash::map::Entry<K,V>::or_insert_with
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/std/src/collections/hash/map.rs:2372:43
  12: pdf_extract_custom::Processor::process_stream
             at /Users/admin/Git Repos/rust-based-elasticsearch-pdf-indexer/pdf-extract-custom/src/lib.rs:1515:32
  13: pdf_extract_custom::output_doc
             at /Users/admin/Git Repos/rust-based-elasticsearch-pdf-indexer/pdf-extract-custom/src/lib.rs:2068:9
  14: pdf_extract_custom::extract_text
             at /Users/admin/Git Repos/rust-based-elasticsearch-pdf-indexer/pdf-extract-custom/src/lib.rs:2029:9
  15: pdf_indexer::main
             at ./src/main.rs:122:24
  16: core::ops::function::FnOnce::call_once
             at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a/library/core/src/ops/function.rs:227:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Grant-Brinkman commented 2 years ago

I created a fork of the project with the custom changes I have been using, in case you want to look at the exact version of pdf-extract I am running. Forked Repo Here
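
For anyone reading the debug backtrace above: the panic surfaces in `Option::expect`, reached from `get` inside `PdfSimpleFont::new`, which suggests the font dictionary had no `FirstChar` entry. One possible explanation, not confirmed in this thread, is that the PDF specification lets older files omit `FirstChar`, `LastChar`, and `Widths` for the standard 14 fonts. Below is a minimal sketch of that failure mode and of a non-panicking lookup. It is not pdf-extract's code: `FontDict`, `get_required`, `get_optional`, and `char_range` are hypothetical names, and the 0 to 255 fallback range is only an illustrative assumption.

```rust
use std::collections::HashMap;
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Hypothetical stand-in for a PDF font dictionary (key -> integer value).
type FontDict = HashMap<&'static str, i64>;

/// Panicking lookup: a missing key reaches `Option::expect`, which panics
/// with the key name as the message (compare "panicked at 'FirstChar'").
fn get_required(dict: &FontDict, key: &str) -> i64 {
    *dict.get(key).expect(key)
}

/// Non-panicking lookup: the caller decides what a missing key means.
fn get_optional(dict: &FontDict, key: &str) -> Option<i64> {
    dict.get(key).copied()
}

/// Character-code range of a simple font. The 0..=255 fallback is an
/// assumption made for this sketch, not behaviour taken from the library.
fn char_range(dict: &FontDict) -> (i64, i64) {
    let first = get_optional(dict, "FirstChar").unwrap_or(0);
    let last = get_optional(dict, "LastChar").unwrap_or(255);
    (first, last)
}

fn main() {
    let mut complete = FontDict::new();
    complete.insert("FirstChar", 32);
    complete.insert("LastChar", 126);

    // A dictionary with neither entry, as might occur for a standard font.
    let sparse = FontDict::new();

    println!("complete: {:?}", char_range(&complete));
    println!("sparse:   {:?}", char_range(&sparse));

    // The required-style lookup panics on the sparse dictionary; catch the
    // unwind here only so the demonstration can report it and exit cleanly.
    let outcome = catch_unwind(AssertUnwindSafe(|| get_required(&sparse, "FirstChar")));
    println!("get_required panicked: {}", outcome.is_err());
}
```

Returning an `Option` (or a `Result`) from the lookup leaves the decision about a missing entry to the caller, so a single unusual font would not have to abort extraction of the whole document.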