jrmuizel / pdf-extract

A rust library for extracting content from pdfs

Handle documents missing colorspace #19

Open eutampieri opened 4 years ago

eutampieri commented 4 years ago

The crate panics with a "missing colorspace" error:


thread 'main' panicked at 'missing colorspace [67, 83, 112]', /home/eugenio/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.1/src/lib.rs:1248:85
stack backtrace:
   0: backtrace::backtrace::libunwind::trace
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/libunwind.rs:88
   1: backtrace::backtrace::trace_unsynchronized
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.40/src/backtrace/mod.rs:66
   2: std::sys_common::backtrace::_print_fmt
             at src/libstd/sys_common/backtrace.rs:77
   3: <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt
             at src/libstd/sys_common/backtrace.rs:61
   4: core::fmt::write
             at src/libcore/fmt/mod.rs:1028
   5: std::io::Write::write_fmt
             at src/libstd/io/mod.rs:1412
   6: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:65
   7: std::sys_common::backtrace::print
             at src/libstd/sys_common/backtrace.rs:50
   8: std::panicking::default_hook::{{closure}}
             at src/libstd/panicking.rs:188
   9: std::panicking::default_hook
             at src/libstd/panicking.rs:205
  10: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:464
  11: std::panicking::continue_panic_fmt
             at src/libstd/panicking.rs:373
  12: std::panicking::begin_panic_fmt
             at src/libstd/panicking.rs:328
  13: pdf_extract::make_colorspace::{{closure}}
             at /home/eugenio/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.1/src/lib.rs:1248
  14: core::option::Option<T>::unwrap_or_else
             at /rustc/73528e339aae0f17a15ffa49a8ac608f50c6cf14/src/libcore/option.rs:419
  15: pdf_extract::make_colorspace
             at /home/eugenio/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.1/src/lib.rs:1248
  16: pdf_extract::Processor::process_stream
             at /home/eugenio/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.1/src/lib.rs:1369
  17: pdf_extract::output_doc
             at /home/eugenio/.cargo/registry/src/github.com-1ecc6299db9ec823/pdf-extract-0.6.1/src/lib.rs:1979
  18: dataset::main
             at src/main.rs:36
  19: std::rt::lang_start::{{closure}}
             at /rustc/73528e339aae0f17a15ffa49a8ac608f50c6cf14/src/libstd/rt.rs:61
  20: std::rt::lang_start_internal::{{closure}}
             at src/libstd/rt.rs:48
  21: std::panicking::try::do_call
             at src/libstd/panicking.rs:287
  22: __rust_maybe_catch_panic
             at src/libpanic_unwind/lib.rs:78
  23: std::panicking::try
             at src/libstd/panicking.rs:265
  24: std::panic::catch_unwind
             at src/libstd/panic.rs:396
  25: std::rt::lang_start_internal
             at src/libstd/rt.rs:47
  26: std::rt::lang_start
             at /rustc/73528e339aae0f17a15ffa49a8ac608f50c6cf14/src/libstd/rt.rs:61
  27: main
  28: __libc_start_main
  29: _start
note: Some details are omitted, run with RUST_BACKTRACE=full for a verbose backtrace.
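
The `[67, 83, 112]` in the panic message looks like the raw bytes of the colorspace name printed with debug formatting (an assumption based on the message, not confirmed against the crate source). A minimal sketch for turning such bytes back into readable text; `name_to_string` is a hypothetical helper, not part of pdf-extract:

```rust
/// Render raw name bytes as text, replacing any invalid UTF-8 lossily.
/// Hypothetical helper for reading the panic message; not a pdf-extract API.
fn name_to_string(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    // Bytes copied from the panic message in the backtrace.
    let name = name_to_string(&[67, 83, 112]);
    println!("{name}"); // prints "CSp"
}
```

Under that assumption, the document refers to a colorspace named `CSp` that the lookup could not find.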

ngirard commented 2 years ago

I'm experiencing the same problem!

Grant-Brinkman commented 2 years ago

I was running into the same problem, and using the fixes suggested in PR #22 fixed it for me. If you only need text extracted, you can also comment out this section of lib.rs:

"CS" => {
    let name = operation.operands[0].as_name().unwrap_or(b"Default");
    gs.stroke_colorspace = make_colorspace(doc, name, resources);
}
"cs" => {
    let name = operation.operands[0].as_name().unwrap_or(b"Default");
    gs.fill_colorspace = make_colorspace(doc, name, resources);
}

I can't guarantee it will do what you want, but it was enough to suppress the error before I applied the suggested fix from PR #22, and it didn't seem to affect text extraction.
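
For anyone who would rather keep the operators in place, the general idea of falling back instead of panicking can be sketched as below. This is not the change from PR #22 and not the crate's actual code: the `ColorSpace` enum, the `resolve_colorspace` function, and the `HashMap` standing in for the page resources are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a colorspace type (hypothetical, not the crate's).
#[derive(Debug, Clone, PartialEq)]
enum ColorSpace {
    DeviceGray,
    DeviceRGB,
    DeviceCMYK,
    /// Placeholder for a name that could not be resolved.
    Unknown(String),
}

/// Resolve a colorspace name, returning `Unknown` instead of panicking
/// when the name is neither a device colorspace nor listed in `resources`.
fn resolve_colorspace(name: &[u8], resources: &HashMap<Vec<u8>, ColorSpace>) -> ColorSpace {
    match name {
        b"DeviceGray" => ColorSpace::DeviceGray,
        b"DeviceRGB" => ColorSpace::DeviceRGB,
        b"DeviceCMYK" => ColorSpace::DeviceCMYK,
        other => resources.get(other).cloned().unwrap_or_else(|| {
            // Keep the unresolved name for diagnostics rather than aborting.
            ColorSpace::Unknown(String::from_utf8_lossy(other).into_owned())
        }),
    }
}

fn main() {
    let mut resources = HashMap::new();
    resources.insert(b"CS0".to_vec(), ColorSpace::DeviceRGB);

    // "CSp" is not in the resources, so the lookup falls back.
    let cs = resolve_colorspace(b"CSp", &resources);
    println!("{cs:?}");
}
```

Since text extraction does not seem to depend on the colorspace, a placeholder value like this lets processing continue while still recording which name was missing.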