jrmuizel / pdf-extract

A rust library for extracting content from pdfs

extract_text panics with an assertion failure #14

Open Qingluan opened 5 years ago

Qingluan commented 5 years ago

extract_text panicked at 'assertion failed: xxxx'. The failing assert is in .cargo/git/checkouts/pdf-extract-1e3ad5dc34c14d18/5eca5d5/src/lib.rs:833:17, in code like this:

        let base_name = get_name_string(doc, font, b"BaseFont");
        let descendants = maybe_get_array(doc, font, b"DescendantFonts").expect("Descendant fonts required");
        let ciddict = maybe_deref(doc, &descendants[0]).as_dict().expect("should be CID dict");
        let encoding = maybe_get_obj(doc, font, b"Encoding").expect("Encoding required in type0 fonts");
        dlog!("base_name {} {:?}", base_name, font);

        match encoding {
            &Object::Name(ref name) => {
                let name = pdf_to_utf8(name);
                dlog!("encoding {:?}", name);
                assert!(name == "Identity-H");
            }
            &Object::Stream(ref stream) => {
                let contents = get_contents(stream);
                dlog!("Stream: {}", String::from_utf8(contents.clone()).unwrap());
            }
            _ => { panic!("unsupported encoding {:?}", encoding)}
        }

I guess the font encoding is not UTF-8? Stack trace:

stack backtrace:
   0: std::panicking::default_hook::{{closure}}
   1: std::panicking::default_hook
   2: std::panicking::rust_panic_with_hook
   3: std::panicking::begin_panic
   4: pdf_extract::make_font
   5: pdf_extract::Processor::process_stream
   6: pdf_extract::Processor::process_stream
   7: pdf_extract::output_doc
   8: pdf_extract::extract_text
   9: extract_text::text::Text::from_file
  10: extract_text::main
  11: std::rt::lang_start::{{closure}}
  12: std::panicking::try::do_call
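
For illustration, the encoding check in the excerpt above could report an unsupported encoding as an error instead of asserting, so that a caller of extract_text gets a Result rather than a panic. This is only a minimal sketch and not the library's code: the Encoding enum is a hypothetical stand-in for the Object variants matched in the excerpt, and EncodingError and check_type0_encoding are made-up names. It also assumes a stream encoding is simply accepted, since the excerpt only logs it.

```rust
// Hypothetical stand-in for the `Object` variants matched in the excerpt;
// this is NOT the real type used by pdf-extract.
#[derive(Debug)]
pub enum Encoding {
    Name(Vec<u8>),
    Stream(Vec<u8>),
    Other,
}

// Hypothetical error type describing why an encoding was rejected.
#[derive(Debug, PartialEq)]
pub enum EncodingError {
    UnsupportedName(String),
    UnsupportedObject,
}

// Mirrors the `match encoding { ... }` from the excerpt, but returns an
// error where the original asserts or panics.
pub fn check_type0_encoding(encoding: &Encoding) -> Result<(), EncodingError> {
    match encoding {
        Encoding::Name(name) => {
            // Lossy conversion so that odd bytes in the name cannot panic.
            let name = String::from_utf8_lossy(name);
            if name == "Identity-H" {
                Ok(())
            } else {
                Err(EncodingError::UnsupportedName(name.into_owned()))
            }
        }
        // Assumption: an embedded CMap stream is accepted as-is here.
        Encoding::Stream(_) => Ok(()),
        Encoding::Other => Err(EncodingError::UnsupportedObject),
    }
}
```

With something like this, a font whose Encoding name is anything other than Identity-H would produce an Err value carrying the name, which would also show which encoding the failing PDF actually uses.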