pdf-rs / pdf

Rust library to read, manipulate and write PDF files.
MIT License

trying to extract text, not all strings are present - looks like those with non-latin characters are gone #102

Open Niedzwiedzw opened 3 years ago

Niedzwiedzw commented 3 years ago

I'm unable to provide an example PDF because it contains sensitive data, though :(

s3bk commented 3 years ago

@Niedzwiedzw which approach are you using? I will try to give you instructions on how to get the relevant information without leaking the sensitive data tomorrow.

Niedzwiedzw commented 3 years ago

I've switched to master branch to be able to use named enum-style Ops, but now it doesn't load the document at all

thread 'parser::lotos::test_parser::test_example_files_parse' panicked at 'bad page?:
 Try { file: "/home/niedzwiedz/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/9e56f00/pdf/src/file.rs", line: 94, column: 19, source:
 Try { file: "/home/niedzwiedz/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/9e56f00/pdf/src/object/types.rs", line: 22, column: 42,
 source: FromPrimitive {
   typ: "Option < Content >",
   field: "contents",
   source: TryContext { file: "/home/niedzwiedz/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/9e56f00/pdf/src/content.rs",
  line: 237, column: 21, context: [("op.as_str()", "Ok(\"BI\")")],
  source: MissingEntry { typ: "InlineImage", field: "ColorSpace" } } } } }',
  invoices/src/parser.rs:155:47

s3bk commented 3 years ago

oh wow. an inline image. Will look into that as well.

Niedzwiedzw commented 3 years ago

I'm not creating the documents, and I can imagine the standard compliance for pdf is a MESS. for some context, I'm trying to salvage what I can from some government generated documents :D
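
A minimal sketch of the "salvage what I can" approach: collect the pages that parse and record the ones that fail, instead of panicking on the first bad page. `Page` and `salvage` are hypothetical names made up for illustration and are not part of the pdf crate; the per-page results would come from whatever page iterator is in use.

```rust
/// Hypothetical stand-in for a parsed page; the real crate's page type differs.
#[derive(Debug, PartialEq)]
struct Page {
    text: String,
}

/// Split per-page results into the pages that parsed and (index, error)
/// pairs for the ones that did not, so one bad page does not abort the run.
fn salvage<E>(pages: impl IntoIterator<Item = Result<Page, E>>) -> (Vec<Page>, Vec<(usize, E)>) {
    let mut good = Vec::new();
    let mut bad = Vec::new();
    for (index, page) in pages.into_iter().enumerate() {
        match page {
            Ok(p) => good.push(p),
            Err(e) => bad.push((index, e)),
        }
    }
    (good, bad)
}

fn main() {
    // Simulated results: the middle page fails to parse.
    let pages = vec![
        Ok(Page { text: "page one".into() }),
        Err("MissingEntry"),
        Ok(Page { text: "page three".into() }),
    ];
    let (good, bad) = salvage(pages);
    println!("salvaged {} pages, skipped {:?}", good.len(), bad);
}
```

Keeping the index next to each error makes it possible to report which pages were skipped without sharing the document itself.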

Niedzwiedzw commented 3 years ago

@s3bk https://github.com/sbeckeriv/lopdf/blob/master/src/nom_parser.rs would this be useful to you at all?

s3bk commented 3 years ago

I don't think we are going to switch to nom. It is great, but PDF is a mess and we already have a handwritten parser.

s3bk commented 3 years ago

The PDF Reference lists ColorSpace as a non-optional field of inline images. And I have no intention of allowing various deviations from the specification, as that is a bottomless pit.

s3bk commented 3 years ago

@Niedzwiedzw you are in luck. The color_space field is an Option, so I went ahead and made it optional in inline images.
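
A minimal sketch of what making a field optional means for the error seen in the panic above. This is not the pdf crate's actual code: `Dict`, `required`, and `parse_inline_image` are hypothetical names, and the dictionary is simplified to string keys and values. Only the `MissingEntry { typ, field }` shape is taken from the log. A required entry that is absent produces the error, while an optional one simply becomes `None`.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a PDF dictionary (assumption: string keys and values).
type Dict = HashMap<&'static str, &'static str>;

/// Error shape modeled on the `MissingEntry` variant shown in the panic output.
#[derive(Debug, PartialEq)]
enum ParseError {
    MissingEntry { typ: &'static str, field: &'static str },
    InvalidValue { field: &'static str },
}

/// Hypothetical, reduced inline image: one required and one optional entry.
#[derive(Debug, PartialEq)]
struct InlineImage {
    width: u32,
    color_space: Option<String>,
}

/// Look up a required entry, reporting `MissingEntry` when it is absent.
fn required(dict: &Dict, typ: &'static str, field: &'static str) -> Result<&'static str, ParseError> {
    dict.get(field)
        .copied()
        .ok_or(ParseError::MissingEntry { typ, field })
}

fn parse_inline_image(dict: &Dict) -> Result<InlineImage, ParseError> {
    // Required: a missing `Width` aborts parsing with an error.
    let width = required(dict, "InlineImage", "Width")?
        .parse()
        .map_err(|_| ParseError::InvalidValue { field: "Width" })?;
    // Optional: a missing `ColorSpace` is tolerated and stored as `None`.
    let color_space = dict.get("ColorSpace").map(|s| s.to_string());
    Ok(InlineImage { width, color_space })
}

fn main() {
    let mut dict = Dict::new();
    dict.insert("Width", "16");
    // No `ColorSpace` entry: parsing still succeeds.
    println!("{:?}", parse_inline_image(&dict));
    // No `Width` entry: parsing fails with `MissingEntry`.
    println!("{:?}", parse_inline_image(&Dict::new()));
}
```

The trade-off is the one raised earlier in the thread: each field relaxed this way accepts more out-of-spec files, at the cost of pushing the "is it present?" check onto every consumer of the value.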

Niedzwiedzw commented 3 years ago

so cool thank you so much @s3bk

Niedzwiedzw commented 3 years ago

thread 'parser::lotos::test_parser::test_example_files_parse' panicked at 'bad page?:
 Try { file: "/home/niedzwiedz/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/d09d20e/pdf/src/file.rs", 
line: 94, column: 19, source: 
Try { file: "/home/niedzwiedz/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/d09d20e/pdf/src/object/types.rs", 
line: 22, column: 42, source: FromPrimitive { typ: "Option < Content >", field: "contents", source: 
TryContext { file: "/home/niedzwiedz/.cargo/git/checkouts/pdf-3ef1c528a9b91eec/d09d20e/pdf/src/content.rs", line: 236, 
column: 21, context: [("op.as_str()", "Ok(\"BI\")")], source: MissingEntry { typ: "InlineImage", field: "Decode" } } } } }', 
invoices/src/parser.rs:155:47
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

hmm