jrmuizel / pdf-extract

A rust library for extracting content from pdfs
396 stars 78 forks source link

Error handling - replace `.unwrap` and `panic` with `?` #47

Open joepio opened 1 year ago

joepio commented 1 year ago

Hi there! Thanks for creating an maintaining this :)

I like what this crate does, but I can't really use it as a library in my project because there are many (quite common) paths that will panic. I want to handle errors, not crash my app.

I suggest we change all .unwrap, .expect and panic with ? and Err. We add a custom Result type and implement impl From<&str> for Err for convenience.

UPDATE: I've opened a PR that changes most of the project to return results.

griccardos commented 1 year ago

Agreed, great library, with the potential to be very useful. It works so well so often! Unexpected panics however limit the usefulness even if it only occurs 1 in a 1000 times. PDF is complicated, so a perfect library may never be possible, but the ability to ignore errors if one wishes in their own project is essential. The use of errors bubbling up gives us this ability.

scandox commented 1 month ago

I agree. I have an app processing millions of PDFs, extracting text from them. For my use case if the text cannot be extracted I just move on to the next without hard feelings. With this library I can't take this approach.

jrmuizel commented 1 month ago

Please share any pdfs that panic. One of the main reasons for tending toward panic instead of returning an error is that it increases the likelihood of people reporting pdfs that cause panics.

scandox commented 1 month ago

That is a good strategy! Unfortunately most of what I deal with are not shareable on the face of it. However I will look into whether there is some way to do it. Certainly it would be nice to contribute to a more robust extractor.

jrmuizel commented 1 month ago

@scandox can you share some of the panic locations/messages?

joepio commented 1 month ago

Please share any pdfs that panic. One of the main reasons for tending toward panic instead of returning an error is that it increases the likelihood of people reporting pdfs that cause panics.

This is useful. I think it would make sense to add a panic wrapper that adds an explanation to people to:

scandox commented 1 month ago

Please share any pdfs that panic. One of the main reasons for tending toward panic instead of returning an error is that it increases the likelihood of people reporting pdfs that cause panics.

It looks like there are conditions under which I can share PDFs but not as part of the public repository. If you like you could contact me directly. My email is my username at the well known Google owned email service.

I have about 650 ones that panic. About 22K that cause various unknown errors.