jrmuizel / pdf-extract

A rust library for extracting content from pdfs
396 stars 78 forks source link

Less panics, add error handling, add tests, re-export lopdf, linting, readme #48

Open joepio opened 1 year ago

joepio commented 1 year ago

Hi there! This PR turned out a little larger than I anticipated, but I think it bings some important improvements:

This PR also makes some other open PRs unnecessary: #25 #45 #44, also has overlap with #49

It still requires some work though. @jrmuizel could you help out to fix the last of the types?

Some open questions / to still be improved upon:

joepio commented 1 year ago

@jrmuizel Could you take a look at my PR? Let me know if anything requires changing. Also, if you'd like help with maintenance of this repo, let me know.

jrmuizel commented 1 year ago

I'll try to take a look this week

joepio commented 1 year ago

@jrmuizel I've made some additional changes this weekend. Let me know if there's anything I can do to help get this merged.

jrmuizel commented 1 year ago

Thanks for your contribution. I finally had a chance to look at this and there's a lot here. I don't have a lot of time for this project so it would be easier for me if you broke it up into individual PRs that I can review and merge incrementally instead of one big one.

A couple of things:

I'm not a huge fan of the mass unwrap removal. Without a fix for https://github.com/rust-lang/rust/issues/54144 it's a lot more painful to debug a propagated error instead of a panic. I wouldn't really recommend running this software in production but if you really want to perhaps you can use catch_unwind. If you pdfs that trigger particular panics I'm happy to fix those on a case by case basis.

Regarding testing, I don't really want to bloat the repo with large test PDFs. Perhaps a good option would be to do what pdf.js does and use urls in pdf.link files (https://github.com/mozilla/pdf.js/tree/master/test/pdfs).

I'd also prefer the changes in the clippy commit be broken out into individual fixes.

Sorry if these requests come across as a bit particular. If you feel differently about them or don't want to spend more time on these changes I totally understand you keeping your own fork. In the mean time I'll continue going through your changes trying to pull out the parts that look good to me.

joepio commented 1 year ago

@jrmuizel Thanks for the thorough response! And good to see you've already merged the readme change.

Without a fix for https://github.com/rust-lang/rust/issues/54144 it's a lot more painful to debug a propagated error instead of a panic

Not sure if I understand this correctly, but I think there are some other approaches that could give you more context while debugging. For example, the anyhow crate provides stack traces within normal Results, using RUST_BACKTRACE=1 and the backtrace feature. Or you could create a custom error, and set a breakpoint in the code related to instantiating that custom error. I really think keeping the .unwrap is not the right call for a published library.

I'd also prefer the changes in the clippy commit be broken out into individual fixes.

Well, that's the first commit. You could cherry pick that one perhaps?

Regarding testing, I don't really want to bloat the repo with large test PDFs. Perhaps a good option would be to do what pdf.js does and use urls in pdf.link files (https://github.com/mozilla/pdf.js/tree/master/test/pdfs).

If we re-fetch all pdf files every test run, testing becomes thousands of times slower, which harms productivity. However, we could store them in a local directory that is in .gitignore.

If you feel differently about them or don't want to spend more time on these changes I totally understand you keeping your own fork

Well, it all boils down to the error handling. If you really want to keep the panics, I think I'll work on the fork.

0xpr03 commented 1 year ago

Without a fix for https://github.com/rust-lang/rust/issues/54144

I'd recommend using something like thiserror, which has support for stabilized Backtraces, while allowing for specific error kinds. So you can just include a backtrace for every error. Otherwise you may use something like the SNAFU crate, which allows backtrace under specific flags, so you can restrict them to debug builds.

I wouldn't really recommend running this software in production but if you really want to perhaps you can use catch_unwind

I strongly recommend against using catch_unwind as some kind of try-catch method for general exceptions. Panics are supposed to be unfixable situations, a PDF parser panicking my complete application by reading some sort of weird PDF is definitely an unhappy surprise (and a major DoS attack path). Even for hobby projects I personally avoid any libraries that do this.

You're definitely free to do otherwise, just voicing my 2 cents.

HarrisDePerceptron commented 1 year ago

this is such a stand thing to do in a library these days. panics really make this unusable. anyhow + thiserro + ? would be great.

wdoppenberg commented 1 year ago

Any word on this? I would love to have proper error propagation. Currently having to do catch_unwind and suppressing println statements for Unicode mismatch is not very idiomatic Rust I'd say.

Christof23 commented 6 months ago

@joepio @HarrisDePerceptron @0xpr03 couldn't agree more on the need for stable libraries and on the use of anyhow + thiserror. As others have stated I'm also hesitant to incorporate this lib into production code if it will end up crashing somewhere down the line.

Is there any update @jrmuizel on whether this can be integrated soon?