metachris / pdfx

Extract text, metadata and references (pdf, url, doi, arxiv) from PDF. Optionally download all referenced PDFs.
http://www.metachris.com/pdfx
Apache License 2.0

timeout option #43

Open DanielRuf opened 4 years ago

DanielRuf commented 4 years ago

Hi,

pdfx is very helpful for us to analyze a few things. Thanks for creating pdfx.

But we have a small problem. When a PDF file contains a lot of text, pdfx / Python only fails once the "too many recursions" error is thrown.

It would be helpful to have a max-timeout option so that pdfx does not keep trying to parse a file for 45 minutes or more (as happens in our case).
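Until such an option exists, a timeout can be enforced from outside by running the pdfx command as a child process. The following is only a minimal sketch: the helper names `run_with_timeout` and `extract_with_timeout` and the 60-second default are made up for illustration, and it assumes the `pdfx` command is on the PATH and takes the PDF path as its only argument.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (returncode, stdout), or (None, "") on timeout.

    subprocess.run() kills the child process when the timeout expires,
    so a stuck parse does not keep running in the background.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return None, ""
    return proc.returncode, proc.stdout


def extract_with_timeout(pdf_path, timeout_s=60):
    # Assumption: "pdfx <file>" is how the CLI is invoked in your setup.
    return run_with_timeout(["pdfx", pdf_path], timeout_s)
```

A call such as `extract_with_timeout("some.pdf", timeout_s=120)` would then return `(None, "")` for a file that takes longer than two minutes, instead of blocking for 45 minutes.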

And another small question: what is the best way to scan / check many files at once? So far we run single pdfx commands from a bash script and wait until each command has finished. Using the & trick causes issues with the OS job scheduler, and the whole OS freezes.
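One possible way to process many files without starting them all at once (as `&` does) is a small worker pool that caps how many pdfx processes run in parallel. This is a rough sketch under assumptions, not something pdfx provides: `run_one`, `run_batch`, `pdfx_commands`, and the defaults of 2 workers and 60 seconds are hypothetical, and it again assumes `pdfx <file>` is the command to run.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_one(cmd, timeout_s):
    """Return the exit code of cmd, or None if it exceeded timeout_s."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
        return proc.returncode
    except subprocess.TimeoutExpired:
        return None


def run_batch(commands, max_workers=2, timeout_s=60):
    """Run commands with at most max_workers child processes at a time.

    Results come back in the same order as the input commands.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda c: run_one(c, timeout_s), commands))


def pdfx_commands(pdf_paths):
    # Assumption: one "pdfx <file>" invocation per PDF.
    return [["pdfx", str(p)] for p in pdf_paths]
```

Threads are enough here because each worker only waits on a child process. Keeping `max_workers` at or below the number of CPU cores should avoid the overload that launching every file in the background causes.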

metachris commented 3 years ago

Could you post the full stack trace, and perhaps an example PDF? Please reopen the issue with those, thanks 🙏

DanielRuf commented 3 years ago

> Please reopen the issue with those, thanks

Only you can reopen the issue ;-)

Here is an example file:

54013162437.pdf

It produces this stack trace:

pdfx.log