jimmejardine / qiqqa-open-source

The open-sourced version of the award-winning Qiqqa research management tool for Windows
GNU General Public License v3.0

several PDFs caused Qiqqa to run indefinitely after closing it #305

Open GerHobbelt opened 3 years ago

GerHobbelt commented 3 years ago

Continuation of #10, in a sense: different culprit, same pack of background tasks.

Now it turns out that the old pdfdraw -tt (see also #34: this bugger has to go) locks up forever at max CPU on spurious / egregious PDFs. (🎅 isn't the English language fun 🎅 ho ho ho! 🤡 )

That's the text extraction background process going b0rk b0rk b0rk on you. No way out but hard "kill process" for each of these.
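Until the extractor itself is replaced, the manual "kill process" routine could in principle be automated with a watchdog around the background process. The following is only a minimal sketch in Python (Qiqqa itself is a Windows application and does not contain this code); `run_with_watchdog` and its `timeout_s` parameter are hypothetical names invented for illustration, and the command line to pass in is whatever the text extraction task already uses.

```python
import subprocess
import sys


def run_with_watchdog(cmd, timeout_s):
    """Run an external command with a hard time limit.

    Returns (status, stdout), where status is "ok", "failed" or "timeout".
    Hypothetical helper: not part of Qiqqa or MuPDF.
    """
    try:
        done = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child and waits for it before
        # re-raising, so no runaway process is left behind here.
        return "timeout", ""
    status = "ok" if done.returncode == 0 else "failed"
    return status, done.stdout


if __name__ == "__main__":
    # Stand-in for a hung extractor: a child that sleeps far too long.
    hung = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(run_with_watchdog(hung, timeout_s=1)[0])
```

A document that times out this way could be flagged as "text extraction failed" rather than retried forever; choosing a sensible limit for large but legitimate PDFs remains an open question in this sketch.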

Targeted fix

Upgrade/migrate to the latest MuPDF mudraw with hOCR or JSON STEXT output -- the old pdfdraw that comes with current Qiqqa installs is an antique, patched MuPDF tool (#34 + #35) and a lot has changed since then, including the relevant output format for extracted text.

As I intend to support more document types (via the hOCR/HTML fundamental format), Qiqqa should grok the new pdfdraw -o *.ocr.html or similar output.

Also keep in mind the migration from the antique (obsolete) Lucene.NET version to Solr / Elasticsearch: that's #23 + #298 + "Technology areas and their function in Qiqqa" + "Towards migrating the PDF viewer / renderer / text extractor".