jimmejardine / qiqqa-open-source

The open-sourced version of the award-winning Qiqqa research management tool for Windows
GNU General Public License v3.0

replace old PDF text extract tool `pdfdraw.exe` with new mupdf tooling: old one goes to 100% CPU load for several (nasty) PDFs #322

Open GerHobbelt opened 3 years ago

GerHobbelt commented 3 years ago

Example from the "evil corpus" library: this exact command line locks up one core (several other PDFs show the same behaviour):

"pdfdraw.exe"  -tt   "G:\Qiqqa\evil\Guest\documents\1\1356743483BEE1F1E828DE9613F6F481FF15B87A.pdf" 1,2,3,4

While the old mupdf source and patch were provided when Qiqqa went open source, I haven't been able to rebuild that pdfdraw.exe binary from them.

Meanwhile, mupdf has moved on quite a bit, and new output formats for extracted text (hOCR, etc.) are available. Migrating means extra effort, but this is becoming quite pressing: pdfdraw locks up this way on many PDFs, including ones in my regular libraries.

The net effect is that the fans start up when Qiqqa runs and never stop spinning, while the machine gets slower and slower as more pdfdraw processes pile up in the background. They never terminate, not even after Qiqqa itself has exited.
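
Independent of the tool replacement, the pile-up of runaway processes could be contained by a watchdog around each extraction call. The following is only a minimal sketch of that idea, not Qiqqa's actual code (Qiqqa itself is a Windows application and does not use Python). The helper name `run_with_timeout` and the timeout values are hypothetical, and a busy-looping Python child stands in for a hung `pdfdraw.exe`.

```python
import subprocess
import sys
from typing import Optional, Sequence


def run_with_timeout(
    cmd: Sequence[str], timeout_s: float
) -> Optional[subprocess.CompletedProcess]:
    """Run cmd and return its result, or None if it ran longer than timeout_s.

    subprocess.run() kills the child and waits for it when the timeout
    expires, so a hung extractor does not linger in the background.
    """
    try:
        return subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return None


if __name__ == "__main__":
    # Stand-in for a hung extractor: a child that spins at 100% CPU forever.
    # In the scenario from this issue, cmd would be the pdfdraw.exe command
    # line shown above; the 2-second limit here is an arbitrary assumption.
    hung_cmd = [sys.executable, "-c", "while True: pass"]
    result = run_with_timeout(hung_cmd, timeout_s=2.0)
    if result is None:
        print("extraction timed out and the child process was killed")
    else:
        print("extraction finished with exit code", result.returncode)
```

A timeout only limits the damage; the PDFs that trigger it still yield no extracted text, so it complements the migration to newer mupdf tooling rather than replacing it.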