jimmejardine / qiqqa-open-source

The open-sourced version of the award-winning Qiqqa research management tool for Windows
GNU General Public License v3.0

Should I remove PDFs which always (re)trigger a PDF OCR action? #345

Open mauwig opened 2 years ago

mauwig commented 2 years ago

Hi,

I constantly get this kind of log when my qiqqa crashes:

20210716.231358 [Q] WARN [Daemon.Maintainable:BackgroundWorkerDaemon.DoMaintenance_Infrequent] [731.036M] LibraryIndex::IncrementalBuildNextDocuments: PDF document 5191DE72F37F3F7DC87B55A19F82E83C5DC0A6A7: pages 1,10,341-344 have no text (while pages 2-9,11-340 DO have text!) and will (re)trigger a PDF OCR action. This is probably a document which could not be OCRed properly (for reasons unknown at this time).

And usually dozens of PDFs are reported in the 'recent log' for every crash. I understand that this happens because these pages cannot be OCRed or contain no text at all, so qiqqa will always retrigger OCR for these PDFs, unsuccessfully. Does qiqqa force this retriggering even when OCR is disabled?

My main question is: would qiqqa crash less often if I removed all these apparently troublesome PDFs from my libraries, or would that make little or no difference in performance?

Thanks for any support!

GerHobbelt commented 2 years ago

20210716.231358 [Q] WARN [Daemon.Maintainable:BackgroundWorkerDaemon.DoMaintenance_Infrequent] [731.036M] LibraryIndex::IncrementalBuildNextDocuments: PDF document 5191DE72F37F3F7DC87B55A19F82E83C5DC0A6A7: pages 1,10,341-344 have no text (while pages 2-9,11-340 DO have text!) and will (re)trigger a PDF OCR action. This is probably a document which could not be OCRed properly (for reasons unknown at this time).

I am very interested in receiving any PDFs like that, so if those are not sensitive material and you're okay with sharing them, then please do send them to me (email me (ger at hobbelt.com) a link to a storage location, e.g. a Google Drive folder or similar) so I can have a look!

Please also tell me whether those PDFs can be shared publicly: if so, they will be added to my PDF test corpus so they can be re-used when we run large bulk tests to check the stability and capability of the PDF software tools (currently under development). Corpus: https://github.com/GerHobbelt/Evil-PDF-Library-for-Qiqqa

And usually dozens of PDFs are reported in the 'recent log' for every crash. I understand that this happens because these pages cannot be OCRed or contain no text at all, so qiqqa will always retrigger OCR for these PDFs, unsuccessfully. Does qiqqa force this retriggering even when OCR is disabled?

My main question is: would qiqqa crash less often if I removed all these apparently troublesome PDFs from my libraries, or would that make little or no difference in performance?

Here it becomes a little complicated....

The simplest answer would be 'yes', but I think a bit of an explanation is in order, because the real answer is: 'it depends'.

OK. The current Qiqqa v82 and v83 releases still use a lot of outdated chunks of code and libraries ('legacy software'); experience has shown that there are broadly three categories of trouble with PDFs when it comes to Qiqqa:

The unlisted 'fourth' category consists of (very rare by now) PDFs which happen to make the new mupdf software fail.

All four types are the reason why I created that test corpus: it turned out that PDF rendering (showing on screen) and processing were quite brittle, and I don't want the new tools, which are to replace the old ones in Qiqqa, to be flaky.

❗ This is also why I am very interested in obtaining any obnoxious PDF: adding it to the corpus gives us a fighting chance of ensuring the software quality will improve and stay high over time. (That's not including my own intermittent eff-ups, of course. 😉 )


Then on to what you're seeing with the QiqqaOCR error reports:

One of the mechanisms I added to Qiqqa when working on coping with bad PDF behaviour of the software, including rotten OCR activity, is to 'mark' any PDF that fails to produce legible content text in those background processes, which feed your qiqqa Search Engine (Qiqqa uses Lucene.NET for that, today).

Some PDFs are very obnoxious and the earlier mentioned brittleness comes into play: the way I coded this in v82 and v83 is to have Qiqqa go through the library and Watch Folders (it did that anyway) and retry text extraction on each PDF that's incomplete (as older qiqqa also did already). The difference between v82+ and old (commercial + v80/v81) Qiqqa is that the old ones would retry each troublesome PDF ad nauseam, resulting in high CPU loads while getting stuck on PDFs that wouldn't budge. v82+ marks and mentions the failures in the log and then moves on to the next PDF in the list.

Now, to make sure those PDFs get an automatic chance to be processed again, that 'mark' list is only active while Qiqqa is running. When you close and open qiqqa again, the list is constructed afresh to make sure we don't miss or partially skip over any PDF which has been processed incorrectly or incompletely before for whatever reason, including crashes.
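To make the mark-and-move-on behaviour concrete, here is a minimal Python sketch of the idea. It is not Qiqqa's actual code (Qiqqa is a .NET application); the class name `SessionFailureMarks`, the `run_pass` method and the `extract_text` callback are all hypothetical names chosen for illustration. The only points it tries to show are that a failing PDF is logged once and skipped for the rest of the session, and that the marks live in memory only, so a restart gives every PDF a fresh attempt.

```python
from typing import Callable, Iterable, List, Set


class SessionFailureMarks:
    """Hypothetical sketch: failure marks kept in memory only.

    Nothing is written to disk, so the marks disappear when the process
    exits and every PDF gets a fresh attempt in the next session.
    """

    def __init__(self) -> None:
        self._failed: Set[str] = set()

    def run_pass(
        self,
        incomplete_pdfs: Iterable[str],
        extract_text: Callable[[str], bool],
    ) -> List[str]:
        """Try each incomplete PDF once; return the IDs that newly failed."""
        newly_failed: List[str] = []
        for doc_id in incomplete_pdfs:
            if doc_id in self._failed:
                # Already marked in this session: skip instead of retrying forever.
                continue
            try:
                ok = extract_text(doc_id)
            except Exception:
                # Treat an extractor error the same as "no text produced".
                ok = False
            if not ok:
                self._failed.add(doc_id)
                newly_failed.append(doc_id)
                print(f"WARN: PDF document {doc_id}: text extraction failed; "
                      "skipping it for the rest of this session")
        return newly_failed
```

Creating a new `SessionFailureMarks` object plays the role of restarting the application in this sketch: the set starts empty, so previously failing PDFs are tried again.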

That's where the complicated bit is: I'd rather keep that general behaviour, so that any change in the software will immediately be able to improve the libraries without the need for user intervention. However, there are still PDFs out there which crash the current Qiqqa software: those PDFs have to be removed from the library to make qiqqa "stable", unfortunately. Not a fun job, I'm quick to admit. 😓

I hope I cleared that up and did not add more fog to this complex subject. (Complex because ideally I'd like to keep as many PDFs as possible, even slightly damaged ones, in a user's library, for they got in there for a purpose. Hence my goal to make Qiqqa robust against all this (hence the focus on that large corpus!) so that you can download anything and not worry about this kind of bothersome crap.)

TL;DR

The way I would approach this today is to look in the log for the PDF file paths near the time of a crash and see if there are recurring ones in there (the 'recidivists', so to speak 😉 ), then move those out of the /documents/ library subdirectory and put them somewhere qiqqa won't be looking for them -- the reason being that I'd then be able to re-introduce them when another Qiqqa release happens and I want to check whether qiqqa is now able to process them properly, as intended.
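That triage can be sketched in Python as follows. This is only an illustration under stated assumptions, not a Qiqqa tool: the regular expression is modelled on the single WARN line quoted in this thread, the function names `recurring_documents` and `quarantine` are made up, and the code assumes each PDF is stored somewhere under the /documents/ subdirectory as `<document id>.pdf`, which you should verify against your own library before moving anything. Files are moved, not deleted, so they can be put back for a later release.

```python
import re
import shutil
from collections import Counter
from pathlib import Path
from typing import Iterable, List

# Modelled on the WARN line quoted in this thread: a 40-hex-digit document ID
# followed by the "(re)trigger a PDF OCR action" message.
OCR_WARNING = re.compile(
    r"PDF document ([0-9A-F]{40}):.*\(re\)trigger a PDF OCR action"
)


def recurring_documents(log_lines: Iterable[str], min_hits: int = 2) -> List[str]:
    """Return document IDs reported at least `min_hits` times, most frequent first."""
    hits: Counter = Counter()
    for line in log_lines:
        match = OCR_WARNING.search(line)
        if match:
            hits[match.group(1)] += 1
    return [doc_id for doc_id, count in hits.most_common() if count >= min_hits]


def quarantine(doc_ids: Iterable[str], documents_dir: Path, quarantine_dir: Path) -> List[Path]:
    """Move matching PDFs out of the library; returns their new locations.

    Assumption: files are named '<document id>.pdf' somewhere below documents_dir.
    """
    quarantine_dir.mkdir(parents=True, exist_ok=True)
    moved: List[Path] = []
    for doc_id in doc_ids:
        # Materialise the matches first so moving files does not disturb the walk.
        for pdf in list(documents_dir.rglob(f"{doc_id}.pdf")):
            target = quarantine_dir / pdf.name
            shutil.move(str(pdf), str(target))
            moved.append(target)
    return moved
```

In practice you would feed `recurring_documents` only the log lines from around each crash (for example, collected from several crash sessions), close Qiqqa, and then call `quarantine` with your library's documents directory and a folder outside the library.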

Thanks for any support!

I'm not available every day, regrettably, so sometimes it's slow going, but I hope the above is a useful answer.

If you have any questions or comments / feedback, don't hesitate!

Cheers and happy hunting, 😓 😉

Ger