jimmejardine / qiqqa-open-source

The open-sourced version of the award-winning Qiqqa research management tool for Windows
GNU General Public License v3.0

TBD: make Qiqqa cope better with flaky/damaged/b0rked PDF files #13

Closed GerHobbelt closed 5 years ago

GerHobbelt commented 5 years ago

Now that I have access to the Qiqqa source code and have been able to rebuild the binary and extend its logging, I find that quite a lot of my troubles in the past years are due to Qiqqa not coping well with all kinds of broken/b0rked PDF files in the Qiqqa libraries:

  1. #10: several PDFs caused Qiqqa to keep running indefinitely after closing it: every time I had to open the Windows Task Manager and KILL the Qiqqa process or process tree to make it stop. If I didn't do that, Qiqqa would report it's already running when you restart it, necessitating a reboot. Instead, I've executed the Windows equivalent of kill -9 every time I exit/stop Qiqqa.

  2. #16: Qiqqa failed on several occasions with my large PDF collection, causing a permanent and total failure in its search feature, i.e. the Lucene database got nuked/b0rked. All subsequent searches in Qiqqa would deliver ZERO results, quickly.

    • Reindexing via the Qiqqa Tools panel would have no effect:

      Tools > Qiqqa Configuration > Troubleshooting > Rebuild Library Search Indices

    • Manually deleting all the Lucene DB files in base/Guest/index/ would also be to no avail.

    • Reconstructing the Library by importing the PDF files in tiny batches via the Directory Watch feature of Qiqqa would result in 'semi-random behaviour': it now turns out to be highly dependent on which PDF files got loaded first. As soon as an offending PDF (to be uploaded later) got included in the library, the Lucene-backed search facility would break down and stop functioning.

    Note: Pending investigation suspects #11 at least; at the time of this writing #11 has been fixed and this was a required first step towards making the Lucene-backed search feature work and (re)generate a working search index once again.

  3. When using the sniffer (Yay! :smile: Superb Feature!) to fetch additional documents (PDFs), sometimes you'll observe a load failure, where

    • the document would render as a pure-white multi-page document with no content at all, or
    • the document would render as a single pure-white page with no content at all, or
    • the PDF download/fetch operation would lock up and you'd have to kill -9 Qiqqa to stop it. Depending on the alignment of the planets, you'll be able to restart Qiqqa with a functioning or broken 'search' feature then. Waiting on http://website/path.../file.pdf would be shown forever in the status line at the bottom of the main window.
  4. There's no way to dig out these b0rked PDFs from the library and 'select all' the discovered culprits to apply some chosen user activity (delete PDF + library entry, export/dump to diagnostics directory, ...?what you want?...)

    • This is filed as #12, by the way.
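
To give an idea of what 'digging out' b0rked PDFs could look like, here is a minimal sketch in Python (Qiqqa itself is .NET, so this is illustration only, not Qiqqa code). It assumes a very crude heuristic: a sane PDF has a `%PDF-` marker near the start and a `%%EOF` marker near the end. The names `looks_like_pdf` and `find_suspect_pdfs` are made up for this example, and plenty of damaged PDFs will still pass such a check.

```python
from pathlib import Path


def looks_like_pdf(path, probe_bytes=1024):
    """Crude structural check: '%PDF-' near the start, '%%EOF' near the end."""
    pdf_path = Path(path)
    try:
        size = pdf_path.stat().st_size
        with pdf_path.open("rb") as fh:
            head = fh.read(probe_bytes)
            fh.seek(max(0, size - probe_bytes))
            tail = fh.read()
    except OSError:
        # Unreadable or missing files count as suspect too.
        return False
    return b"%PDF-" in head and b"%%EOF" in tail


def find_suspect_pdfs(library_dir):
    """Return all *.pdf files under library_dir that fail the crude check."""
    return sorted(
        p for p in Path(library_dir).rglob("*.pdf") if not looks_like_pdf(p)
    )
```

The resulting list is the kind of 'select all culprits' set that a bulk action (delete, export to a diagnostics directory, ...) could then be applied to.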
GerHobbelt commented 5 years ago

Done as per #33.

Lots of commits related to this issue. This set surely won't cover all as I've had crashes in lots of places during testing a 20K+ library which has collected its own amount of cruft from the Internet and years of Qiqqa fails (Sniffer lockups, download b0rks due to connection failure and what-not, you-name-it 🤡 ):

Revision: dc740d77b3893262fac573523309a617a9c99389 fix/tweak FolderWatcher background task: make sure we AT LEAST process ONE(1) tiny batch of PDF files when there are any to process.

Revision: d59d6f0817b04d61883dafac52d27e4eec27cfd5 fix crash in chat code when Qiqqa is shutting down (+ code review to uncover more spots where this might be happening)

20190804.204351 INFO  [Main] Stopping MaintainableManager
Exception thrown: 'System.NullReferenceException' in Qiqqa.exe
20190804.204351 WARN  [9] There was a problem communicating with chat.
System.NullReferenceException: Object reference not set to an instance of an object.
   at Qiqqa.Chat.ChatControl.ProcessDisplayResponse(MemoryStream ms) in W:\lib\tooling\qiqqa\Qiqqa\Chat\ChatControl.xaml.cs:line 221
   at Qiqqa.Chat.ChatControl.PerformRequest(String url) in W:\lib\tooling\qiqqa\Qiqqa\Chat\ChatControl.xaml.cs:line 127
20190804.204351 WARN  [9] Chat: detected Qiqqa shutting down.

Revision: bab049966ccc758954cd5453972ff707d76ce1c1 code stability: Do not crash/fail when the historical progress file is damaged
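
The general idea behind this kind of fix, sketched in Python for brevity (the actual change is in Qiqqa's .NET code and is not shown here): treat a damaged state file as 'no history' instead of letting the parse error propagate. The JSON format and the name `load_progress` are assumptions made for this example; the real progress file format may well differ.

```python
import json


def load_progress(path, default=None):
    """Load a progress/history file, falling back to a default when it is
    missing, unreadable, or damaged."""
    fallback = {} if default is None else dict(default)
    try:
        with open(path, "r", encoding="utf-8") as fh:
            data = json.load(fh)
    except (OSError, ValueError):
        # OSError: file missing or unreadable.
        # ValueError: covers JSON syntax errors and invalid UTF-8.
        return fallback
    # A syntactically valid file with the wrong shape is also 'damaged'.
    if not isinstance(data, dict):
        return fallback
    return data
```

Losing the history is annoying, but far less so than an application that refuses to start because of it.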

Revision: da3f8531f0e0baf14a45c46db199b4160b6cb3bf corrected Folder Watch loop + checks for https://github.com/jimmejardine/qiqqa-open-source/issues/20: the intent here is very similar to the code done previously for https://github.com/jimmejardine/qiqqa-open-source/issues/17; we just want to add a tiny batch of PDF files from the Watch folder, irrespective of the amount of files waiting there to be added.

Revision: 7bd3ee663db8483ee5acccd1218e1863415df816 more work regarding https://github.com/jimmejardine/qiqqa-open-source/issues/10 and https://github.com/jimmejardine/qiqqa-open-source/issues/17: when you choose to either import a large number of PDF files at once via the Watch Folder feature or have just reset the Watch Directory before exiting Qiqqa, you'll otherwise end up with a long running process where many/all files in the Watched Directories are inspected and possibly imported: this is undesirable when the user has decided Qiqqa should terminate (by clicking close-window or Alt-F4 keyboard shortcut).
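
The pattern described above can be sketched roughly as follows, in Python rather than Qiqqa's own .NET code: the import loop checks a shutdown flag before every file, so a pending exit is honoured after at most one more file. `import_watched_files`, `import_one` and `shutdown` are hypothetical names for this example only.

```python
import threading


def import_watched_files(paths, import_one, shutdown: threading.Event):
    """Import files one by one, stopping as soon as shutdown is signalled.

    Returns the list of paths that were actually imported.
    """
    imported = []
    for path in paths:
        if shutdown.is_set():
            # The user asked to quit: leave the rest for the next run.
            break
        import_one(path)
        imported.append(path)
    return imported
```

The close-window handler would call `shutdown.set()`; whatever is left in the Watch folder simply gets picked up the next time the application runs.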

Revision: 53f2ca86bc5547888648ab70541999d0c573a981 code cleanup activity (which happened while going through the code for the thread-safety locks inspection)

Revision: 5dcda970514c20518d32d7575b747280af8fa24b https://github.com/jimmejardine/qiqqa-open-source/issues/18 work :: code review part 1, looking for thread safety locks being applied correctly and completely: for example, a few places did not follow best practices by using the discouraged lock(this){...} idiom (https://docs.microsoft.com/en-us/dotnet/csharp/language-reference/keywords/lock-statement)

Revision: 8a1d7660659079939e59be74bf3822ea6311a205 Fix https://github.com/jimmejardine/qiqqa-open-source/issues/17 by processing PDFs in any Qiqqa library in small batches so that Qiqqa is not unresponsive for a loooooooooooooong time when it is re-indexing/upgrading/whatever a large library, e.g. 20K+ PDF files. The key here is to make the 'infrequent background task' produce some result quickly (like a working, yet incomplete, Lucene search index DB!) and then update/augment that result as time goes by. This way, we can recover a search index for larger Qiqqa libraries!
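
A rough sketch of the 'small batches, usable result early' idea, in Python for illustration only: a plain dict stands in for the Lucene index, and `batches` / `index_incrementally` are names invented for this example, not taken from the Qiqqa code base.

```python
from itertools import islice


def batches(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def index_incrementally(documents, index, batch_size=10):
    """Add (doc_id, text) pairs to `index` one small batch at a time.

    Yields the index size after each batch, so the caller regains control
    (and already has a searchable, if incomplete, index) between batches.
    """
    for batch in batches(documents, batch_size):
        for doc_id, text in batch:
            index[doc_id] = text
        yield len(index)
```

Because `index_incrementally` is a generator, a background task can pull one batch per tick with `next(...)` and go back to sleep, instead of grinding through 20K+ documents in one go.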

Revision: 72b8d2577aa02338df13fec585d68a46b199f1c2 dialing up the debug/info logging to help me find the most annoying bugs, first of them: https://github.com/jimmejardine/qiqqa-open-source/issues/10, then https://github.com/jimmejardine/qiqqa-open-source/issues/13

Revision: b3590395d89642835bf725d94b1bf8f6cea480de update existing Syncfusion files from v14 to v17, which helps resolve https://github.com/jimmejardine/qiqqa-open-source/issues/11

Warning: I got those files by copying a Syncfusion install directory into qiqqa::/libs/ and overwriting existing files. v17 has a few more files, but those seem not to be required/used by Qiqqa, as only overwriting what was already there in the Qiqqa install directory seems to deliver a working Qiqqa tool. :phew:

GerHobbelt commented 5 years ago

Closing and decluttering the issue list so it stays workable for me: fixed in https://github.com/GerHobbelt/qiqqa-open-source mainline=master branch, pending #15 / any maintainer rights/actions.