jimmejardine / qiqqa-open-source

The open-sourced version of the award-winning Qiqqa research management tool for Windows
GNU General Public License v3.0

When re-indexing a large library, Qiqqa is unresponsive for a VERY long time (too long to wait: 1+ hours) #17

Closed: GerHobbelt closed this issue 5 years ago

GerHobbelt commented 5 years ago

20K PDF library. Coming from a v79 commercial install, this library suffered badly from #16 in the past. A recompiled Qiqqa (with #11 fixed and #13 partly fixed) now (#14) finally attempts to recreate the Lucene-backed search index, only to end up as 'Not Responding...' while spitting out several MBytes of logfile output carrying a zillion lines like these:

20190802.104554 INFO [Daemon.Maintainable:BackgroundWorkerDaemon.DoMaintenance_Infrequent] Indexing document E6B963888DF9A4CCD5E2CD7647BFE94F692DF1

20190802.104554 INFO [PDFTextExtractor] PDFOCR:297 page(s) to textify and 1254 page(s) to OCR. (1/1551)

GerHobbelt commented 5 years ago

Done as per #33.

See also #20. Do note that this work does not stand alone and is highly related to #18 et al.


Commits:

Revision: d58bd7aed030e17361752ce539373aad68e8f973 revert debug code that was part of commit SHA-1: 89307edfe7d5ba2b6de050de969d2910b147e682 -- some invalid BibTeX was crashing the Lucene indexer (AddDocumentMetadata_BibTex() would b0rk on a NULL Key)

That problem was fixed in that commit at a higher level (in PDFDocument)

Revision: da3f8531f0e0baf14a45c46db199b4160b6cb3bf corrected Folder Watch loop + checks for https://github.com/jimmejardine/qiqqa-open-source/issues/20: the intent here is very similar to the code done previously for https://github.com/jimmejardine/qiqqa-open-source/issues/17; we just want to add a tiny batch of PDF files from the Watch folder, irrespective of the number of files waiting there to be added.

Revision: 7bd3ee663db8483ee5acccd1218e1863415df816 more work regarding https://github.com/jimmejardine/qiqqa-open-source/issues/10 and https://github.com/jimmejardine/qiqqa-open-source/issues/17: when you either import a large number of PDF files at once via the Watch Folder feature or have just reset the Watch Directory before exiting Qiqqa, you would otherwise end up with a long-running process in which many/all files in the Watched Directories are inspected and possibly imported. This is undesirable when the user has decided Qiqqa should terminate (by clicking close-window or using the Alt-F4 keyboard shortcut).
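For readers who want the gist of the two Watch Folder commits above without reading the C# source, here is a minimal Python sketch of the idea: take only a small batch of PDFs per pass and bail out as soon as a shutdown has been requested. It is not Qiqqa's actual code; the function name `pick_watch_folder_batch`, the `max_files` parameter and the use of `threading.Event` as the shutdown signal are all assumptions made for illustration.

```python
import threading
from pathlib import Path


def pick_watch_folder_batch(folder, shutdown, max_files=5):
    """Return at most max_files PDF paths from folder.

    Hypothetical helper: stops early when the shutdown event is set,
    so application exit is never blocked by a large backlog of files.
    """
    picked = []
    for path in sorted(Path(folder).glob("*.pdf")):
        # Check the shutdown flag and the batch limit before every file.
        if shutdown.is_set() or len(picked) >= max_files:
            break
        picked.append(path)
    return picked
```

A caller would invoke this once per background pass, so the backlog shrinks a few files at a time, and a set `shutdown` event makes the next pass return immediately with an empty batch.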

Revision: 8a1d7660659079939e59be74bf3822ea6311a205 Fix https://github.com/jimmejardine/qiqqa-open-source/issues/17 by processing PDFs in any Qiqqa library in small batches so that Qiqqa is not unresponsive for a loooooooooooooong time when it is re-indexing/upgrading/whatever a large library, e.g. 20K+ PDF files. The key here is to make the 'infrequent background task' produce some result quickly (like a working, yet incomplete, Lucene search index DB!) and then update/augment that result as time goes by. This way, we can recover a search index for larger Qiqqa libraries!
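The batching approach described in this commit can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the Qiqqa or Lucene implementation: the class `IncrementalIndexer`, its `do_maintenance` method and the `batch_size` value are hypothetical names, and a plain `set` stands in for the search index.

```python
from collections import deque


class IncrementalIndexer:
    """Toy stand-in for a search index that grows a few documents per pass."""

    def __init__(self, pending_ids, batch_size=10):
        self.pending = deque(pending_ids)  # documents still waiting to be indexed
        self.indexed = set()               # documents that are already searchable
        self.batch_size = batch_size

    def do_maintenance(self):
        """Index at most batch_size documents, then hand control back."""
        done = 0
        while self.pending and done < self.batch_size:
            self.indexed.add(self.pending.popleft())
            done += 1
        return done

    @property
    def is_complete(self):
        return not self.pending


# Each background pass does a bounded amount of work, so a partial
# (but usable) index exists after the very first pass.
indexer = IncrementalIndexer([f"doc{i}" for i in range(25)], batch_size=10)
passes = 0
while not indexer.is_complete:
    indexer.do_maintenance()
    passes += 1
```

The point of the design is that every pass is short and leaves the index in a usable state, so the UI thread never waits on the whole library, only on one small batch.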

Revision: b3590395d89642835bf725d94b1bf8f6cea480de update existing Syncfusion files from v14 to v17, which helps resolve https://github.com/jimmejardine/qiqqa-open-source/issues/11

Warning: I got those files by copying a Syncfusion install directory into qiqqa::/libs/ and overwriting existing files. v17 has a few more files, but those seem not to be required/used by Qiqqa, as only overwriting what was already there in the Qiqqa install directory seems to deliver a working Qiqqa tool. :phew:

GerHobbelt commented 5 years ago

Related: #55

GerHobbelt commented 5 years ago

Closing and decluttering the issue list so it stays workable for me: fixed in https://github.com/GerHobbelt/qiqqa-open-source mainline=master branch, pending #15 / any maintainer rights/actions.