jimmejardine / qiqqa-open-source

The open-sourced version of the award-winning Qiqqa research management tool for Windows
GNU General Public License v3.0

search index stops working and re-indexing doesn't recreate the Lucene search db #16

Closed GerHobbelt closed 5 years ago

GerHobbelt commented 5 years ago

I've had this problem many times over the years with a library of 20K+ documents (using v76-80 (GitHub release)).

GerHobbelt commented 5 years ago

From #13

  1. 16: Qiqqa failed on several occasions with my large PDF collection, causing a permanent and total failure of its search feature, i.e. the Lucene database got nuked/b0rked. All subsequent searches in Qiqqa would quickly return ZERO results.

    • Reindexing via the Qiqqa Tools panel (Tools > Qiqqa Configuration > Troubleshooting > Rebuild Library Search Indices) would have no effect.

    • Manually deleting all the Lucene DB files in base/Guest/index/ would also be to no avail.

    • Reconstructing the library by importing the PDF files in tiny batches via Qiqqa's Directory Watch feature would result in 'semi-random behaviour'. This now turns out to be highly dependent on which PDF files got loaded first: as soon as an offending PDF (to be uploaded later) got included in the library, the Lucene-backed search facility would break down and stop functioning.

    Note: Investigation is still pending, but #11 is at least one suspect. At the time of this writing #11 has been fixed; that was a required first step towards making the Lucene-backed search feature work and (re)generate a working search index once again.
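
For anyone trying to reproduce this: since the breakage depends on which PDF ends up in the library, the offending file can in principle be narrowed down by bisection instead of by trial and error. A minimal sketch in Python (not Qiqqa code, which is C#), assuming a hypothetical `index_works(files)` check that returns False whenever the given set contains a bad file:

```python
from typing import Callable, List, Optional, Sequence


def find_offender(
    files: Sequence[str],
    index_works: Callable[[List[str]], bool],
) -> Optional[str]:
    """Return one file whose presence breaks indexing, or None if all is well.

    Assumption: a set of files fails to index if and only if it contains
    at least one offending file. `index_works` is a hypothetical callback
    standing in for "import these files and check that search still works".
    """
    candidates = list(files)
    if not candidates or index_works(candidates):
        return None
    while len(candidates) > 1:
        half = len(candidates) // 2
        first = candidates[:half]
        # If the first half already fails, an offender is in there;
        # otherwise at least one offender must be in the second half.
        candidates = first if not index_works(first) else candidates[half:]
    return candidates[0]
```

This finds one offender per run in roughly log2(N) checks; if several PDFs are bad, remove the one found and run it again.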

GerHobbelt commented 5 years ago

Done as per #33.

Commits:

Revision: d58bd7aed030e17361752ce539373aad68e8f973 revert debug code that was part of commit SHA-1: 89307edfe7d5ba2b6de050de969d2910b147e682 -- some invalid BibTeX was crashing the Lucene indexer (AddDocumentMetadata_BibTex() would b0rk on a NULL Key)

That problem was fixed in that commit at a higher level (in PDFDocument)

Revision: 89307edfe7d5ba2b6de050de969d2910b147e682 some invalid BibTeX was crashing the Lucene indexer (AddDocumentMetadata_BibTex() would b0rk on a NULL Key)

Sample invalid BibTeX:

@empty = delete?
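
To illustrate the kind of guard involved: a record like the sample above has no `{key, ...}` body, so any code that reads the citation key gets nothing back and has to cope with that. A rough Python sketch of the idea (the actual fix lives in Qiqqa's C# code in PDFDocument; `bibtex_key` and `add_bibtex_metadata` are made-up names for illustration, and the parsing is deliberately naive):

```python
from typing import Dict, Optional


def bibtex_key(raw: Optional[str]) -> Optional[str]:
    """Return the citation key of a BibTeX record, or None if it has none."""
    if not raw:
        return None
    text = raw.strip()
    if not text.startswith("@"):
        return None
    brace = text.find("{")
    if brace == -1:
        # e.g. "@empty = delete?" has no "{...}" body at all
        return None
    key = text[brace + 1:].split(",", 1)[0].strip().rstrip("}").strip()
    return key or None


def add_bibtex_metadata(fields: Dict[str, str], raw: Optional[str]) -> Dict[str, str]:
    """Add the BibTeX key to the index fields, skipping records without one."""
    key = bibtex_key(raw)
    if key is None:
        # Invalid or key-less BibTeX: index the document without this field
        # instead of letting the indexer crash.
        return fields
    fields["bibtex_key"] = key
    return fields
```

The point is only that a missing key is treated as "nothing to index" rather than as a fatal error, so one bad record cannot take the whole index down.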

Revision: 8a1d7660659079939e59be74bf3822ea6311a205 Fix https://github.com/jimmejardine/qiqqa-open-source/issues/17 by processing PDFs in any Qiqqa library in small batches so that Qiqqa is not unresponsive for a loooooooooooooong time when it is re-indexing/upgrading/whatever a large library, e.g. 20K+ PDF files. The key here is to make the 'infrequent background task' produce some result quickly (like a working, yet incomplete, Lucene search index DB!) and then update/augment that result as time goes by. This way, we can recover a search index for larger Qiqqa libraries!
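
The batching idea, sketched in Python for illustration only (this is not the code from the commit, which is C#; `index_one`, `commit` and the batch size of 100 are assumptions standing in for whatever the real indexer does):

```python
from typing import Callable, Iterable, Iterator, List


def batches(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def index_in_batches(
    docs: Iterable[str],
    index_one: Callable[[str], None],
    commit: Callable[[], None],
    size: int = 100,
) -> int:
    """Index documents batch by batch, committing after every batch.

    After each commit the index is usable, though still incomplete,
    so search starts returning results long before the run finishes.
    Returns the number of documents indexed.
    """
    done = 0
    for batch in batches(docs, size):
        for doc in batch:
            index_one(doc)
        commit()  # make this batch searchable right away
        done += len(batch)
    return done
```

With 20K+ documents this trades a few more commits for an index that becomes searchable after the first batch instead of after the whole library.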

GerHobbelt commented 5 years ago

Closing and decluttering the issue list so it stays workable for me: fixed in https://github.com/GerHobbelt/qiqqa-open-source mainline=master branch, pending #15 / any maintainer rights/actions.