Open GerHobbelt opened 3 years ago
Conclusion after tonight: Lucene.NET is out. Too much effort; can't mess with it without breaking things. Surely that will be me and my ways or whatnot; I don't mind. My time is better spent on kicking up a real SOLR instance and kicking its tires, learning to get that one flying with Qiqqa. That's where I want to go with this whole endeavour anyway: opening up search access so folks can do their own creative processing of the PDF content and metadata fed into the engine by Qiqqa. Qiqqa shouldn't be the only channel into your metadata.
Thinking about #261 and other 'complexities' here.
Related: #23
Given #283 and a lot of other issues (haven't taken time to search issue database right now), this is a needed effort.
(TODO: edit this issue to include links to the relevant issue numbers below)
Unfortunately the libraries which keep us limited to 32bit .NET (and thus upper limit of ~ 1-1.5GB RAM usage) are both UI libraries: SORAX for PDF and XULrunner for the "embedded browser" used in the Qiqqa Sniffer (and a few other places in the software).
Another problem library is the old Lucene.NET we're still using.
The UI problems surface when using (very) large libraries, where out-of-memory issues pop up every so often.
Key idea developed during 2020 is to open up Qiqqa and split it up into separate components:
The Qiqqa 'middleware', i.e. the Qiqqa core functionality, which now exists in the application and is tightly bound to the UI code in some places, is to be cleaned up a bit and separated out into a 'local server' component without any UI.
Core -> Qiqqa server
The UI is thus converted into becoming a 'client' of said 'Qiqqa server' and should be as thin as possible: PDF rendering, for example, should be done by this (or another; see next item) server and image data transmitted to the client. (Named pipes or via sockets, as we want this interface to be cross-platform portable)
WPF UI -> WPF client UI ( -> Electron )?
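To make the "thin client talks to a local server over a socket" idea a bit more concrete, here is a minimal sketch of length-prefixed message framing in Python. Everything in it is an assumption for illustration only: the function names, the 4-byte big-endian length header and the JSON request shape are hypothetical, not an existing Qiqqa protocol. The same framing would work over a named pipe as well.

```python
import json
import socket
import struct


def send_msg(sock: socket.socket, payload: bytes) -> None:
    """Send one message: 4-byte big-endian length, then the payload."""
    sock.sendall(struct.pack(">I", len(payload)) + payload)


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes; recv() may return fewer bytes than requested."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection mid-message")
        buf.extend(chunk)
    return bytes(buf)


def recv_msg(sock: socket.socket) -> bytes:
    """Receive one length-prefixed message."""
    (length,) = struct.unpack(">I", recv_exact(sock, 4))
    return recv_exact(sock, length)


if __name__ == "__main__":
    # socketpair() stands in for a real client/server connection here.
    client, server = socket.socketpair()

    # Hypothetical request: the client asks for a rendered page.
    request = {"op": "render_page", "doc": "example.pdf", "page": 1}
    send_msg(client, json.dumps(request).encode("utf-8"))

    received = json.loads(recv_msg(server).decode("utf-8"))
    # A real server would answer with image bytes produced by the renderer;
    # this placeholder only shows that binary payloads travel the same way.
    send_msg(server, b"\x89PNG...placeholder image bytes")
    print(received["op"], len(recv_msg(client)))

    client.close()
    server.close()
```

The length prefix is there because a stream socket has no message boundaries of its own: without it the client cannot tell where one rendered page ends and the next reply begins.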
PDF rendering is to be done in a separate process using MuPDF, which is also the tool used to provide PDF text extraction and PDF OCR (as Tesseract is now being integrated into MuPDF mainline and we follow along closely there)
Qiqqa SORAX + QiqqaOCR -> PDF I/O in separate local server
Search index, now managed through Lucene.NET, can be migrated to SOLR: I've considered staying with Lucene.NET but it lags behind and is not seeing lots of development, while the Java-based Lucene engine + SOLR are mainstream and see lots of use & attention. ElasticSearch instead of SOLR is an option, but given the descriptions found on the Net I believe I'd better go with SOLR. Either way, this means a Qiqqa install would then include the install of the Java runtime, which isn't needed now.
Lucene.NET search index -> SOLR
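For a feel of what "search as a separate service" means for third-party tooling: SOLR answers queries over plain HTTP via its `/select` handler, so any script can talk to it. A minimal sketch, assuming a SOLR instance on its default port 8983; the core name `qiqqa` and the `title` field are hypothetical, since no schema has been defined yet.

```python
from urllib.parse import urlencode


def solr_select_url(base_url: str, core: str, query: str, rows: int = 10) -> str:
    """Build the URL for a query against SOLR's standard /select handler."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base_url.rstrip('/')}/solr/{core}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical core and field names, for illustration only.
    print(solr_select_url("http://localhost:8983", "qiqqa", "title:entropy", rows=5))
```

Fetching that URL (with `urllib.request`, `curl`, a browser, ...) returns the hits as JSON, which is exactly the kind of open channel into the metadata this issue is after.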
Citations are a convoluted bunch right now as they use XULrunner for running some JavaScript, which uses the CSL styles from Zotero et al, plus citation.js IIRC, but then there's the finicky interfacing with the C# .NET code. Back in the day when Qiqqa got started this was a viable and probably even the best solution, as there wasn't much NodeJS, etc. around at the time; or at least that would have required a more convoluted install process. In 2020/2021 AD this should be easier to deliver, and since I want to get away from WPF anyway, Electron/NodeJS is a potential direction.
Citation generation via XULrunner -> NodeJS + citation.js + CSL + ???
Qiqqa Sniffer is an integral part of the UI, but must be addressed separately: quite a bit has been written and researched regarding Google Scholar and the CAPTCHAs, etc. The bottom line here is that there are two paths to a potential solution: either upgrade the embedded browser to something (very) modern and easily upgradable, or turn the Sniffer into a browser add-on. Either way we have to keep up, as Google keeps on restricting access to the Scholar database, which will never get an open access API if I read the entrails right: Google Inc. would see a negative benefit outcome if they did that.
The choices we have here are:
Sniffer -> upgrade it or make a browser addon? (preference: upgrade)
Ditto for the embedded PDF reader, which offers a text and annotations editing overlay: currently SORAX is doing the PDF rendering for us, but that has to be moved to another library (MuPDF has been selected for that one). And then the new one has to be linked to the main app in such a way that we still have a shot at good UI performance while not being stuck in 32bit-only for the .NET code.
Before we go there, there's one thing on my mind that I haven't checked yet:
How much .NET memory is gobbled up by the Lucene search databases in current Qiqqa?
When you have a very large lib (40-50+K PDFs) I notice memory consumption quickly rising to ~ 1GB, then performance degrading more or less (due to frequent GC (Garbage Collect) actions from .NET) and ultimately out-of-memory fatal errors when you're unlucky (#283, for example).
What I must check is: does it help significantly if I move the Lucene/Search Index work out of process? No need to immediately reach for SOLR there, but maybe I can come up with a minimal bit of work to arrive at a similar scenario (search engine as local server == out-of-process), where Qiqqa core app *communicates* with the search engine instead of *incorporating* it...
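A minimal sketch of that "communicate instead of incorporate" boundary, assuming a one-JSON-object-per-line protocol. `TinyIndex`, `serve` and the `add` / `search` operations are all hypothetical stand-ins; the toy index only marks the spot where the real Lucene-backed engine would sit. The point is that the index memory lives on the far side of `serve()`, so in a real setup it would be owned by a separate process rather than by the Qiqqa core app.

```python
import io
import json


class TinyIndex:
    """Toy stand-in for the real search engine; NOT how Lucene works inside."""

    def __init__(self):
        self.postings = {}  # term -> set of document ids

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings.setdefault(term, set()).add(doc_id)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


def serve(index, requests, responses):
    """Answer each JSON request line with exactly one JSON response line.

    `requests` / `responses` are text streams: in an out-of-process setup
    they would be the child's stdin/stdout or a socket's file wrappers.
    """
    for line in requests:
        line = line.strip()
        if not line:
            continue
        req = json.loads(line)
        if req.get("op") == "add":
            index.add(req["id"], req["text"])
            reply = {"ok": True}
        elif req.get("op") == "search":
            reply = {"ok": True, "hits": index.search(req["term"])}
        else:
            reply = {"ok": False, "error": "unknown op"}
        responses.write(json.dumps(reply) + "\n")


if __name__ == "__main__":
    # In-memory streams stand in for the pipe between the two processes.
    requests = io.StringIO(
        '{"op": "add", "id": "doc1", "text": "Entropy and information"}\n'
        '{"op": "search", "term": "entropy"}\n'
    )
    responses = io.StringIO()
    serve(TinyIndex(), requests, responses)
    print(responses.getvalue(), end="")
```

If measuring shows that most of the ~1GB is indeed index-related, a boundary like this would let the core app shed that memory without committing to SOLR straight away, and swapping the backend for SOLR later would not change the caller's side much.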