Migrate Qiqqa to 64 bit architecture to cope with large libraries, etc. (Future Plan)

GerHobbelt commented 3 years ago

Given #283 and a lot of other issues (haven't taken time to search issue database right now), this is a needed effort.

(TODO: edit this issue to include links to the relevant issue numbers below)

Unfortunately the libraries which keep us limited to 32bit .NET (and thus upper limit of ~ 1-1.5GB RAM usage) are both UI libraries: SORAX for PDF and XULrunner for the "embedded browser" used in the Qiqqa Sniffer (and a few other places in the software).

Another problem library is the old Lucene.NET we're still using.

The problems surface when using (very) large libraries: out-of-memory issues pop up every so often.

The key idea, developed during 2020, is to open up Qiqqa and split it up into separate components:

The Qiqqa 'middleware', i.e. the Qiqqa core functionality, which now exists in the application and is tightly bound to the UI code in some places, is to be cleaned up a bit and separated out into a 'local server' component without any UI.

Core -> Qiqqa server

The UI is thus converted into a 'client' of said 'Qiqqa server' and should be as thin as possible: PDF rendering, for example, should be done by this (or another; see the PDF rendering item below) server and the image data transmitted to the client (via named pipes or sockets, as we want this interface to be cross-platform portable).

The WPF.NET client for the UI is later to be migrated to Electron or the like; I'm getting fed up with WPF as it slows me down terribly and is not cross-platform at all.

WPF UI -> WPF client UI ( -> Electron )?
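
The "thin client talks to a local server over sockets" idea could look roughly like the sketch below. This is not Qiqqa code: the 4-byte length-prefixed framing, the `render_page` operation name and the fake image payload are all assumptions made up for illustration, and Python is used only to keep the sketch short.

```python
import json
import socket
import struct

HEADER = struct.Struct("!I")  # 4-byte big-endian payload length (assumed framing)


def send_frame(sock: socket.socket, payload: bytes) -> None:
    """Send one message: a 4-byte length header followed by the payload."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exact(sock: socket.socket, count: int) -> bytes:
    """Read exactly `count` bytes, or raise if the peer closes early."""
    chunks = []
    while count > 0:
        chunk = sock.recv(count)
        if not chunk:
            raise ConnectionError("peer closed the connection mid-frame")
        chunks.append(chunk)
        count -= len(chunk)
    return b"".join(chunks)


def recv_frame(sock: socket.socket) -> bytes:
    """Receive one length-prefixed message."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)


def handle_request(request: dict) -> bytes:
    """Hypothetical server-side handler: returns fake 'image' bytes."""
    if request.get("op") == "render_page":
        # A real server would call the PDF renderer here; this is a stand-in.
        return b"IMG:%d" % request["page"]
    return b"ERR:unknown-op"


def demo() -> bytes:
    """Round-trip one request over a connected socket pair."""
    client, server = socket.socketpair()
    try:
        send_frame(client, json.dumps({"op": "render_page", "page": 3}).encode())
        request = json.loads(recv_frame(server))
        send_frame(server, handle_request(request))
        return recv_frame(client)
    finally:
        client.close()
        server.close()
```

Length-prefixed frames keep binary image data and JSON requests on the same channel without escaping, and the same framing would work unchanged over a named pipe.
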
PDF rendering is to be done in a separate process using MuPDF, which is also the tool used to provide PDF text extraction and PDF OCR (as Tesseract is now being integrated into MuPDF mainline and we follow along closely there)

Qiqqa SORAX + QiqqaOCR -> PDF I/O in separate local server

The search index, now managed through Lucene.NET, can be migrated to SOLR: I've considered staying with Lucene.NET but it lags behind and is not seeing lots of development, while the Java-based Lucene engine + SOLR are mainstream and see lots of use & attention. ElasticSearch instead of SOLR is an option, but given the descriptions found on the Net I believe I'd better go with SOLR. Either way, this means a Qiqqa install would then include the install of the Java runtime, which isn't needed now.

A bit of a surprise (I had misinterpreted the code when I looked at this part a long time ago) is that Qiqqa only uses Lucene to find a document or page hit; the exact location of the hits is then determined by Qiqqa scanning the page once more by itself. This is lucky in the sense that it makes the Lucene migration easier: SOLR and ElasticSearch do not output hit coordinates, they only provide an HTML-style 'highlighting' facility, whose output then has to be parsed and is thus error-prone / finicky to get right all the time. The unlucky bit is that complex search criteria are a No Go Area, as I would then have to redo those searches myself on every page for every 'hit' reported by Lucene/SOLR/ES. Hence it would be useful to look into that highlighting business further anyway: the highlight tags can be customized in SOLR (and ES, I read), so that would be a start, as I want to open up the Qiqqa database to 'open searches': power users may want to do their own thing on the collected texts and metadata.

Lucene.NET search index -> SOLR
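
To make the "parse the highlighting, then rescan the page" step concrete, here is a minimal sketch. It assumes the search engine has been configured to wrap hits in custom markers (in SOLR these are highlighting parameters such as `hl.simple.pre` / `hl.simple.post` or `hl.tag.pre` / `hl.tag.post`, depending on the highlighter; check the SOLR docs). The marker characters and both function names are hypothetical, and the rescan is a plain case-insensitive substring match, not what Qiqqa does today.

```python
import re
from typing import List, Tuple

# Assumed custom markers; any strings unlikely to occur in document text work.
PRE, POST = "\u27e6", "\u27e7"


def highlighted_terms(snippet: str, pre: str = PRE, post: str = POST) -> List[str]:
    """Return the distinct marked-up terms of a highlight snippet, in order."""
    pattern = re.compile(re.escape(pre) + r"(.*?)" + re.escape(post), re.DOTALL)
    seen, terms = set(), []
    for match in pattern.finditer(snippet):
        term = match.group(1)
        key = term.casefold()
        if term and key not in seen:
            seen.add(key)
            terms.append(term)
    return terms


def locate_terms(page_text: str, terms: List[str]) -> List[Tuple[int, int, str]]:
    """Find every case-insensitive occurrence of each term in the page text.

    Returns (start, end, matched_text) tuples sorted by position.
    """
    hits = []
    for term in terms:
        for match in re.finditer(re.escape(term), page_text, re.IGNORECASE):
            hits.append((match.start(), match.end(), match.group(0)))
    return sorted(hits)
```

Character offsets like these would still have to be mapped to on-page coordinates by whatever holds the text layout of the page; that part is left out here.
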
Citations are a convoluted bunch right now as they use XULrunner for running some JavaScript, which uses the CSL styles from Zotero et al, plus citation.js IIRC, and then there's the finicky interfacing with the C# .NET code. Back in the day when Qiqqa got started this was a viable and probably even the best solution, as there wasn't much NodeJS, etc. around at the time; or at least using it would have required a more convoluted install process. In 2020/2021 this should be easier to deliver and since I want to get away from WPF anyway, Electron/NodeJS is a potential direction.

No idea if I can leverage Zotero's or others' work to provide citation insertion/updating/sync with documents being written in various editors, not just MSWord. The big new one for me is Markdown-based text editing; it's ages since I last wrote any (La)TeX and while I still appreciate it, my current needs haven't brought it back into personal demand. Meanwhile MSWord is both a boon and a bane and, regrettably, it's mostly bane. That goes double for the 'libre' versions of Word, at least for me. Others may differ, so some form of good citation process was and will be highly desirable.

That means we might want to look into doing this via node + citation.js + CSLs for formatting and proper communications of the metadata inputs and formatted results from & to the Qiqqa core / UI / MSWord-Writer plugin(s).

Citation generation via XULrunner -> NodeJS + citation.js + CSL + ???

Qiqqa Sniffer is an integral part of the UI, but must be addressed separately: quite a bit has been written and researched regarding Google Scholar and the CAPTCHAs, etc. The bottom line here is that there are two paths to a potential solution, the first being to upgrade the embedded browser to something (very) modern and easily upgradable, as Google keeps on restricting access to the Scholar database, which will never get an open access API if I read the entrails right: Google Inc. has a negative benefit outcome if they'd do that.

The choices we have here are
- upgrading the embedded browser to something very modern (cross-platform portable means MSEdge WebView2 is not a prime candidate) --> CEF, preferably the raw stuff without C# wrappers as they always lag behind and see less dev & user attention: count the github and gitlab stars for a basic measure of importance and you know what I mean.
- ditch the Sniffer UI as it is and follow the citation managers out there which go with the Google Chrome AddOns / Plugins approach. Which is cute, but the power of the Qiqqa Sniffer AFAIAC is the power combo of PDF view, copy&paste from PDF page, and dropping that into the Scholar browser while I can see both PDF and search results (and the PDF linked in those results once I click on that one), so my eyes can easily spot duplicates / differences while I go through the second Sniffer process of collecting (bibTeX) metadata. That means I would need to offer the rendered PDF pages and text page overlay for copy/paste from the Qiqqa middleware server to the browser plugin if we were to take this 'AddOn/Plugin' approach.

Sniffer -> upgrade it or make a browser addon? (preference: upgrade)

Ditto for the embedded PDF reader, which offers a text and annotations editing overlay: currently SORAX is doing the PDF rendering for us, but that has to be moved to another library (MuPDF has been selected for that one). The new one then has to be linked to the main app in such a way that we still have a shot at good UI performance while not being stuck in 32bit-only for the .NET code.

Before we go there, there's one thing on my mind that I haven't checked yet:

How much .NET memory is gobbled up by the Lucene search databases in current Qiqqa?

When you have a very large library (40-50K+ PDFs), I notice memory consumption quickly rising to ~ 1GB, then performance degrading more or less (due to frequent GC (garbage collection) runs in .NET) and ultimately fatal out-of-memory errors when you're unlucky (e.g. #283).

What I must check is: does it help significantly if I move the Lucene/Search Index work out of process? No need to immediately reach for SOLR there, but maybe I can come up with a minimal bit of work to arrive at a similar scenario (search engine as local server == out-of-process), where the Qiqqa core app *communicates* with the search engine instead of *incorporating* it...
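
A toy version of that "search engine as local server" scenario, to show the shape of the experiment: the index lives in a child process, so its memory no longer counts against the main app, which only exchanges small request/reply messages with it. Everything here is an assumption for illustration (the JSON-lines protocol over stdin/stdout, the `SearchClient` name, the trivial in-memory inverted index standing in for Lucene), and it is Python rather than .NET to keep it short.

```python
import json
import subprocess
import sys

# Child process source: a toy in-memory inverted index, standing in for the
# real search engine. It reads one JSON request per line on stdin and writes
# one JSON reply per line on stdout.
WORKER_SOURCE = r'''
import json, sys
index = {}
for line in iter(sys.stdin.readline, ""):
    req = json.loads(line)
    if req["op"] == "add":
        for word in req["text"].lower().split():
            index.setdefault(word, set()).add(req["doc"])
        reply = {"ok": True}
    elif req["op"] == "search":
        reply = {"docs": sorted(index.get(req["term"].lower(), ()))}
    else:
        reply = {"error": "unknown op"}
    sys.stdout.write(json.dumps(reply) + "\n")
    sys.stdout.flush()
'''


class SearchClient:
    """Talks to the search worker; the index memory lives in the child."""

    def __init__(self) -> None:
        self._proc = subprocess.Popen(
            [sys.executable, "-c", WORKER_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )

    def _call(self, request: dict) -> dict:
        # One request line out, one reply line back.
        self._proc.stdin.write(json.dumps(request) + "\n")
        self._proc.stdin.flush()
        return json.loads(self._proc.stdout.readline())

    def add(self, doc: str, text: str) -> None:
        self._call({"op": "add", "doc": doc, "text": text})

    def search(self, term: str) -> list:
        return self._call({"op": "search", "term": term})["docs"]

    def close(self) -> None:
        # Closing stdin ends the worker's read loop, so it exits on its own.
        self._proc.stdin.close()
        self._proc.wait()
        self._proc.stdout.close()
```

Comparing the main process's memory use with and without such a split, on a large library, would answer the question above before committing to SOLR.
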

GerHobbelt commented 3 years ago

Conclusion after tonight: Lucene.NET is out. Too much effort; I can't mess with it without breaking it. Surely that will be me and my ways or whatnot. I don't mind. My time is better spent on kicking up a real SOLR instance and kicking its tires, learning to get that one flying with Qiqqa. That's where I want to go with this whole endeavour anyway: opened-up search access so folks can do their own creative processing of the PDF content and metadata fed into the engine by Qiqqa. Qiqqa shouldn't be the only channel into your metadata.

Thinking about #261 and other 'complexities' here.

GerHobbelt commented 3 years ago

Related: #23

jimmejardine / qiqqa-open-source

Migrate Qiqqa to 64 bit architecture to cope with large libraries, etc. (Future Plan) #289
