jimmejardine / qiqqa-open-source

The open-sourced version of the award-winning Qiqqa research management tool for Windows
GNU General Public License v3.0

Retain Textify, OCR progress, etc., in status bar? #409

Open SimonDedman opened 1 year ago

SimonDedman commented 1 year ago

Notwithstanding that a status line is absolutely useless for reporting these processes, is there any scope to have them live there permanently? Possibly one section per process: textify, OCR, etc. (I can't remember the others)?

This relates to an underlying question: are these processes always running in the background? I'm now running Qiqqa in a Win10 virtual machine and it's working fine, I guess, but textify & OCR are taking forever (since I rebuilt my index to hopefully allow search to work). The status messages flash up seemingly at random, and currently seem to just flash the same numbers (2729 text, 7 OCR) for the last 20+ minutes.

I've given it 4 CPUs - do you think giving it more would help, or probably not? FWIW it doesn't seem to be using all 4 (per Task Manager in Windows, and my CPU usage bars in Linux).

Possibly related.

Also relates to this: since I've got this huge backlog, the UI doesn't update when I Sniffer new PDFs.

And this, which will hopefully resolve itself once (if) all those new papers have been textified, OCRed, manually tagged, cleaned, etc.

How's progress with the project, btw? Seems like it's become ever more complicated and nightmarish with every element you've investigated. Hope for the future? Cheers bud!

SimonDedman commented 1 year ago

Edit: another field: "N pages are searchable, N still to go"

GerHobbelt commented 1 year ago

πŸ‘ Yeah, the status bar in the current UI is, frankly, a mess. Jumping around, illegible unless you know what you're looking for and even then you need 20:20 eyesight and a Top Gun license to spot it. And I only added to that mess during my 'reign'. πŸ‘Ž

I'm loath to update the current UI as I don't see an enduring path forward there. WPF is out, and I have had enough questions about Qiqqa-on-Linux-or-other to know for sure that the current codebase won't ever be able to deliver that (which is yet another driver for the decision to re-do the app). So this ends up on the waiting list, regrettably.


On to your other questions & notes:

If Qiqqa keeps yakking about the same document numbers in the status bar for a long time (say half an hour or more), then you might have run into a possible work queue prioritization bug (I do hope not! But reality has a way of catching me unaware sometimes), which should be (with some effort) discernible from the log files produced by Qiqqa. The 'way out' of that conundrum is re-starting Qiqqa, but I'd be a bit careful about that, as everything that's been flagged "no success, but let's not try that again, shall we?" will be re-issued as a work item, so a Qiqqa restart is more an attempt-to-workaround than a fix.

Anyway, having a look at the log files (and the way they grow, too) can help give you an idea what's happening under the hood. You may want to use text search/filter tools such as Unix grep to filter the logfiles once you think you've found some interesting log lines in there; otherwise you'll be swamped in raw log output while you attempt to monitor/diagnose issues like these. I hope you're comfortable with a command line and tools like grep, because they help; a rough sketch of that kind of filtering follows below.
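For illustration only, here's a minimal C# sketch of that kind of log filtering (the log file name and search term are placeholders, not Qiqqa's actual paths; on a Unix command line, something like `grep -i queue Qiqqa.log` does the same job):

```csharp
// Hypothetical log filter: print only the lines mentioning a search term,
// roughly what `grep -i queue Qiqqa.log` does on Unix.
using System;
using System.IO;
using System.Linq;

class LogFilter
{
    static void Main(string[] args)
    {
        string logPath = args.Length > 0 ? args[0] : "Qiqqa.log"; // placeholder path
        string needle = args.Length > 1 ? args[1] : "queue";      // e.g. work-queue lines

        foreach (string line in File.ReadLines(logPath)
                 .Where(l => l.IndexOf(needle, StringComparison.OrdinalIgnoreCase) >= 0))
        {
            Console.WriteLine(line);
        }
    }
}
```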

TL;DR Mgmt. Summary

Re Qiqqa project & progress: I've had my doubts and my burn-out(s) -- probably a few more before I croak 🤷 -- but the grand total is: I have to go on and make it happen 😉: QiqqaNG, i.e. Next Generation, which fixes all major problems (this is not a joke!) by circumnavigating the entire cesspool: rewriting the whole tool in dev environments that I'm comfortable in. Bye bye, C# + WPF combo.

There's 2 years of (outwardly) glacial progress to date which, extrapolated as a historic KPI, isn't an, äh, strong indicator for early betas arriving soon, so me saying 'Q4 2023' is, äh, optimistic with reasonably high probability, given we only have one FTE (moi) and no (sponsored) dev team. 😅

On the positive side, there's clear intrinsic motivation still abundantly present, despite a plethora of setbacks, so there may be a future for Qiqqa. Bug support? Yes, I try. Fixes? Not so much, as that is considered a dead end in the current state and eats up time and effort budget that's otherwise assigned to QiqqaNG, so only glaring issues which are solvable in reasonable time will make the cut for now.

@\all: Any help appreciated, by the way, as always. 😉




Totally & utterly off-topic: heck, I should deliver a UI with a slider where people can pick the level-of-detail they want to read, so they can dial the detail & detour amount up & down to their liking. 🤦 Yes. The hardest part of writing is cutting. 🙇

[^1]: Qiqqa was horribly slow to start up, show an initial window and hand control to the user when I started the background processes immediately. So I stole the Microsoft Windows 10 solution for being 'quick to boot' (= less slow): postpone anything that's non-critical until the start-up sequence has completed. One of the notable costs in Qiqqa is that it effectively[^2] loads the entire metadata database into memory (RAM), and given the way this was coded, it's definitely not quick when your libraries start to hit 10K+ document numbers, like mine.
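To make that concrete, here's a minimal WPF sketch of the 'postpone anything non-critical' trick, with hypothetical method names standing in for Qiqqa's background daemons (an illustration, not Qiqqa's actual start-up code):

```csharp
// Sketch: show the window first, start the heavy background work only
// after it has rendered. Daemon names are hypothetical stand-ins.
using System;
using System.Threading.Tasks;
using System.Windows;

public class MainWindow : Window
{
    [STAThread]
    static void Main()
    {
        new Application().Run(new MainWindow());
    }

    public MainWindow()
    {
        Title = "Qiqqa start-up sketch";
        Width = 600; Height = 400;

        // 'Loaded' fires once the window has been laid out and rendered,
        // so the user gets a responsive UI before any heavy lifting begins.
        Loaded += (sender, e) => Task.Run(() =>
        {
            StartTextExtractionQueue();
            StartOcrQueue();
            StartSearchIndexer();
        });
    }

    void StartTextExtractionQueue() { /* text extract work queue */ }
    void StartOcrQueue() { /* OCR work queue */ }
    void StartSearchIndexer() { /* Lucene indexing */ }
}
```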

[^2]: Qiqqa expects each library to know about its documents, so it fetches the list of documents in that library. Each document is kept as a .NET object (trivial). What's not trivial is that this is done 'naively', i.e. each document initializes its object by loading its metadata from the database, through deserialization, and the result is kept 'forever'. Which means the library init phase loads the entire library's document metadata collection into RAM, thus effectively caching the database in memory. But it was not engineered to do that; it's a side effect. So loading speed is okay, as long as you don't start hitting memory size limits in (forced-by-other-components) 32-bit .NET, where the .NET memory management starts having to work for its keep, resulting in increased cost. This is also why large Qiqqa libraries have a tendency to crash: that's the 32-bit .NET memory manager giving up the ghost under duress.
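In sketch form, the pattern looks roughly like this (hypothetical class and method names, not Qiqqa's actual code):

```csharp
// Sketch of the 'naive' loading pattern: every document eagerly
// deserializes its metadata during library init, so the whole metadata
// database ends up permanently cached in RAM as a side effect.
using System.Collections.Generic;

interface IMetadataStore
{
    IEnumerable<string> ListFingerprints();
    Dictionary<string, string> LoadMetadata(string fingerprint); // one DB read + deserialize
}

class PdfDocument
{
    public string Fingerprint { get; }
    public Dictionary<string, string> Metadata { get; } // kept 'forever'

    public PdfDocument(string fingerprint, IMetadataStore store)
    {
        Fingerprint = fingerprint;
        Metadata = store.LoadMetadata(fingerprint); // eager: loaded at init
    }
}

class Library
{
    public List<PdfDocument> Documents { get; } = new List<PdfDocument>();

    public Library(IMetadataStore store)
    {
        // N documents => N metadata loads, all retained: fine at 1K docs,
        // memory pressure (and 32-bit .NET crashes) at 10K+ docs.
        foreach (var fp in store.ListFingerprints())
            Documents.Add(new PdfDocument(fp, store));
    }
}
```

The accidental cache is the `Metadata` property being retained on every document object; a lazy-loading or size-bounded design would avoid pinning the whole database in RAM.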

[^3]: When you 'nuke' the Lucene search database (/index/*.*), Qiqqa will have to re-index, i.e. load all extracted PDF text contents and feed it to the built-in Lucene.NET again. This process should be smart enough to use the cached OCR data: Qiqqa keeps a global document content cache in the /ocr/ directory, where it stores the extracted text content for each and every PDF it processed, 20 pages per file iff you're lucky. If you're less lucky, the cache is per-page, which means Qiqqa had to perform OCR on that page to get that data. The watch-word here is should: when you (also) nuke the /ocr/ directory you'll be sure to have to auto-OCR all documents, but if you haven't, then Qiqqa should be smart enough to discover that the OCR work has been done before and load the existing cache files from /ocr/. Sometimes that logic fails due to circumstances, such as QiqqaOCR.exe being unable to read the PDF file, to extract text for every page of the PDF, or to produce enough text for a page to pass the internal sanity heuristics. Qiqqa is not smart about that and simply re-issues every failed PDF page extract (plus accompanying OCR request) for every PDF and PDF page that previously failed. "Previously" as in: the last time you ran Qiqqa. Thus this re-issuing is only done once per Qiqqa session, but the bad news is that it is re-triggered every time you start the Qiqqa application again. Plus there's possibly still some buggery in there when those text extract and OCR processes don't deliver as expected (somewhere in the range of 5-10 legible words per page; I'd have to check the source code to verify this). Part of what you're observing is Qiqqa being stubborn: "Oh! Let's get another 20 pages of PDF text extracted!" --> QiqqaOCR. Fail? --> for page = +1 to +20 do: "Oh! Let's get the page content, shall we?" --> QiqqaOCR. Fail? --> "Oh dear! Let's do OCR on this page this time!" --> QiqqaOCR. Fail? --> "Eh, are you sure? Let's do that OCR thing again, in a bit of another way, perhaps?" --> QiqqaOCR. Fail? --> "Bugger it! Ditch the bloody page! Make it walk the plank!" (until the next time you start Qiqqa, where its background processes will again discover: "Oi! That page hasn't been processed yet, because I don't see no sensible text! Do the extract process, will you?" -- which led to me 'rephrasing' that last OCR attempt's failure as: "Bugger it! I don't want to see the bloody page again! Produce a couple of absolutely non-sensical QiqqaWhatever 'words', declare them sane content and call that a success! Now file it in the search database already so you won't bother me with ill-fated re-issued attempts the next time I visit!"). Obviously, that could have been done a little 'smarter'. At the cost of additional development time, so here we are. 😒
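Condensed into C#, that stubborn cascade looks roughly like this (all names are hypothetical; the real logic is spread across Qiqqa and the QiqqaOCR helper executable):

```csharp
// Condensed sketch of the retry cascade. Each failure falls through to a
// more expensive attempt; the final give-up only holds for this session.
using System;

enum PageTextResult { Ok, Failed }

class OcrCascadeSketch
{
    static void Main()
    {
        new OcrCascadeSketch().ProcessDocument("example.pdf", pageCount: 20);
    }

    void ProcessDocument(string pdfPath, int pageCount)
    {
        // 1. Cheap first attempt: a grouped text extract (20 pages per file).
        if (TryGroupTextExtract(pdfPath) == PageTextResult.Ok)
            return;

        for (int page = 1; page <= pageCount; page++)
        {
            // 2. Per-page text extract.
            if (TryPageTextExtract(pdfPath, page) == PageTextResult.Ok) continue;
            // 3. OCR the rendered page image.
            if (TryOcr(pdfPath, page) == PageTextResult.Ok) continue;
            // 4. OCR once more, with different settings.
            if (TryOcrAlternative(pdfPath, page) == PageTextResult.Ok) continue;
            // 5. Give up -- but only for this session. Because no sane text
            //    was cached, the page is re-discovered and re-issued the
            //    next time Qiqqa starts.
            Console.WriteLine($"page {page}: walked the plank (until next start-up)");
        }
    }

    // Stubs standing in for QiqqaOCR invocations; they always fail here so
    // the whole cascade is exercised.
    PageTextResult TryGroupTextExtract(string path) => PageTextResult.Failed;
    PageTextResult TryPageTextExtract(string path, int page) => PageTextResult.Failed;
    PageTextResult TryOcr(string path, int page) => PageTextResult.Failed;
    PageTextResult TryOcrAlternative(string path, int page) => PageTextResult.Failed;
}
```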

[^4]: Potential ease, as right now Qiqqa is very much uncooperative with me, having a 70K+ document library spanning about 30 years of work, consulting & related interests. For me, Qiqqa fails to produce a search index (32-bit .NET crashes, joining hands with antique Lucene.NET in failing to survive a re-index due to severe memory constraints and consumption by the components involved), so that is/was a strong motivation to investigate 'external' FTS (Full Text Search) solutions, such as Apache SOLR. While SOLR is beautiful, I cannot, in good conscience, "sell" it as part of Qiqqa, as it only "works" (like Elastic, etc., etc. -- nothing SOLR-specific here!) when you have or obtain expert knowledge about how to "fine-tune" these beasts: I don't need sharding on my hardware and with my library, but I have found that getting the search results I hope to get is quite costly in time spent on "tuning" the SOLR/Lucene rig. As this is a generic problem, or at least I consider it a non-specific issue, logic tells me I won't get better results by picking something else. The counter-argument is that with some other systems I'm more able & willing to spend the time: I'm not a big fan of (coding in) Java, for one. And while SOLR is easy in some aspects, it's still pretty hard to diagnose the entire indexing process, for example. Hence current effort is spent on finding out the tough spots in Sqlite3's FTS5 and (waiting on the shelf for me) manticore.search, both of which are fully Open Source.[^5] Those I can much more easily bundle with Qiqqa as a product for non-tech-savvy customers, hence for most folks using Qiqqa. (Heck, when I am using Qiqqa, I don't want to be a technician/administrator about it; I just want to use the tool, because my focus is elsewhere then. So Qiqqa is only viable if it's easy to set up on an arbitrary personal machine, at least. While it got close, SOLR didn't make that particular cut, not for me at least. 😒) Hence the way forward, for me at least, is to produce a new working backend for Qiqqa which provides the current search/indexing abilities at scale: 70K+ documents and counting; the test corpus I use to test my mupdf/tesseract/etc. work is 300K+ docs / 12TB. Ridiculous sizes perhaps, but those have uncovered some quite mesmerizing failures in both old (current Qiqqa) and new (GerHobbelt/mupdf) code, which turned out to be pretty severe -- a lot of 'fail' goes silently unnoticed, but sometimes you are 'lucky' to get a hard failure, a noticeable crash, which (if lucky!) is reproducible too. One of the bugs I found in the new code (and waiting in the old QiqqaOCR code, if & when I fix several issues there first) was a real nasty race condition, which only fired semi-randomly for some document combos when maxing out the CPU, and turned out to be a waiting disaster for all PDFs with a certain (quite common) image type embedded in their pages. That one took about two weeks to diagnose sufficiently for a potential fix to be tested. Qiqqa is simple stuff, until you lift the carpet and look.
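For the curious, here is a minimal sketch of what the Sqlite3 FTS5 route looks like from .NET, using the Microsoft.Data.Sqlite package (illustration only: the table layout and contents are made up and are not Qiqqa's actual schema):

```csharp
// Minimal FTS5 demo: build a tiny full-text index in an in-memory SQLite
// database and run a ranked MATCH query against it.
using System;
using Microsoft.Data.Sqlite;

class Fts5Demo
{
    static void Main()
    {
        using var conn = new SqliteConnection("Data Source=:memory:");
        conn.Open();

        var setup = conn.CreateCommand();
        setup.CommandText =
            "CREATE VIRTUAL TABLE docs USING fts5(fingerprint, content);" +
            "INSERT INTO docs VALUES ('doc1', 'shark movement ecology and telemetry');" +
            "INSERT INTO docs VALUES ('doc2', 'PDF text extraction with OCR');";
        setup.ExecuteNonQuery();

        var query = conn.CreateCommand();
        query.CommandText =
            "SELECT fingerprint FROM docs WHERE docs MATCH $q ORDER BY rank;";
        query.Parameters.AddWithValue("$q", "ocr");

        using var reader = query.ExecuteReader();
        while (reader.Read())
            Console.WriteLine(reader.GetString(0)); // prints: doc2
    }
}
```

The attraction over SOLR/Elastic is exactly what the sketch shows: the full-text index is just a table in an embedded database, so there is nothing to install, shard, or tune before a non-technical user gets sensible search results.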

[^5]: One of the lessons I had already learned long ago in my professional career, but unfortunately had to re-encounter with Qiqqa, is that commercial software libraries are an effin' curse on your application's viability and lifetime/life cycle. SORAX, which was used for PDF viewing and some PDF processing, is long defunct and gone. Large chunks of the Qiqqa GUI are done with the Infragistics 'nice to have' SDK, which still exists, but I am utterly unwilling to cough up serious dough for something that I won't be using commercially or repeatedly: the choice to use WPF (instead of MFC or WinForms at the time Qiqqa was created) is another nail in my coffin, and buying a developer Infragistics license would only be me telling myself my new phase in life is BDSM-happy and digging some serious savaging. Njet, tovaritch. QiqqaOCR uses further components (an old, tweaked mupdf, but no full source; an old Lucene.NET with a similar 'the source code is there, but which one is it?' tombola) which were not published full source, and subsequent research didn't produce a built-from-amalgamated-source working equivalent either. The lesson: either you provide a full source tree and some means and directives to recompile/rebuild the bugger, or you're out. Too much hassle debugging/diagnosing deep system failures that way. That nasty bug I found took 2 weeks with source code and supplements available; it would have ranked 'unsolvable in any reasonable timeframe' if I hadn't had full (editing!) access to the entire source code while I was hunting down that issue. So: commercial software SDKs are only okay when I get paid by the hour for them. Otherwise, it's Open Source only. -- 🤔 Pity ABBYY isn't Open Source: I've seen some very decent OCR work come out of that one for files that tesseract and Nuance barfed on. Alas. 🤷

[^6]: a.k.a. academia 😉

SimonDedman commented 1 year ago

Thanks for this update, mate; sorry for my slow reply.

Huge respect for the work you're putting in here, feels like you're building something which will be a massive asset for the whole community for the rest of our working lives.

Thanks, as always.