SimonDedman commented 1 year ago

Notwithstanding a status line is absolutely useless for reporting these processes, is there any scope to have them live there permanently? Possibly one section per process - textify, OCR, etc (I can't remember the others)?

This relates to an underlying question: are these processes always running in the background? I'm now running Qiqqa in a Win10 virtual machine and it's working fine I guess, but textify & OCR is taking forever (since I rebuilt my index to hopefully allow search to work). The status messages flash up seemingly at random, and currently seem to just flash the same numbers (2729 text, 7 OCR, for the last 20+ mins)

I've given it 4 CPUs - do you think giving it more would help, or probably not? FWIW it doesn't seem to be using all of the 4 (per task manager in windows, and my CPU usage bars in linux).

Possibly related.

Also relates to this, since I've got this huge backlog the UI doesn't update when I Sniffer new pdfs.

And this, which will hopefully resolve itself once (if) all those new papers have been textified, OCRs, manually tagged, cleaned, etc.

How's progress with the project btw? Seems like it's become ever more complicated and nightmarish with every element you've investigated. Hope for the future? Cheers bud!

SimonDedman commented 1 year ago

Edit: another field: "N pages are searchable, N still to go"

GerHobbelt commented 1 year ago

👍 Yeah, the status bar in the current UI is, frankly, a mess. Jumping around, illegible unless you know what you're looking for and even then you need 20:20 eyesight and a Top Gun license to spot it. And I only added to that mess during my 'reign'. 👎

I'm loath to update the current UI as I don't see an enduring path forward there. WPF is out and I have had enough questions about Qiqqa-on-Linux-or-other that I know for sure that the current codebase won't ever be able to deliver that (which is yet another driver for the decision to re-do the app). So this ends up on the waiting list, regrettably.

On to your other questions & notes:

background OCR, text extraction, etc.: yes, those are 'always on' background processes. There's a lot of technical detail (about OS interaction, WPF, the way Qiqqa is coded, etc.etc.) which has resulted in a back&forth of me and CPU loading: it turned out that giving the background processes more 'power' (CPU load to consume), would make the whole WPF-driven GUI horribly unresponsive on almost all machines I tested Qiqqa on (only extreme hardware and enterprise level Windows didn't seem to suffer so much), so Qiqqa does a few things to keep those background processes a little 'throttled'. So, yes, the background processes are always running (after a multi-second delay at Qiqqa start)[^1]
re giving it more virtual CPUs: that only helps marginally above about 4 of them. It shouldn't be but that's the grim reality. I have here a rather fresh AMD rig with 16 cores and all its horses doped up on crystal meth (after my fancy-but-little-older dev laptop melted its motherboard last year 🔥🚒 ), willing to go, and the 'best' I've seen so far is about 30-40% load. Meanwhile one would expect all cores maxed out when extracting/OCR-ing, but there's some very hard-to-pin-down interplay between the GUI and the back-end application instances (QiqqaOCR.exe) starting up and running a job from the work queue.

I am still unsure, but it looks like the MSWindows scheduler and Qiqqa's 'use' of it by simply kicking off a new QiqqaOCR.exe executable for every job that's activated results in some interplay which prevents the system for solidly maxing out the available cores -- I suspect the start-exe action is somehow interlocked with the OS level/WPF UI and thus has to 'go through' core №. 1 before the work can commence, because that's what I see -- and, yes, technically it's incorrect what I'm saying here, but this is close enough to what is observed and hopefully comprehensible. Another bit is that some of the processes (text extraction without OCR) are more I/O-bound, rather than CPU-bound, so it all goes faster ( = less slow) when done on dedicated Flash storage: previously I ran this on an (SATA) SSD, now it's been tested on NVME and it feels like at least a factor 2 (I haven't timed it). But that is getting pricey in the hardware department when it's large/huge libraries you work with. Meanwhile, that's another reason why you won't see all your cores maxing out when Qiqqa runs its background processes, so 4-8 cores is nice, but more is very little gain right now.
re backlog and sniffing: I would expect it to be slower than usual when a library is re-indexed/re-OCR-ed[^3], but it should update after a little while.

Ow! 🤦 I need to mention that Qiqqa does not have 'refresh screen' code a la Windows Explorer & Browser (F5 key), so a way to trick the bugger into 'updating' the screen, e.g. the library document list, is to force a 'redraw event', either by resizing the window or by closing the tab and re-opening it -- the latter will have to re-create the overview list anew, while the former can just 'repaint what we have'. One way of doing this is toggling the Qiqqa window between 'maximized' and 'normal' (i.e. the usual-partly-covering-the-desktop) mode by clicking the 'maximize' button at top-right, next to [X] close. Be aware that when you do this, you're gaming the (Windows + WPF) system, which attempts to keep those 'redraw events' to a minimum. 😉

If Qiqqa keeps yakking about the document numbers in the status bar for a long time (say half an hour or more), then you might have run into a possible work queue prioritization bug (I do hope not! But reality has a way of catching me unaware some times), which should be (with some effort) discernable from the log files produced by Qiqqa. The 'way out' of that conundrum is re-starting Qiqqa, but I'ld be a bit careful about that as everything that's been flagged "no success, but let's not try that again, shall we?" will be re-issued as a work item, so a Qiqqa restart is more an attempt-to-workaround than a fix. Anyway, having a look at the log files (and the way they grow, too) can help giving you an idea what's happening under the hood. (You may want to use text search/filter tools such as Unix grep to filter the logfiles once you think you've found some interesting log lines in there, otherwise you'll be swamped in raw log output while you attempt to monitor/diagnose issues like these. I hope you're comfortable with a command line and tools like grep, because they help.)

re Qiqqa as a project: as I mentioned in the other reply: 2 years of glacial speed. Not a good combo: perfectionist and a car wreck of an dear old-timer. If you want to see / glean some movement, than most is happening over at the GerHobbelt/mupdf repo, which is where I'm working on the back-end side of Qiqqa: all that stuff is meant to replace QiqqaOCR.exe plus the BibTex and database work we do in Qiqqa: after a long time of "should I do this? am I crazy? can't I do it in what is already?" back-pedaling, I've come to the conclusion that there's no way for me, painful or not, but to rewrite the damn thing. And trying to be careful about cross-platform this time, i.e. Linux.

Mac is a platform that someone else will have to support (I'll try my best to make it easy) as I don't intend on spending another bundle of cash on hardware which I otherwise don't employ or enjoy.

The 'idea' here is to have a back-end that's scriptable so I can finally ask & use Qiqqa using small scripts or other web-like means to grab and inject data into it. a.k.a.: "opening up the backend and Qiqqa data+metadata storage". In that "idea", the Qiqqa GUI would be an (elaborate!) front-end, a bit like a fancy website, where you can 'browse' your data like we've all gotten used to with Qiqqa today. That way, I can use my D3 experience and other skills to get a working 'Qiqqa Expedition' again, etc. plus hopefully other developers willing and able to chime in (current Expedition graph view has some bad graphing bugs; by making Qiqqa more "open" (data/library access wise) I can do something to the PDF extract+OCR process, which currently, for me at least, is totally hosed, as current Qiqqa has enough text extract and OCR bugs + legacy issues to make about 20% fail (current Qiqqa: antique tesseract 3.0 only and no way to tweak/improve page scan images before Qiqqa attempts OCR -- which, by the way, explains why Qiqqa fails so badly on so many documents out there: take anything that's sufficiently old or badly scanned (shadows, faxed, etc.) and you are entering a world of text extraction hurt 🤕 Another thought that drives the 'openness' is that I might be easier able then to have Qiqqa cooperate with Zotero, which is then my answer to the question: how do I do my citations in my essay? -- as I'm not in the publish-or-perish business[^6] myself. 😉

For what it's worth: I've mostly finished other not-so-smart choices & projects (perfectionist + construction work, rebuilding a house for sale. Not a smart combo.), though one will be lingering for a long time (my own house), I've wondered about my motivation for doing all this, and the unfortunate end conclusion is that I'll have to continue with Qiqqa as there simply is nothing viable for me out there, even commercially. ABBYY, dtSearch, etc. are all nice and dandy, but fail to deliver the document data + metadata "googliness" offered by Qiqqa. And not just the "googliness" of searching but also the potential ease[^4] of data gathering, collecting and managing at an individual's or small team scope. All currently available document management systems I've encountered that work invariably require either a professional support team (not on a daily basis, but you are really hurting when there's no support on call either, so you'll notice bad things happening when a week has passed, usually.) or are custom jobs, where the team/company has had ample time to grow with and into the solution. Qiqqa is not ideal, but at least for me hits the sweet spot of offered features. If it wasn't for the bit rot... 😭 (e.g. b0rked sniffer thanks to the tug-of-war with Google Scholar (they don't want us around at all) vs. very old (obsoleted) web browser component, resulting in Captcha issues and what-not.)

TL;DR Mgmt. Summary

Re Qiqqa project & progress: I've had my doubts; my burn-out(s) -- probably a few more before I croak 🤷 -- but the grand total sum is: I have to go on an make it happen: 😉 Qiqqa^NG, i.e. Next Generation, which fixes all major problems (this is not a joke!) by circumnavigating the entire cesspool by rewriting the whole tool in dev environments that I'm comfortable in. Bye bye, the C# + WPF combo.

There's 2 years of (outwardly) glacial progress to date, which, as a historic KPI, extrapolating from it, isn't a, äh, strong indicator for early beta's arriving soon, so me saying 'Q4 2023' is, äh, optimistic with reasonably high probability, given we only have one FTE (moi), no (sponsored) dev team. 😅

On the positive side, there's clear intrinsic motivation still abundantly present, despite a plethora of setbacks, so there may be a future for Qiqqa. Bug support? yes, I try. Fixes? not so much, as that is considered a dead end in the current state and eating up time and effort budget that's otherwise assigned Qiqqa^NG, so only glaring issues which are solvable in reasonable time will make the cut for now.

@\all: Any help appreciated, by the way, as always. 😉

Totally & utterly off-topic: heck, I should deliver a UI with a slider where people can pick the level-of-detail they want to read, so they can dial the detail & detour amount up & down to their liking. 🤦 Yes. The hardest part of writing is cutting. 🙇

[^1]: Qiqqa was horribly slow starting up and showing some initial window and hand control to the user, when I would start the background processes immediately. So I stole the Microsoft Windows 10 solution for being 'quick to boot' ( = less slow): postpone anything that's non-critical until the start-up sequence has completed. One of the notable costs in Qiqqa is it's effectively[^2] loading the entire metadata database into memory (RAM) and given the way this was coded it's definitely not quick when your libraries start to hit 10K+ document numbers, like mine.

[^2]: Qiqqa expects each library to know about its documents. So it fetches the list of documents in that library. Each document is kept as a .NET object (trivial). What's not trivial, is that it is done 'naively', i.e. each document initializes its object by loading its metadata from the database. It does this through serialization and the result is kept 'forever'. Which means the library init phase will load the entire library documents' metadata collection into RAM, thus effectively caching the database in memory. But it was not engineered to do that, it's a side effect, so speed of loading is okay, as long as you don't start hitting memory size issues in (forced-by-other-components) 32-bit .NET, where the .NET memory management starts having to work for its keep, resulting in increased cost. This is also why large Qiqqa libraries have a tendency to crash: that's the 32-bit .NET memory manager giving up the ghost under duress.

[^3]: When you 'nuke' the Lucene search database (/index/*.*), Qiqqa will have to re-index, i.e. load all extracted PDF text contents and feed it to built-in Lucene.NET again. This process should be smart enough to use the cached OCR data: Qiqqa keeps a global document content cache in the /ocr/ directory, where it stores the extracted text content for each and every PDF it processed, 20 pages per file iff you're lucky. If you're less lucky, the cache is per-page, which means Qiqqa had to perform OCR on that page to get that data. The watch-words here are should: when you (also) nuke the /ocr/ directory you'll be sure to have to auto-OCR all documents, but if you haven't, then Qiqqa should be smart enough to discover that the OCR work has been done before and load the existing cache files from /ocr/. Sometimes that logic fails due to circumstances. Circumstances such as QiqqaOCR.exe being unable to read the PDF file, extract text for every page of the PDF, or failure to produce enough text for a page to pass the internal sanity heuristics: Qiqqa is not smart about that and simply re-issues every failed PDF page extract (plus accompanying OCR request) for every PDF and PDF page that previously failed. "Previously" as in: the last time you ran Qiqqa. Thus this re-issuing is only done once per Qiqqa session, but the bad news is this re-issuing is re-triggered every time you start the Qiqqa application again. Plus there's possibly still some buggery in there when those text extract and OCR processes don't deliver as expected (somewhere in the range of 5-10 legible words per page; I'ld have to check the source code to verify this). Part of what you're observing is Qiqqa being stubborn: "Oh! let's get another 20 pages of PDF text extracted!" -> QiqqaOCR. Fail? --> for page=+1 to +20 do: "Oh! let get the page content, shall we?" --> QiqqaOCR. Fail? --> "Oh dear! Let's do OCR this page this time!" --> QiqqaOCR. Fail? --> "Eh, are you sure? Let's do that OCR thing again, in a bit another way, perhaps?" --> QiqqaOCR. Fail? --> "Bugger it! Ditch the bloody page! Make it walk the plank!" (until the next time you start Qiqqa, where its background processes will again discover: "Oi! That page hasn't been processed yet, because I don't see no sensible text! Do the extract process, will you?" -- which led to me 'rephrasing' that last OCR attempt's failure as: "Bugger it! I don't want to see the bloody page again! Produce a couple of absolutely non-sensical QiqqaWhatever "words", declare them sane content and call that a success! Now file it in the search database already so you won't bother me with ill-fated re-issued attempts the next time I visit!" Obviously, that could have been done a little 'smarter'. At the cost of additional development time, so here we are. 😢

[^4]: potential ease as right now, Qiqqa is very much uncooperative with me, having a 70K+ document library, spanning about 30 years of work, consulting & related interests. For me, Qiqqa fails to produce a search index (32bit .NET crashes, joining hands with antique Lucene.NET in failing to survive a re-index due to severe memory constraints and consumption by the components involved), so that is/was a strong motivation to investigate 'external' FTS (full Text Search) solutions, such as Apache SOLR. While SOLR is beautiful, I cannot, in good conscience, "sell" that one as part of Qiqqa as it only "works" (like Elastic, etc.etc. -- nothing SOLR-specific here!) when you have or obtain expert knowledge about how to "fine-tune" these beasts: I don't need sharding on my hardware and with my library, but I have found that getting the search results I hope to get is quite costly in time spent on "tuning" the SOLR/Lucene rig. As this is a generic problem, or at least I consider it a non-specific issue, then logic tells me I won't get better results by picking something else. The counter there is that with some other systems I'm more able & willing to spend the time: I'm not a big fan of (coding in) Java, for one. While SOLR is easy in some aspects, it's still pretty hard to diagnose the entire indexing process, f.e.. Hence current effort is spent on finding out the tough spots with Sqlite3's FTS5 and (waiting on the shelf for me) manticore.search, both of which are fully Open Source.[^5] Those I can much more easily bundle with Qiqqa as a product for non-tech-savvy customers, hence to most folks using Qiqqa. (Heck, when I am using Qiqqa, I don't want to be a technician/administrator about it, I just want to use the tool because my focus is elsewhere then. So Qiqqa is only viable if it's easy to set up on an arbitrary personal machine, at least. While it got close, SOLR didn't make that particular cut, not for me at least. 😢 Hence the way forward for me, at least, is to produce a new working backend for Qiqqa, which provides the current search/indexing abilities at scale (70K+ documents and counting; the test corpus I use to test my mupdf/tesseract/etc. work is 300K+ docs / 12TB. Rediculous sizes perhaps, but those have uncovered some quite mesmerizing failures in both old (current Qiqqa) and new (GerHobbelt/mupdf) code, which turned out to be pretty severe -- a lot of 'fail' goes silently unnoticed, but sometimes you are 'lucky' to get a hard failure, a noticable crash, which (if lucky!) is reproducible too. One of the bugs I found in the new code (and waiting in the old QiqqaOCR code, if & when I fix several issues there first) was a real nasty race condition, which only fired semi-randomly for some document combo's when maxing out the CPU, and turned out to be a waiting disaster for all PDFs with a certain (quite common) image type embedded in their pages. That one took about two weeks to diagnose sufficiently so a potential fix could be tested. Qiqqa is simple stuff, until you lift the carpet and look. )

[^5]: one of the lessons that I had already learned a long time ago in my professional career, but unfortunately had to re-encounter again with Qiqqa, is that any commercial software libraries are an effin' curse on your application viability and lifetime/life cycle. SORAX which was used for PDF viewing and some PDF processing, is long defunct and gone. Large chunks of the Qiqqa GUI are done with Infragistics 'nice to have' SDK, which still exists, but I am utterly unwilling to cough up serious dough for something that I won't be using commercially or repeatedly: the choice to use WPF (instead of MFC or WinForms at the time Qiqqa was created) is another nail in my coffin and buying a developer Infragistics license would only be me telling myself my new phase in life is BDSM-happy and digging some serious savaging. Njet, tovaritch. QiqqaOCR uses further components (old tweaked mupdf, but no full source. old Lucene.NET with a similar 'source code is there, but which one is it?' Tombola) which were not published full source, and subsequent research didn't produce a built-from-amalgamated-source working equivalent either. The lesson: either you provide a full source tree and some means and directives to recompile/rebuild the bugger, or you're out. Too much hassle debugging/diagnosing deep system failures that way. That nasty bug I found took 2 weeks with a source code and supplements available. It would have ranked 'unsolvable in any reasonable timeframe' if I hadn't had full (editing!) access to the entire sourcecode while I was hunting down that issue. So: commercial software SDKs are only okay when I get paid by the hour for it. Otherwise, it's Open Source only. -- 🤔 pitty ABBYY isn't Open Source: I've seen some very decent OCR work come out of that one for files that tesseract and Nuance barfed on. Alas. 🤷

[^6]: a.k.a. academia 😉

SimonDedman commented 1 year ago

Thanks for this update mate, sorry for my slow reply.

Huge respect for the work you're putting in here, feels like you're building something which will be a massive asset for the whole community for the rest of our working lives.

Thanks, as always.

jimmejardine / qiqqa-open-source

Retain Textify, OCR progress, etc, in status bar? #409

TL;DR Mgmt. Summary