Open raindropsfromsky opened 4 years ago
If you would like to have this file for experimentation, please let me know.
Yes please.
Various ways to send/submit the PDF (in order of my personal preference today):
I'll have a look as time allows.
Here it is: SC judgement dtd 17-03-2020, on EIA for PRR, Bangalore.pdf
FWIW I am using Qiqqa as a guest (I haven't logged in, or used a special library).
Open Source Qiqqa doesn't have a user account of any kind. The old Commercial Qiqqa used a user account to
And thanks for the PDF, by the way.
I had a look at what exactly happened. It has been enlightening: I discovered I was working with a couple of internal assumptions that are clearly based on developer experience rather than user experience, and those assumptions colored my expectations of how Qiqqa behaves for a user.
When Qiqqa imports the PDF into the library, a few things happen under the hood:
Both of these 'trigger' a request to fetch the document text, i.e. the OCR text.
Qiqqa "OCR text" is the collection of words, each with its location rectangle coordinates, extracted from the PDF by the OCR background process. Think of it as each word plus its precise position on the page, stored in a Qiqqa proprietary OCR file format.
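To make the "each word plus its position" idea concrete, here is a minimal Python sketch. It is not Qiqqa's actual file format or code: the `WordBox` type, its field names, the normalized page coordinates and the `words_in_selection` helper are all assumptions made up for illustration. It only shows why a text selection tool needs this data before it can do anything.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordBox:
    """One word and its rectangle on a page (hypothetical layout, not Qiqqa's format)."""
    text: str
    page: int
    left: float
    top: float
    width: float
    height: float


def words_in_selection(words, page, left, top, right, bottom):
    """Return the words on `page` whose rectangles overlap the selection rectangle."""
    hits = []
    for w in words:
        if w.page != page:
            continue
        overlaps_x = w.left < right and w.left + w.width > left
        overlaps_y = w.top < bottom and w.top + w.height > top
        if overlaps_x and overlaps_y:
            hits.append(w.text)
    return hits


# Made-up sample data in normalized page coordinates (0.0 to 1.0).
page_words = [
    WordBox("Supreme", 1, 0.10, 0.10, 0.15, 0.02),
    WordBox("Court", 1, 0.27, 0.10, 0.10, 0.02),
    WordBox("judgement", 1, 0.10, 0.14, 0.18, 0.02),
]

# Dragging a selection over the first text line only.
selected = words_in_selection(page_words, 1, 0.0, 0.09, 1.0, 0.13)
```

Without the word list there is nothing to hit-test against, which is why selecting and marking text has to wait for the text extraction.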
That's where some confusion can occur: Qiqqa has two methods to extract text from a PDF. It does not matter which of these methods has produced that text content: either way it's stored in the "Qiqqa OCR text cache".
The primary method is direct text extraction: using the mupdf tool, Qiqqa can get the text (plus coordinates) for any PDF which has a text layer embedded.
Your sample PDF is entirely processed by this first method, all 69 pages of it.
When the primary method fails to deliver a text for a given page, that page is then re-queued to have it OCR-ed using a Tesseract-based subprocess. This is the secondary method for obtaining the text of a document (page).
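The two-step approach can be sketched roughly as below. This is an illustration only, not Qiqqa's code: `extract_document_text`, `extract_text_layer` and `run_ocr` are hypothetical names, and the two callables stand in for the mupdf-based extraction and the Tesseract-based subprocess respectively.

```python
def extract_document_text(page_count, extract_text_layer, run_ocr):
    """Fill a per-page text cache: try the text layer first, queue failures for OCR.

    `extract_text_layer(page)` and `run_ocr(page)` are hypothetical callables
    that return the page text (an empty string or None when nothing was found).
    """
    cache = {}
    ocr_queue = []

    # Method 1: direct text extraction, page by page.
    for page in range(1, page_count + 1):
        text = extract_text_layer(page)
        if text:
            cache[page] = ("text-layer", text)
        else:
            ocr_queue.append(page)  # re-queue this page for method 2

    # Method 2: OCR, only for the pages where method 1 delivered nothing.
    for page in ocr_queue:
        cache[page] = ("ocr", run_ocr(page))

    return cache


# Made-up example: page 2 of a 3-page PDF is a scanned image without a text layer.
text_layer = {1: "first page", 3: "third page"}
result = extract_document_text(
    3,
    extract_text_layer=lambda page: text_layer.get(page, ""),
    run_ocr=lambda page: f"ocr text of page {page}",
)
```

Either way the result lands in the same cache, which is why both methods end up being called "OCR" under the hood.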
As long as Qiqqa does not have the PDF text available in its cache, it will disable any user activity that needs this data:
The background tasks mentioned before (inferring metadata) are postponed until the OCR text is available.
There are a few more background tasks which have not been mentioned yet, including the one updating the text search index: that task of course requires the OCR text as well.
From a user perspective, one can say that text searching in Qiqqa will only pick up on the new documents after both the OCR process (methods 1 or 2, whatever it took to get some text out of those new PDFs) and the background Lucene text search indexing process have processed the new PDF documents.
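Seen from the user side, that ordering could be modelled as below. This is a simplified sketch under the "playing it safe" assumption that a document only becomes usable once all of its pages have text. The `DocumentState` class and its method names are hypothetical and do not mirror Qiqqa's internals.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentState:
    """Hypothetical bookkeeping for one imported PDF."""
    page_count: int
    pages_with_text: set = field(default_factory=set)
    indexed: bool = False

    def text_ready(self):
        # All pages have text in the cache, from method 1 or method 2.
        return len(self.pages_with_text) == self.page_count

    def can_select_text(self):
        # Selecting and marking text only needs the extracted text.
        return self.text_ready()

    def is_searchable(self):
        # Search needs the extracted text AND a finished index update.
        return self.text_ready() and self.indexed


doc = DocumentState(page_count=2)
doc.pages_with_text.update({1, 2})   # background text extraction done
before_index = doc.is_searchable()   # False: the index has not caught up yet
doc.indexed = True                   # background index update done
after_index = doc.is_searchable()    # True
```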
Qiqqa may seem to be 'slow' in picking up new imported PDFs as the above processes all happen in the background and are currently set up to load the CPU only moderately: this was specifically done to make Qiqqa cope much better with large & huge libraries filled with technical datasheets and other PDF documents, which caused all sorts of trouble, including UI lockups and application crashes. (In commercial Qiqqa this included fatal crashes, where the application was unwilling to start up again and/or fatal loss of the text search index.)
Yes, we still have a way to go before Qiqqa will be fast and responsive, as the current drive was first to make Qiqqa stable in such a 'large library' environment. Making Qiqqa behave well and stay responsive in various environments will take quite some more effort.
Now that we have a description of what goes on and an observed run, I can address the issue at hand:
as described above, Qiqqa will take some time before it runs and completes the new document(s) text extraction and then allow text marking and selecting actions. Up till that moment those user activities are disallowed.
Hence these activities should be possible after some patience has been exercised. (Unless the PDF is one of the crappy sort, causing the "OCR" methods trouble, which is yet another chapter.)
My initial confusion was due to me thinking in Qiqqa coding terms: both text extraction and recognition are filed under the single title of "OCR-ing the text", because that's how Qiqqa approaches this under the hood.
To complicate matters further, there are also a couple of options to freeze the OCR/text extraction and/or all background processes. Suffice it to say those options (in the Qiqqa Tools menu and Qiqqa Configuration window) are not active unless the user has activated them (e.g. a developer or power user testing Qiqqa or importing a large set of documents). The use of these options is out of scope here.
Ok, so in the same context, my observation is that for most of the time, I am not utilizing the PC resources (I have i3 8th Gen CPU with 12 GB DDR4 RAM). And yet the background process does not kick in. What is Qiqqa waiting for??
Note that in many cases, the laptop would go to sleep mode automatically after x minutes of no action. So Qiqqa must make maximum use of the available time.
In fact, Qiqqa should also have the feature to prevent the laptop from going to sleep, so that these pending activities can be finished ASAP.
One more aspect is that when a file has some pages recognized with one pass, then the user should have access to those OCRed pages.
IIINW Qiqqa does not come back to those pages when the second process is working, right?
In theory, you should have access to those pages. In actual practice though... YMMV.
Qiqqa is, for the most part (waving my hands there), coded with on demand fetching of the OCR text (= "textified" = ocr cache), but quite a few actions trigger other bits, which then in turn "on demand" other bits, so sometimes it's a bit of a mess really what gets fetched and required when and by whom precisely. The text search index background update process and others are further complicating matters there, so I stick with my YMMV position until the entire document has been processed entirely.
Anyhow, that's the long and round-about way of saying this: Qiqqa does not fit in my brain in its entirety, hence I tend to work with a couple of minimized assumptions which are strongly adjusted for "playing it safe". Guess that shrinking headroom is the devil in the detail of getting old.
On topic: I seem to recall that the Lucene text search database update process (running in the background) did revisit previously visited documents. And there was that bit I cannot recall, dang!...
[Insert possibly-maybe body shake here. Sorry, you'll have to deal with my inaccuracies, alas. :wink:]
If you are willing to make a debug version to trace out this noodle logic, I am willing to experiment with various test cases. I am rather good at that!
So I noticed.
BTW: also note that Qiqqa produces a logfile, which contains lots of dev/trace info, at
C:\Users\Ger\AppData\Local\Quantisle\Qiqqa\Logs\
where Users\Ger\ should be replaced by your own user path to Windows AppData.
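If you'd rather not work out that path by hand, a small Python sketch like the one below builds it from the standard Windows `LOCALAPPDATA` environment variable. The `qiqqa_log_dir` helper is a hypothetical name of my own, and it assumes the log folder sits at the location shown above.

```python
import os
from pathlib import PureWindowsPath


def qiqqa_log_dir(env=os.environ):
    """Build the Qiqqa log folder path from LOCALAPPDATA (None if it is not set)."""
    local_app_data = env.get("LOCALAPPDATA")
    if not local_app_data:
        return None
    # Assumed layout, taken from the path mentioned above.
    return PureWindowsPath(local_app_data) / "Quantisle" / "Qiqqa" / "Logs"


# Example with an explicit environment, so it also runs on a non-Windows box.
example = qiqqa_log_dir({"LOCALAPPDATA": r"C:\Users\Ger\AppData\Local"})
```

On your own machine, calling `qiqqa_log_dir()` without arguments uses your own user profile.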
And the noodle logic you refer to is deciphered by me using code review and (sometimes) debugging. Not all of the interactions in there are obvious, as quite a bit of it is due to UI panels being updated, which then trigger other UI updates. Anyway, signing off in a bit. Thanks again for all your input.
(And I'm curious about that huge PDF that caused the crash. My biggest is a 500MB monster scan of an old 1948 electronics book, but I didn't get a crash out of that one lately.)
Mine is just 88 MB. But it has badly scanned pages and a mix of Hindi+English. I have posted the link to your email ID, but could you please add it to the "evil pdf" collection? (It is a publicly shared file anyway.) Thanks!
I have a court order as a pdf file. It has machine-searchable text (as opposed to scanned images). I can open the file in Foxit pdf Reader and annotate the text (apply highlighter, add callouts and text boxes, etc.)
But Qiqqa does not allow me to select text from it with the Select text tool.
I checked the security settings of the file, and they seem to be OK:
If you would like to have this file for experimentation, please let me know.