jimmejardine / qiqqa-open-source

The open-sourced version of the award-winning Qiqqa research management tool for Windows
GNU General Public License v3.0
380 stars, 64 forks

BUG: Qiqqa does not allow me to select text from a pdf file that already has selectable text #165

Open raindropsfromsky opened 4 years ago

raindropsfromsky commented 4 years ago

I have a court order as a PDF file. It has machine-searchable text (as opposed to scanned images). I can open the file in Foxit PDF Reader and annotate the text (apply highlighter, add callouts and text boxes, etc.).

But Qiqqa does not allow me to select text from it with the Select text tool.

I checked the security settings of the file, and they seem to be OK.

If you would like to have this file for experimentation, please let me know.

GerHobbelt commented 4 years ago

If you would like to have this file for experimentation, please let me know.

Yes please 👍

Various ways to send/submit the PDF (in order of my personal preference today):

I'll have a look as time allows.

raindropsfromsky commented 4 years ago

Here it is: SC judgement dtd 17-03-2020, on EIA for PRR, Bangalore.pdf

FWIW I am using Qiqqa as a guest (I haven't logged in, or used a special library).

GerHobbelt commented 4 years ago

Open Source Qiqqa doesn't have a user account of any kind. The old Commercial Qiqqa used a user account to

  1. identify who you were (for licensing purposes)
  2. use those credentials to give you access to the Qiqqa Cloud (which is not accessible anymore from OSS Qiqqa as that feature was taken out before Qiqqa went open source)
  3. identify you in the Qiqqa Chat (which is also not available in Qiqqa Open Source)

GerHobbelt commented 4 years ago

And thanks for the PDF, by the way. 👍

GerHobbelt commented 4 years ago

Had a look at what exactly happened. It has been enlightening: I discovered I was working from a couple of internal assumptions that are clearly based on my experience as a developer rather than as a user, and those were colouring my view of the user experience.

What is going on?

When Qiqqa imports the PDF into the library, a few things happen under the hood:

Both of these 'trigger' a request to fetch the document text, i.e. the OCR text.

What is "OCR text" (in this context)?

Qiqqa "OCR text" is the collection of words, each with the coordinates of its location rectangle, extracted from the PDF by the OCR background process. Think of it as each word plus its precise position on the page, stored in a Qiqqa proprietary OCR file format.
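
As a rough illustration of that description (this is not Qiqqa's actual file format, which is not shown in this thread), one page can be pictured as a list of words with bounding rectangles. The names `Word` and `PageText`, and the use of page-relative coordinates, are assumptions made for this sketch:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Word:
    """One word plus its bounding rectangle on the page."""
    text: str
    left: float    # assumption: coordinates are fractions of the page size
    top: float
    width: float
    height: float

    def contains(self, x: float, y: float) -> bool:
        return (self.left <= x <= self.left + self.width
                and self.top <= y <= self.top + self.height)


@dataclass
class PageText:
    """Hypothetical per-page cache entry: every word found on one page."""
    page_number: int
    words: List[Word] = field(default_factory=list)

    def word_at(self, x: float, y: float) -> Optional[Word]:
        """Return the word under a point, e.g. where the user clicked."""
        for word in self.words:
            if word.contains(x, y):
                return word
        return None
```

A structure along these lines is what a text selection tool would need: without the per-word rectangles there is nothing to map a mouse position onto.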

How does Qiqqa obtain this OCR text?

That's where some confusion can occur: Qiqqa has two methods to extract text from a PDF. It does not matter which of these methods has produced that text content: either way it's stored in the "Qiqqa OCR text cache".

Text Extraction

The primary method is direct text extraction: using the mupdf tool, Qiqqa can get the text (plus coordinates) for any PDF which has a text layer embedded.

Your sample PDF is entirely processed by this first method, all 69 pages of it.

Text Recognition

When the primary method fails to deliver a text for a given page, that page is then re-queued to have it OCR-ed using a Tesseract-based subprocess. This is the secondary method for obtaining the text of a document (page).
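
The two-step flow described above can be sketched roughly as follows. This is only a model of the described behaviour, not Qiqqa's code: `extract` and `recognize` are hypothetical stand-ins for the mupdf-based and Tesseract-based steps, whose real interfaces are not shown in this thread.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical signature: page number in, page text out (None or "" on failure).
PageTextSource = Callable[[int], Optional[str]]


def textify(page_count: int,
            extract: PageTextSource,
            recognize: PageTextSource) -> Tuple[Dict[int, str], List[int]]:
    """Fill a text cache page by page: extraction first, OCR as fallback."""
    cache: Dict[int, str] = {}
    ocr_queue: List[int] = []

    # Primary method: direct extraction from the embedded text layer.
    for page in range(1, page_count + 1):
        text = extract(page)
        if text:
            cache[page] = text
        else:
            ocr_queue.append(page)  # re-queue this page for the secondary method

    # Secondary method: OCR only the pages the primary method could not handle.
    still_missing: List[int] = []
    for page in ocr_queue:
        text = recognize(page)
        if text:
            cache[page] = text
        else:
            still_missing.append(page)

    return cache, still_missing
```

Either way the result ends up in the same cache, which is why the rest of the application does not need to know which method produced a given page.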

How does this impact UX?

As long as Qiqqa does not have the PDF text available in its cache, it will disable any user activity that needs this data:

The background tasks mentioned before (inferring metadata) are postponed until the OCR text is available.

There are a few more background tasks which have not been mentioned yet, including the one updating the text search index: that task of course requires the OCR text as well.

From a user perspective, one can say that text searching in Qiqqa will only pick up on the new documents after both the OCR process (methods 1 or 2, whatever it took to get some text out of those new PDFs) and the background Lucene text search indexing process have processed the new PDF documents.
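
Put differently, a newly imported document passes through stages, and different features unlock at different stages. A minimal sketch of that idea, with stage and function names invented for this example (Qiqqa's internal bookkeeping may look quite different):

```python
from enum import Enum


class DocStage(Enum):
    """Hypothetical processing stages for one imported PDF."""
    IMPORTED = 1      # file is in the library, no text in the cache yet
    TEXT_CACHED = 2   # extraction or OCR has produced the text
    INDEXED = 3       # the search index has picked up that text


def text_features_enabled(stage: DocStage) -> bool:
    """Features that read the cached text work once the text is cached."""
    return stage in (DocStage.TEXT_CACHED, DocStage.INDEXED)


def is_searchable(stage: DocStage) -> bool:
    """Full-text search finds the document only after indexing as well."""
    return stage is DocStage.INDEXED
```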

Performance

Qiqqa may seem to be 'slow' in picking up new imported PDFs as the above processes all happen in the background and are currently set up to load the CPU only moderately: this was specifically done to make Qiqqa cope much better with large & huge libraries filled with technical datasheets and other PDF documents, which caused all sorts of trouble, including UI lockups and application crashes. (In commercial Qiqqa this included fatal crashes, where the application was unwilling to start up again and/or fatal loss of the text search index.)

Yes, we still have a way to go before Qiqqa is fast and responsive: the current drive was first to make Qiqqa stable in such a 'large library' environment. Making Qiqqa well-behaved and responsive in various environments will take quite some more effort.

Now back on topic

Now that we have a description of what goes on and an observed run, I can address the issue at hand:

My initial confusion was due to me thinking in Qiqqa coding terms: both text extraction and recognition are filed under the single title of "OCR-ing the text", because that's how Qiqqa approaches this under the hood.


To complicate matters further, there are also a couple of options to freeze the OCR/text extraction and/or all background processes. Suffice it to say that those options (in the Qiqqa Tools menu and Qiqqa Configuration window) are not active unless the user has activated them (e.g. a developer or power user testing Qiqqa or importing a large set of documents). The use of these options is out of scope here.

raindropsfromsky commented 4 years ago

Ok, so in the same context, my observation is that most of the time I am not utilizing the PC's resources (I have an 8th-gen i3 CPU with 12 GB of DDR4 RAM). And yet the background process does not kick in. What is Qiqqa waiting for?

Note that in many cases, the laptop would go to sleep mode automatically after x minutes of no action. So Qiqqa must make maximum use of the available time.

In fact, Qiqqa should also have the feature to prevent the laptop from going to sleep, so that these pending activities can be finished ASAP.

raindropsfromsky commented 4 years ago

One more aspect: when some pages of a file have already been recognized in one pass, the user should have access to those OCRed pages.

IINW, Qiqqa does not come back to those pages when the second process is working, right?

GerHobbelt commented 4 years ago

In theory, you should have access to those pages. In actual practice though... YMMV.

Qiqqa is, for the most part (waving my hands there), coded to fetch the OCR text (= "textified" = OCR cache) on demand, but quite a few actions trigger other bits, which in turn demand yet other bits, so it is sometimes a bit of a mess what gets fetched and required when, and by whom. The text search index background update process and others complicate matters further, so I stick with my YMMV position until the entire document has been processed.

Anyhow, that's the long and round-about way of saying this: Qiqqa does not fit in my brain in its entirety, hence I tend to work with a couple of minimized assumptions which are strongly adjusted for "playing it safe". Guess that shrinking headroom is the devil in the detail of getting old. 😉

On topic: I seem to recall that the Lucene text search database update process (running in the background) did revisit previously visited documents. And there was that bit I cannot recall, dang!...

[Insert possibly-maybe body shake here. Sorry, you'll have to deal with my inaccuracies, alas. 😉]

raindropsfromsky commented 4 years ago

If you are willing to make a debug version to trace out this noodle logic, I am willing to experiment with various test cases. I am rather good at that!

GerHobbelt commented 4 years ago

So I noticed. 😄

GerHobbelt commented 4 years ago

BTW: also note that Qiqqa produces a logfile, which contains lots of dev/trace info, at

C:\Users\Ger\AppData\Local\Quantisle\Qiqqa\Logs\

where Users\Ger\ should be replaced by your own user path to Windows AppData.
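
If you'd rather not work out the path by hand, here is a small sketch for building it. It assumes the standard Windows `LOCALAPPDATA` environment variable points at your `AppData\Local` folder; the function name `qiqqa_log_dir` is made up for this example.

```python
import os
from pathlib import PureWindowsPath
from typing import Optional


def qiqqa_log_dir(local_appdata: Optional[str] = None) -> PureWindowsPath:
    """Build the Qiqqa log folder path under the user's local AppData folder."""
    # Fall back to the environment variable when no folder is passed in.
    base = local_appdata or os.environ.get("LOCALAPPDATA")
    if not base:
        raise RuntimeError("LOCALAPPDATA is not set; pass the folder explicitly")
    return PureWindowsPath(base) / "Quantisle" / "Qiqqa" / "Logs"


if __name__ == "__main__":
    # Explicit base folder, so this also runs outside Windows.
    print(qiqqa_log_dir(r"C:\Users\Ger\AppData\Local"))
```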

GerHobbelt commented 4 years ago

And the noodle logic you refer to is deciphered by me using code review and (sometimes) debugging. Not all of the interactions in there are obvious, as quite a bit of it is due to UI panels being updated, which then trigger other UI updates. Anyway, signing off in a bit. Thanks again for all your input.

(And I'm curious about that huge PDF that caused the crash. My biggest is a 500MB monster scan of an old 1948 electronics book, but I didn't get a crash out of that one lately.)

raindropsfromsky commented 4 years ago

Mine is just 88 MB. But it has badly scanned pages and a mix of Hindi+English. I have posted the link to your email ID, but could you please add it to the "evil pdf" collection? (It is a publicly shared file anyway.) Thanks!