jimmejardine / qiqqa-open-source

The open-sourced version of the award-winning Qiqqa research management tool for Windows
GNU General Public License v3.0

Critical bug: Qiqqa does not report major failures in texifying a file; blames it on OCR! #193

Open raindropsfromsky opened 4 years ago

raindropsfromsky commented 4 years ago

The Supreme Court judgment file already contains searchable text (it does not have scanned images).

It is a "pure text" file (no embedded images); which means it requires only textification stage; and no OCR stage.

This file has been sitting in Qiqqa for several days now (I have to mention this because Qiqqa has this strange habit of keeping tasks pending for days on end). So I am assuming that Qiqqa has finally overcome its procrastination and finished all lazy background tasks.

The status line says All 8xx pages are searchable, with 0 to go, with a dark green highlight. (This is another weird feature of Qiqqa: the status line flashes random messages, which disappear after some time. I cannot check them whenever I need to; at least, there is no apparent way.)

So I was under the impression that all was well.

But then I tried the Convert your PDF to text command.

To my shock, I found that many pages are reported as lost!

[image: https://user-images.githubusercontent.com/9047168/77720439-f6757500-700d-11ea-8e8b-f352e5936fb0.png]

I checked out these pages, and indeed Qiqqa cannot select individual words with the Text select tool. (It is able to select words on other pages that are recognized well.)

This unpredictability of textification/OCR undermines Qiqqa's dependability! How can we rely on Qiqqa to search within such partially recognized docs??

So, this raises several questions:

  1. Why did Qiqqa need the OCR process, when textification would have sufficed?
  2. Why did Qiqqa hide the OCR warnings?
  3. Why did the status line not report these failures?
  4. When I browse, why does Qiqqa not place a warning on the defective pages of the document?
  5. Why does Qiqqa give the reassuring message that "All 8xx pages are searchable, with 0 to go" when it already knows that they are NOT recognized?
GerHobbelt commented 4 years ago

Prelude

The QiqqaOCRFailedFakedWord.* "words" are a recent addition of mine, as I ran into the same trouble as you: though there's the Qiqqa log output, it was very much unclear what exactly drove Qiqqa (in my case) to retry the text extraction + OCR activity for several documents (and particular pages) ad nauseam.

These "faked" words signal that Qiqqa is unable to get some text from that particular page. Either because

Anyway, that's about the list of causes I can come up with, in order of decreasing horribleness.

The "curious" bit of Qiqqa was (and in ways still is), at least from a user perspective, that it keeps re-trying the text extraction/OCR business an infinite number of runs, when the entire workflow does not succeed in delivering any words for a given page.

This user-observed behaviour has been forcibly stop gapped by me with those "fake words" being injected into the output when, at the end of all the things we tried in that workflow, there still is nothing to report home. The premise here being that once you've failed each stage in the text extraction a.k.a. "textify/OCR" process, then there's no reason to expect it to do better next time.
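
Here is roughly what that stop-gap boils down to, as a minimal sketch -- NOT the actual Qiqqa code; the two Try* helpers are placeholders standing in for the real mupdf and Tesseract stages:

```csharp
// Minimal sketch of the "fake word" stop-gap; placeholder helpers, not Qiqqa's real code.
using System.Collections.Generic;

static class TextifyStopGapSketch
{
    const string FailureSentinel = "QiqqaOCRFailedFakedWord";

    // Placeholders for the real extraction stages:
    static List<string> TryMupdfTextExtraction(string pdf, int page) => new List<string>(); // stage 1: embedded text via mupdf
    static List<string> TryTesseractOcr(string pdf, int page) => new List<string>();        // stage 2: OCR the rendered page

    public static List<string> ExtractWords(string pdfPath, int pageNumber)
    {
        var words = TryMupdfTextExtraction(pdfPath, pageNumber);
        if (words.Count == 0)
            words = TryTesseractOcr(pdfPath, pageNumber);

        if (words.Count == 0)
        {
            // Every stage came up empty: record a sentinel "word" so the page counts as
            // processed and the background queue stops re-trying it forever.
            words.Add(FailureSentinel + ".0");
        }
        return words;
    }
}
```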

Note

Of course this introduces another subtle error cause into the mix: sometimes tools fail to run due to external circumstances, e.g. temporary I/O failures -- this happens with USB-connected disks quite a lot when you hit them for the first time after they have been parked and spun down for a long while.

Though this type of failure ("temporary failure") does happen in my experience, it is relatively rare.

The QiqqaOCRFailedFakedWord.* Stop Gap: Why did I do this?

Well, after going through the Qiqqa code multiple times and trying to come up with something smarter and neater, I found that those approaches would cost me some serious time, as I'm re-learning to code in C#/.NET/WPF after about 10 years of not having had the pleasure (no trouble with C#, but WPF... ugh). So I came up with this hack when I was totally fed up with a CPU-loading, unresponsive and power-guzzling application that didn't know when enough was enough. (This is one of the major reasons why Commercial Qiqqa was fatally crashing on my PDF library: all it takes are a few botched PDFs (see the Evil Library For Qiqqa repo; plenty of party crashers in there) and you had yourself a beast out of control, in a nose dive it would never recover from.)

Most of my work on Open Source Qiqqa to date has been an effort to make it behave in the face of peril and onslaught from incoming PDFs, both old/faded scans, presentations, etc. (i.e. stuff that is not a standard-format research paper) and b0rked downloads.

With that out of the way, it's on with...

The Main Movement

The Supreme Court judgment file already contains searchable text (it does not have scanned images).

It is a "pure text" file (no embedded images); which means it requires only textification stage; and no OCR stage.

I guess that's not the one you sent me with 69 pages in it, right? I've seen in your emails that you must have sent a second (large) one, as you mention it, but that one never made it through. Email with large attachments is often killed by internet providers, so it might be easier to upload it as part of this issue or put it in a shared Dropbox, Google Drive or other public cloud storage share.

This file has been sitting in Qiqqa for several days now (I have to mention this because Qiqqa has this strange habit of keeping tasks pending for days on end). So I am assuming that Qiqqa has finally overcome its procrastination and finished all lazy background tasks.

I'm quite interested in your Qiqqa log files. You can find those at

%appdata%\..\Local\Quantisle\Qiqqa\Logs

Copy that path into your Windows Explorer address bar and you'll be redirected to the AppData dirtree where Qiqqa keeps its log files, as shown in the screenshot below.

[image: 2020-03-27_10-28-14] https://user-images.githubusercontent.com/402462/77741971-02653500-7016-11ea-9a77-ec023cb0c9d2.png
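
For the curious: that %appdata%\..\Local prefix is simply the .NET "LocalApplicationData" folder. A tiny sketch (not Qiqqa's own code) that prints the same directory:

```csharp
// Resolve the Qiqqa log directory programmatically (illustration only).
using System;
using System.IO;

string logDir = Path.Combine(
    Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData),
    "Quantisle", "Qiqqa", "Logs");
Console.WriteLine(logDir);   // typically C:\Users\<you>\AppData\Local\Quantisle\Qiqqa\Logs
```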

The status line says All 8xx pages are searchable, with 0 to go, with a dark green highlight. (This is another weird feature of Qiqqa: the status line flashes random messages, which disappear after some time. I cannot check them whenever I need to; at least, there is no apparent way.)

Heh. The Status line. It's trying to say all sorts of things that a status line is absolutely useless for.

After a while you get an understanding of what's going on under the hood by looking at those messages.

The green bar bit is the completion percentage as estimated by Qiqqa internals, i.e. full green is "all done" (100%).

But then I tried the Convert your PDF to text command.

To my shock, I found that many pages are reported as lost!

[image: https://user-images.githubusercontent.com/9047168/77720439-f6757500-700d-11ea-8e8b-f352e5936fb0.png]

I checked out these pages, and indeed Qiqqa cannot select individual words with the Text select tool. (It is able to select words on other pages that are recognized well.)

Remember that "textify/OCR" multi-stage process discussed before and mentioned in the Prelude above? This is indicative of Qiqqa indeed failing to get something decent (= a set of words) from both the mupdf process and the Tesseract process. Not necessarily due to those tools failing themselves, but when QiqqaOCR.exe observes a failure in the process at hand, it will output what's wrong, which ends up in your log files; that's why I am very interested in them, and you might want to have a look at them yourself as well! Just don't get fazed by the slew of messages in there: when things are truly down in the dumps, those logfiles can fill up like mad -- and that's intentional for the most part: my v82 releases spit out a humongous amount of log when compared to v80 or v79 (= Commercial Qiqqa) for analysis purposes, while I grow to understand the Qiqqa logic and codebase.

This unpredictability of textification/OCR undermines Qiqqa's dependability! How can we rely on Qiqqa to search within such partially recognized docs??

Honest answer?

First off, I don't know what made Qiqqa decide on this particular path there. I will need that PDF for that, plus the logfiles would be a help as well!

Generally speaking though:

Whatever you use as a machine to support you, there's no guarantee, ever. The only way to be sure is to make sure, by manually vetting all output. And even then human error makes it a statistical chance instead of a guarantee -- compare with server up-time guarantees: absolutely nobody in the business (except some totally coked out marketing types) will sell you a 100% up-time guarantee. The money is in the number of 9s you can quote.

Same goes for document scanning and indexing: it's a how-many-9s-do-I-get? business.
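
To put a rough number on that (back-of-the-envelope only; the 99.9% per-page figure is an assumption for illustration, not a measured Qiqqa rate):

```csharp
// The "how many 9s" game, in numbers: even three 9s per page bites on a big document.
using System;

int pages = 800;
double perPageSuccess = 0.999;                              // "three 9s" per page (assumed)
double expectedFailedPages = pages * (1 - perPageSuccess);  // about 0.8 pages on average
double chanceAllPagesOk = Math.Pow(perPageSuccess, pages);  // about 45%: barely a coin flip
Console.WriteLine($"expected failed pages: {expectedFailedPages:F1}, chance of a fully clean document: {chanceAllPagesOk:P0}");
```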

And, granted, Qiqqa isn't a top contender, particularly when you use it outside its original field of use and development. That can be improved -- and I intend to make that happen. Only it won't happen overnight. Our discussions help me as they add perspective, ideas and possible alternative routes of attack, but in the end it's design (not the UI, but the process design) that has to be constructed and implemented. I hope you have patience for that. πŸ˜‰

Aside: I went through a similar search process as you've described elsewhere and ended up with Commercial Qiqqa back in the day. Then, due to the nauseating crashes and boot failures, I had to move away, used a few other tools, which was a disastrous experience, and then went back to Qiqqa and even spent time reverse-engineering it when it was NOT Open Sourced yet (nor seemed like it would ever become Open Source): https://github.com/GerHobbelt/qiqqa-revengin

If Qiqqa hadn't been Open Sourced, I would still be angry and probably trying hard to find the time to code my own, even though I know how much effort would be involved in coding such a beast from scratch and doing it better. 😨


Remember Nuance PaperPort (now Kofax PaperPort) I mentioned before elsewhere in our discussions?

That one's very similar to Qiqqa in some ways (at least when it comes to managing PDF digitizing workflows and searchability), and Nuance, always having had a keen eye for where the real money is to be found (healthcare and legal services), let it languish for several years. Sounds like Qiqqa? Hell, yeah. Only on a much bigger scale.

I was invited to a healthcare company (a bunch of professionals who had Inc.'ed themselves) several years ago and they were "Nuance based" in their toolchain and chained to it (thanks to higher-level politics), the question being if I could do something for them in improving some of the hairier PDF processes, which, incidentally, Nuance Support was not even answering questions about (they are still great at that, I hear, if you're not a card-carrying F500 member). Nuance was specifically targeting your field at the time (legal research, plus of course healthcare) and I liked the folks at that Inc. a lot, they cared, but in the end I was totally ineffective at providing a decent user-viable solution for their trouble. If I ever get to work with Nuance [products] again in a professional setting, I will seriously consider flossing my brain with the business end of a .45. With Qiqqa, I floss with the chamber empty. πŸ˜‰

Nevertheless, if you want quick results that are above mediocre, you might want to look into PaperPort (and TextBridge? OmniPage? All together?) again, as I see that Kofax has relieved Nuance of that entire branch - so something good might finally be happening with those tools again.

Since you are not in the publish-or-perish business (a.k.a. university 😉), where you need to make sure you've referenced absolutely everyone who needs to be referenced in your next upcoming paper, Zotero and friends are an ill match, so I understand why you landed at Qiqqa.


Apologies for this tangential rant, but sometimes when I work on Qiqqa I re-live a bit of that stuff as I recognize that I have to do quite a few of the same things all over again. And they were asking the same question you do: "But how can we rely on this...?" And the hard answer is: no, you cannot. Not the way you see it, at least. You SHOULD treat it like a kind of "local google on your own machine(s)": google doesn't find everything, everywhere, all the time. They killed babelsearch and several others because they had some smarter ideas, some luck and enough funding to win the tuning game, so googling is now an accepted verb in the English language. Still, my estimate is that about 5% of my own PDF collection would never have gotten here if I had relied on google: some of the PDFs I have I know for certain never made it into the pool, and some of them I've actively searched for after I had obtained them, and google was unable to cough up anything even remotely close. Ditto for searching my collection: I don't expect perfect answers, but just hope I get lucky often enough with some nice results that save me time. Guaranteed search results are another 9s game, as there's always some garbage input, so even when your search system is pitch-perfect, you don't really have guaranteed delivery, just a very high probability.

Same here with Qiqqa: you gain the chance to get lucky more often than when you had employed old skool human labor, human recall and library tech. That's the win you get. You MAY get lucky more often. And then you can go and invest more or less heavily in tweaking the numbers of chance, i.e. "how many 9s can you give me?"

That doesn't mean I am somehow okay with how it operates right now.

That's what #35 et al is really all about, when regarded from a high vantage point. And that, plus contextual conditions (team size, time available, etc.), is why this takes so long to accomplish.


Lemme descend from that eagle's nest and back down to basics:

So, this raises several questions:

  1. Why did Qiqqa need the OCR process, when textification would have sufficed?
  2. Why did Qiqqa hide the OCR warnings?
  3. Why did the status line not report these failures?
  4. When I browse, why does Qiqqa not place a warning on the defective pages of the document?
  5. Why does Qiqqa give the reassuring message that "All 8xx pages are searchable, with 0 to go" when it already knows that they are NOT recognized?
  1. That depends on what happened exactly in that first text extraction phase: did mupdf+QiqqaOCR deliver something?

    This needs logfiles to get anywhere near a solid answer.

  2. That's Qiqqa's user-friendly design I guess: many users would not want to be bothered with the gory details, as they have a mental model of using it as a "best effort local google" for their document set, perhaps? Mostly, I expect, this is due to both of us using Qiqqa outside the realm it originated in and was designed for: writing papers at universities. That, at least, makes the failure modes more apparent and frequent.

  3. Ditto as number 2: failures end up in the log files only.

  4. Because apparently the sort of document collection process where you want detailed control over the quality of the results and/or the conversion success numbers a.k.a. reliability estimates was never envisioned in the original design. Qiqqa doesn't store that type of metadata.

    If you have a cynical side (like me) then you might not be surprised to hear that the mupdf textify (= text extraction) and Tesseract OCR subprocesses do deliver confidence estimates with each word on a page, and that confidence number is stored in the proprietary OCR/textify cache files, but is used nowhere: it is also not incorporated in the search index nor its output postprocessing, where such confidence data MAY impact search result output order, as lower confidence values might be considered less desirable than high confidence ones, iff you're so inclined (a rough sketch of what that could look like follows after this list). Want proof of that? Click here: github will tell you! https://github.com/jimmejardine/qiqqa-open-source/search?q=confidence&unscoped_q=confidence 😈

    Hence Qiqqa acts on the premise that any textified page is a good little reliable page and every word is bingo perfect. While you do get search hit percentages, those percentages come from the search result itself: how well it matches your search criteria. Qiqqa is still assuming that everything the search index spits out is of pristine quality.

  5. Because that's technically true at least: Qiqqa doesn't keep tabs on failures there, so once it is done, it is done.

    This may sound horrible to your ears, but it is the sensible conclusion when you consider how Qiqqa is designed (or at least seems to have been designed): when there's no user process in place to notify you of errors in such a way that you may be able to act instead of merely observe, when there's no process to give you any modicum of control over the textify/OCR process (except switching languages), then keeping a tally of failed pages (what is failure, then, again?) is only being... cruel. Because you can look at the number then:

        57 pages are b0rked, the rest of the 800+ have been indexed! You're good to go! Nya nya nya!

    but you have no means of influencing this number. Better to keep it simpler then. So, yes, as long as Qiqqa doesn't have a severely altered backend text extraction process, every page that's done is "good to go". I don't like it either, but I hope you can see my point here: what does it do to a user to be informed like that while having zero control over how it went down after all? (And I'm not talking about power-users, who might start something like https://github.com/GerHobbelt/qiqqa-revengin , etc.)
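
To make item 4 above a bit more concrete, here is what folding those (currently ignored) confidence numbers into the result ordering could look like. A sketch only: none of these types exist in Qiqqa today, and the weighting formula is made up for illustration:

```csharp
// Hypothetical confidence-aware ranking; names and weights are invented for this sketch.
using System.Collections.Generic;
using System.Linq;

class PageHit
{
    public double LuceneScore;        // how well the page matches the query
    public double MeanWordConfidence; // 0.0 .. 1.0, averaged over the page's textified/OCRed words
}

static class ConfidenceRankingSketch
{
    public static IEnumerable<PageHit> Rank(IEnumerable<PageHit> hits)
    {
        // Down-weight pages whose underlying OCR text is shaky, so a perfect textual match
        // on garbage OCR does not outrank a decent match on clean text.
        return hits.OrderByDescending(h => h.LuceneScore * (0.5 + 0.5 * h.MeanWordConfidence));
    }
}
```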


Got that off my chest. Thanks for listening. Or skimming it.

raindropsfromsky commented 4 years ago

Wow! Hats off for that amazing effort. I will take some time to digest this.

But in the meantime, here are the log files. I zipped them all. But prima facie this is yet another argument in favor of hiving off the OCR function. It's a pure headache.

raindropsfromsky commented 4 years ago

I have gone through your reasoning, and I agree.

I also realized that you and I have stretched the app beyond its original design parameters. IINW, the original design was for an average document that has only a few pages, with searchable text (not scanned images). Further, such a document mostly contains plain text, rather than images embedded in text.

That said, some of the behavior does not seem logical. You have pointed out some of these idiosyncrasies. But there are many more.

I know that Tesseract itself has a certain limit on its character-recognition accuracy, which depends on the font size, the contrast, the dpi of the scanned print, and the bleeding of ink on paper that changes the shape of the characters. I also know that the accuracy can be increased by training the engine with multiple samples of the same character.

Finally, before the scanning, the user should be allowed to adjust the text/image/table blocks, and tweak the contrast+brightness+gamma curve to get the best results. After the scanning, the user should be allowed to compare the result with the original and correct all mistakes (in some apps, the GUI highlights potentially wrong output, and a spellcheck highlights spelling mistakes).
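
For illustration only (this is not something Qiqqa offers today; the snippet below is just a sketch assuming a .NET/System.Drawing environment like Qiqqa's), this is the kind of gamma tweak I mean, applied to a scanned page before it is handed to the OCR engine:

```csharp
// Sketch of a user-controlled gamma correction step for a scanned page (illustration only).
using System.Drawing;
using System.Drawing.Imaging;

static class ScanPreprocessSketch
{
    public static Bitmap ApplyGamma(Bitmap source, float gamma)
    {
        var result = new Bitmap(source.Width, source.Height);
        using (var g = Graphics.FromImage(result))
        using (var attrs = new ImageAttributes())
        {
            attrs.SetGamma(gamma);   // e.g. 1.4f to lighten a faded scan before OCR
            g.DrawImage(source,
                new Rectangle(0, 0, source.Width, source.Height),
                0, 0, source.Width, source.Height,
                GraphicsUnit.Pixel, attrs);
        }
        return result;
    }
}
```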

But all that is missing in Qiqqa. So we are sacrificing a lot of accuracy. But this may not matter for a paper-writer because he has compiled thousands of papers on just one subject. That builds a lot of redundancy, and the same information would have come from other papers anyway.

Thus, on the whole, the Qiqqa system is fault-tolerant.

But in my case, a document set either holds unique reference information, or is used as evidence in a case. Here there is no redundancy, and the cost of failure is very high (literally life-or-death, or a few years of jail). Thus in my case everything has to be double/triple-checked with human intelligence and due diligence. In such a workflow, Qiqqa can lighten my load, but it can never automate it.

But if Qiqqa has such design compromises, I doubt how much I can rely on it.

Even so, I am hopeful that we can salvage it by working on the original logic, or at least by adding an OCR engine with better accuracy and manual stages.

Let us work offline to make a systematic list of such behaviors and try to analyze them. While I can only do black-box testing, you can check the code.