plazi / GoldenGATE-Imagine

A GUI Tool For Freeing Text and Data from PDF Documents

batch replacement of OCR errors #21

Open myrmoteras opened 2 years ago

myrmoteras commented 2 years ago

In this case, the female symbol is recognized as a large bold 2. Is there a way to fix this in a batch of articles?

image image londt_1982b.pdf

gsautter commented 2 years ago

For a batch of articles, I cannot say for sure, as it's impossible to foresee whether there are overarching patterns to exploit (at least without my crystal ball, which is broken) ... for a single article, this could be doable, provided the errors are sufficiently regular within that article ... how many articles are we talking about?

gsautter commented 2 years ago

Oh, and the file behind the londt_1982b.pdf link is empty.

flsimoes commented 2 years ago

londt_1982b.pdf

gsautter commented 2 years ago

> londt_1982b.pdf

Got it ... the OCR is remarkably good, actually; the portions in italics in particular have very few errors compared to other OCR we've seen ... I'm trying out a tool that uses the visual similarity of the scan snippets underneath the individual words to cluster them and correct them in bulk ... the UI needs some work, but the general approach seems quite promising ...
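To illustrate the general idea of clustering by visual similarity and then correcting a whole cluster at once, here is a minimal sketch. It is not the tool mentioned above: the `Word` structure, the tiny 5x5 bitmaps, the pixel-agreement similarity measure, and the 0.9 threshold are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical word record: OCR text plus a binarized scan snippet."""
    text: str
    bitmap: tuple  # rows of '0'/'1' characters, all the same length


def similarity(a, b):
    """Fraction of pixels that agree; 0.0 if the bitmaps differ in shape."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return 0.0
    total = sum(len(row) for row in a)
    if total == 0:
        return 0.0
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / total


def cluster(words, threshold=0.9):
    """Greedy clustering: a word joins the first cluster whose first
    member (its representative) is at least `threshold` similar."""
    clusters = []
    for word in words:
        for members in clusters:
            if similarity(members[0].bitmap, word.bitmap) >= threshold:
                members.append(word)
                break
        else:
            clusters.append([word])
    return clusters


def correct_cluster(members, new_text):
    """Apply one correction to every word in a cluster; return the count."""
    for word in members:
        word.text = new_text
    return len(members)


# Made-up snippets: a female symbol, a noisy copy of it, and a real "2".
FEMALE = ("01110", "10001", "01110", "00100", "01110")
FEMALE_NOISY = ("01110", "10001", "01110", "00100", "01100")
TWO = ("01110", "10001", "00110", "01000", "11111")

# The OCR read all three as "2"; only the scan snippets tell them apart.
words = [Word("2", FEMALE), Word("2", TWO), Word("2", FEMALE_NOISY)]
clusters = cluster(words)

# A user inspects the first cluster once and fixes all of its members.
fixed = correct_cluster(clusters[0], "\u2640")  # U+2640 FEMALE SIGN
```

In this sketch the genuine "2" lands in its own cluster and keeps its text, which is the point of comparing scan snippets rather than doing a plain search-and-replace on the OCR output.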

gsautter commented 2 years ago

One more thing: how many PDFs like this one? That's the pre-born-digital counterpart to the subject matter of https://github.com/plazi/ggi/issues/275, I take it?

flsimoes commented 2 years ago

> One more thing: how many PDFs like this one? That's the pre-born-digital counterpart to the subject matter of plazi/ggi#275, I take it?

Yes, it's the same journal, before it changed names. No idea how many papers the journal itself will have, but we are currently aiming at 27 specific papers published by Londt there.