Open myrmoteras opened 2 years ago
For a batch of articles, I cannot say for sure, as it's impossible to foresee whether or not there are overarching patterns to exploit (at least without my crystal ball, which is broken) ... for a single article, this could be doable, provided there is sufficient regularity in each one in isolation ... how many articles are we talking?
Oh, and the file behind the londt_1982b.pdf link is empty.
londt_1982b.pdf
Got it ... the OCR is remarkably good, actually, especially the portions in italics have remarkably few errors in comparison to other OCR we've seen ... trying out a tool using visual similarity of the scan snippets underneath the individual words to perform some clustering and correction in bulk ... the UI needs some work, but the general approach seems quite promising ...
One more thing: how many PDFs like this one? That's the pre-born-digital counterpart to the subject matter of https://github.com/plazi/ggi/issues/275, I take it?
Yes, it's the same journal, before it changed names. No idea how many papers the journal itself will have, but we are currently aiming at 27 specific papers published by Londt there.
In this case, the female symbol is recognized as a large bold 2. Is there a way to fix this in a batch of articles?
londt_1982b.pdf
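For illustration only, here is a rough sketch of what a rule-based batch fix for the "large bold 2" case might look like. It assumes the OCR output is available as a list of word records per article, with hypothetical keys `text`, `bold`, and `font_size`; the function names and the `size_factor` threshold are made up for this example and are not part of any existing tool. It is not the visual-similarity clustering approach mentioned in this thread, just a simple heuristic: a "2" that is bold and clearly larger than the article's median font size is treated as a misread female symbol.

```python
from statistics import median

FEMALE = "\u2640"  # the female symbol


def fix_female_symbol(words, size_factor=1.2):
    """Return a copy of `words` with suspicious '2' tokens replaced.

    `words` is a list of dicts with the (hypothetical) keys
    'text', 'bold', and 'font_size'. A token counts as suspicious if
    it is exactly '2', flagged bold, and at least `size_factor` times
    the median font size of the article.
    """
    if not words:
        return []
    body_size = median(w["font_size"] for w in words)
    fixed = []
    for w in words:
        suspicious = (
            w["text"] == "2"
            and w["bold"]
            and w["font_size"] >= size_factor * body_size
        )
        # Copy every record so the input list is left untouched.
        fixed.append({**w, "text": FEMALE} if suspicious else dict(w))
    return fixed


def fix_batch(articles, size_factor=1.2):
    """Apply the same rule to every article in a {name: words} mapping."""
    return {
        name: fix_female_symbol(words, size_factor)
        for name, words in articles.items()
    }


if __name__ == "__main__":
    sample = [
        {"text": "Holotype", "bold": False, "font_size": 9.0},
        {"text": "2", "bold": True, "font_size": 12.0},   # misread symbol
        {"text": "2", "bold": False, "font_size": 9.0},   # a real digit
        {"text": "km", "bold": False, "font_size": 9.0},
    ]
    print([w["text"] for w in fix_female_symbol(sample)])
```

Since the threshold is computed per article, the same rule can be run over all the papers in one go, though the matches would still need a manual check wherever a genuine bold "2" appears in a heading.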