plazi / arcadia-project


microservice: what to do with files that do not process? #205

Open myrmoteras opened 1 year ago

myrmoteras commented 1 year ago

Here are two files that I uploaded through the microservice and that do not process properly: https://github.com/plazi/GoldenGATE-Imagine/issues/29

How are we dealing with this?

gsautter commented 1 year ago

Your best bet might be to download the PDF and decode it locally, where you can play with the options. The server-side decoder inevitably uses default settings, which are somewhat conservative because you never know what's coming in. The font reference should increase accuracy considerably once we have it.

flsimoes commented 1 year ago

The Taiwania paper Donat sent us yesterday for processing took a long time to decode, but eventually opened without font problems, so that could be a hint

gsautter commented 1 year ago

> The Taiwania paper Donat sent us yesterday for processing took a long time to decode, but eventually opened without font problems, so that could be a hint

What that indicates is basically that decoding succeeded using the built-in reference fonts (FreeSerif and FreeSans). Matching the glyphs to the reference at the bitmap level takes forever, but it works reasonably well in most cases.
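To illustrate why bitmap-level matching is slow, here is a minimal Python sketch of a brute-force matcher. It is not the GoldenGATE decoder's actual code: the names `pixel_distance`, `best_match`, and `REFERENCE`, and the toy 3x3 bitmaps, are assumptions made up for the example. Every unknown glyph is compared pixel by pixel against every reference glyph, so the cost grows with the number of glyphs times the size of the reference.

```python
from typing import Dict, List, Tuple

Bitmap = List[List[int]]  # rows of 0/1 pixels; all bitmaps share one size


def pixel_distance(a: Bitmap, b: Bitmap) -> int:
    """Count the pixels at which two equally sized bitmaps differ."""
    return sum(
        pa != pb
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )


def best_match(glyph: Bitmap, reference: Dict[str, Bitmap]) -> Tuple[str, int]:
    """Compare the glyph with every reference bitmap; return the closest one."""
    return min(
        ((char, pixel_distance(glyph, ref)) for char, ref in reference.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical toy reference; a real one would hold rendered glyphs per font.
REFERENCE = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}

# A slightly degraded "L", as it might come from a PDF with an unknown font.
noisy_l = [[1, 0, 0], [1, 0, 0], [1, 1, 0]]
print(best_match(noisy_l, REFERENCE))  # closest character and its distance
```

The closest reference still wins when the glyph is not an exact copy, which is consistent with the "works reasonably well" observation above, but every lookup scans the whole reference.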

gsautter commented 1 year ago

The idea behind the font reference is to broaden the basis beyond the liberation fonts and, hopefully, to get identical matches right away for the majority of the glyphs (minuscule differences tend to disappear at 32x32 pixels), improving both accuracy and performance. Another aspect is to cover a broader range of representations for glyphs that vary considerably between fonts (e.g. the male symbol), which should increase accuracy in that department as well.
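The "identical match right away" idea can be sketched as an exact lookup on normalized bitmaps. This is an illustration under stated assumptions, not the actual font reference: the majority-vote `downsample`, the `build_index` and `lookup` helpers, and the toy 4x4 glyphs reduced to 2x2 are all hypothetical (the thread mentions 32x32 as the working size). Once two renderings are reduced to the same small grid, a one-pixel difference can vanish and a dictionary lookup replaces the search.

```python
from typing import Dict, List, Optional, Tuple

Bitmap = List[List[int]]            # square grid of 0/1 pixels
Key = Tuple[Tuple[int, ...], ...]   # hashable, downsampled bitmap


def downsample(bitmap: Bitmap, size: int) -> Key:
    """Shrink a square bitmap to size x size by majority vote per block.

    Assumes the side length of the bitmap is a multiple of size.
    """
    block = len(bitmap) // size
    rows = []
    for r in range(size):
        row = []
        for c in range(size):
            ink = sum(
                bitmap[r * block + dr][c * block + dc]
                for dr in range(block)
                for dc in range(block)
            )
            row.append(1 if 2 * ink > block * block else 0)
        rows.append(tuple(row))
    return tuple(rows)


def build_index(glyphs: Dict[str, Bitmap], size: int) -> Dict[Key, str]:
    """Map each downsampled reference bitmap to its character."""
    return {downsample(bitmap, size): char for char, bitmap in glyphs.items()}


def lookup(glyph: Bitmap, index: Dict[Key, str], size: int) -> Optional[str]:
    """Return the character on an identical match, or None if there is none."""
    return index.get(downsample(glyph, size))


# Hypothetical reference glyphs from one font.
bar = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0]]
dash = [[0, 0, 0, 0], [0, 0, 0, 0], [1, 1, 1, 1], [1, 1, 1, 1]]
index = build_index({"|": bar, "_": dash}, size=2)

# The same bar from another font, differing by a single pixel.
other_bar = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0]]
print(lookup(other_bar, index, size=2))
```

A glyph with no identical entry returns `None`, at which point a decoder would presumably fall back to the slower nearest-match comparison. A broader reference makes that fallback rarer.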