impresso / impresso-text-acquisition

🛠️ Python library to import OCR data in various formats into the canonical JSON format defined by the Impresso project.
https://impresso.github.io/impresso-text-acquisition/
GNU Affero General Public License v3.0

L'essor 2006-2015 has terrible text recognition #88

Open simon-clematide opened 4 years ago

simon-clematide commented 4 years ago

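Something went badly wrong with the text recognition for these years. The same material on e-newspaperarchives.ch (https://www.e-newspaperarchives.ch/?a=d&d=LES20070601-01.2.13&e=-------fr-20--1--img-txIN--------0-----) does not seem to suffer from this: both the text and the layout recognition are perfect there. Mean OCR QA scores per year for LES, sorted from lowest to highest: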

LES/LES-2007_ocrqa.jsonl.err:#MEANOCRQA 0.19164556962025325
LES/LES-2008_ocrqa.jsonl.err:#MEANOCRQA 0.20903409090909086
LES/LES-2011_ocrqa.jsonl.err:#MEANOCRQA 0.209113924050633
LES/LES-2010_ocrqa.jsonl.err:#MEANOCRQA 0.20980392156862743
LES/LES-2006_ocrqa.jsonl.err:#MEANOCRQA 0.2110344827586207
LES/LES-2009_ocrqa.jsonl.err:#MEANOCRQA 0.217888198757764
LES/LES-2014_ocrqa.jsonl.err:#MEANOCRQA 0.2277848101265822
LES/LES-2013_ocrqa.jsonl.err:#MEANOCRQA 0.2373456790123456
LES/LES-2012_ocrqa.jsonl.err:#MEANOCRQA 0.23814102564102568
LES/LES-2015_ocrqa.jsonl.err:#MEANOCRQA 0.24166666666666667
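
To make the per-year numbers above easier to compare, here is a minimal sketch that parses grep-style `#MEANOCRQA` lines like the ones pasted in this comment. It assumes the path layout `<journal>/<journal>-<year>_ocrqa.jsonl.err` seen above; the function names `parse_mean_ocrqa` and `years_below`, and the `threshold` parameter, are hypothetical helpers for illustration and are not part of impresso-text-acquisition.

```python
import re

# Matches grep-style lines such as:
#   LES/LES-2007_ocrqa.jsonl.err:#MEANOCRQA 0.19164556962025325
# The path layout is assumed from the lines pasted in this issue.
LINE_RE = re.compile(
    r"^(?P<journal>[^/]+)/(?P=journal)-(?P<year>\d{4})_ocrqa\.jsonl\.err"
    r":#MEANOCRQA (?P<score>\d+(?:\.\d+)?)$"
)


def parse_mean_ocrqa(lines):
    """Return {(journal, year): score} for every line that matches LINE_RE."""
    scores = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            key = (match["journal"], int(match["year"]))
            scores[key] = float(match["score"])
    return scores


def years_below(scores, threshold):
    """List (journal, year) keys whose score is below a caller-chosen cutoff.

    The cutoff is a hypothetical parameter; this issue does not define
    what counts as an acceptable score.
    """
    return sorted(key for key, score in scores.items() if score < threshold)


sample = [
    "LES/LES-2007_ocrqa.jsonl.err:#MEANOCRQA 0.19164556962025325",
    "LES/LES-2015_ocrqa.jsonl.err:#MEANOCRQA 0.24166666666666667",
]
scores = parse_mean_ocrqa(sample)
for (journal, year), score in sorted(scores.items()):
    print(journal, year, f"{score:.3f}")
```

Sorting by `(journal, year)` rather than by score makes it easier to see whether the scores drift over time or stay uniformly low across the whole 2006-2015 range.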