jwilk-archive / ocrodjvu

OCR for DjVu
GNU General Public License v2.0

Allow editing hOCR (or TSV) files #19

Open jwilk opened 8 years ago

jwilk commented 8 years ago

Issue reported by @jsbien:

I would very much like to be able to correct OCR recognition errors in the hOCR files before converting them to djvused scripts and using them as the hidden text.

This could be a run-time option or a separate utility, e.g. hocr2djvused.

Actually, it would probably be better to replace hOCR files with TSV (currently available only in Tesseract 3.05-dev, in the master branch on GitHub).

A real-life example: in the OCR results of Linde's dictionary I have to replace, in particular, 8626 occurrences of Boss by Ross (an abbreviation for the Russian language). It seems easiest to do this at the hOCR level, which is used not only for the hidden text but also for the Poliqarp corpus.
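
To illustrate the kind of correction meant here, a minimal sketch in Python follows. It is not part of ocrodjvu; the function name `replace_word_in_hocr` is hypothetical, and the sketch assumes the hOCR file is ordinary (X)HTML in which the recognized words appear as text between tags. It replaces whole-word matches in text only, so attribute values such as `title='bbox ...'` are left untouched.

```python
import re

# Splitting on this pattern (with a capturing group) yields a list that
# alternates between text (even indices) and tags (odd indices).
TAG_RE = re.compile(r"(<[^>]*>)")


def replace_word_in_hocr(hocr, old, new):
    """Replace whole-word occurrences of `old` by `new` in text nodes only.

    Hypothetical helper for illustration; returns (new_text, replacement_count).
    """
    word_re = re.compile(r"\b" + re.escape(old) + r"\b")
    parts = TAG_RE.split(hocr)
    count = 0
    for i in range(0, len(parts), 2):  # even indices hold text, not markup
        parts[i], n = word_re.subn(new, parts[i])
        count += n
    return "".join(parts), count


# Made-up hOCR fragment; the bbox values are illustrative only.
sample = (
    "<span class='ocrx_word' title='bbox 10 20 60 40'>Boss.</span> "
    "<span class='ocrx_word' title='bbox 70 20 130 40'>Bossy</span>"
)
fixed, n = replace_word_in_hocr(sample, "Boss", "Ross")
print(n)
print(fixed)
```

Because the match is anchored on word boundaries, a longer word such as "Bossy" is not affected. The corrected hOCR would then still have to be converted to a djvused script by whatever tool the requested option or utility provides.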

Thanks for your work on ocrodjvu!

JSB

jsbien commented 6 years ago

I think the issue can be closed now, as I moved its content to #27 and #28.