impresso / impresso-text-acquisition

🛠️ Python library to import OCR data in various formats into the canonical JSON format defined by the Impresso project.
https://impresso.github.io/impresso-text-acquisition/
GNU Affero General Public License v3.0

ONB importer #120

Open e-maud opened 1 year ago

e-maud commented 1 year ago

High-level issue on the conversion of ONB hOCR files to canonical.
Partially depends on milestone 🚩: ONB Acquisition.

piconti commented 11 months ago

Update on the progress for the ONB importer.

A first version of the ONB importer Alto -> Canonical has been implemented to handle all the ANNO data.

To get a better idea of the options for the ANNOP data, which is in hOCR format, a few hOCR -> Alto converters were tested on a small sample of data. Note that the source data (sample esj/1772/0057) seems to have some irregularities in its formatting or contents. In particular, none of the converters tried worked when the source file contained the entity &shy; between two <span> tags. E.g.: <span class='ocrx_word' title='bbox 851 1642 933 1682;x_wconf 28'>’110/2</span>&shy;</span><span class='ocr_line' title='bbox 141 1704 953 1790;x_wconf 43'>
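One possible explanation, which is an assumption and was not verified against the converters themselves, is that they parse hOCR as strict XML. XML predefines only five named entities, so the HTML entity `&shy;` is rejected as undefined. The sketch below reproduces this with Python's standard library on a simplified fragment and shows a pre-processing workaround. The helper names `parses_as_xml` and `replace_soft_hyphen_entity` are hypothetical.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup: str) -> bool:
    """Return True if `markup` is accepted by a strict XML parser."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def replace_soft_hyphen_entity(markup: str) -> str:
    """Swap the HTML-only entity &shy; for its numeric form, which XML accepts."""
    return markup.replace("&shy;", "&#173;")


# Simplified fragment modelled on the sample above (attributes omitted).
fragment = (
    "<span class='ocr_line'>"
    "<span class='ocrx_word'>110/2</span>&shy;"
    "</span>"
)

print(parses_as_xml(fragment))
print(parses_as_xml(replace_soft_hyphen_entity(fragment)))
```

If this is indeed the cause, replacing the entity before conversion might be enough to let an existing converter run. Other irregularities in the data could still make it fail.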

The converters tested were the following:

Overall, it seems easier and more reliable to implement a second ONB importer that performs hOCR -> Canonical directly, as the hOCR syntax is relatively simple, especially given that the data is slightly irregular.
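As a rough illustration of how little is needed to read hOCR directly, here is a minimal sketch based on Python's lenient `html.parser`. It is not the importer's actual code. The class `HocrWordParser`, the function `extract_words` and the output dictionary layout are hypothetical and do not follow the canonical schema. The sketch collects the text, bounding box and confidence of each `ocrx_word` span and ignores stray content such as `&shy;` between spans.

```python
import re
from html.parser import HTMLParser

# hOCR stores its properties in the `title` attribute, separated by ';'.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
WCONF_RE = re.compile(r"x_wconf (\d+)")


class HocrWordParser(HTMLParser):
    """Collect words from hOCR markup, tolerating irregular input."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.words = []
        self._current = None  # word being read, if any
        self._depth = 0       # nested <span> depth inside that word

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        if self._current is not None:
            self._depth += 1  # a span nested inside a word span
        elif "ocrx_word" in (attrs.get("class") or "").split():
            title = attrs.get("title") or ""
            bbox = BBOX_RE.search(title)
            conf = WCONF_RE.search(title)
            self._current = {
                "text": "",
                # x0, y0, x1, y1 as given in the hOCR file
                "bbox": [int(v) for v in bbox.groups()] if bbox else None,
                "conf": int(conf.group(1)) if conf else None,
            }
            self._depth = 0

    def handle_data(self, data):
        # Text outside word spans (e.g. a lone soft hyphen) is dropped.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag != "span" or self._current is None:
            return
        if self._depth > 0:
            self._depth -= 1
        else:
            self.words.append(self._current)
            self._current = None


def extract_words(hocr: str) -> list[dict]:
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

Calling `extract_words` on the problematic fragment quoted earlier returns a single word with its coordinates and confidence. The `&shy;` entity and the unmatched closing tag are skipped without raising an error.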