Open e-maud opened 1 year ago
Update on the progress for the ONB importer.
A first version of the ONB importer (ALTO -> Canonical) has been implemented to handle all the ANNO data.
To get a better idea of what is possible for the ANNOP data, which is in hOCR format, a few hOCR -> ALTO converters were tested on a small sample of data.
It's worth noting that the source data (sample esj/1772/0057) does seem to have some irregularities in its formatting or contents. In particular, none of the converters tried worked whenever the source file contained a soft hyphen character (U+00AD) between two <span> tags. E.g.:
<span class='ocrx_word' title='bbox 851 1642 933 1682;x_wconf 28'>’110/2</span>­</span><span class='ocr_line' title='bbox 141 1704 953 1790;x_wconf 43'>
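If one of the existing converters were still to be used, one possible workaround would be to strip the offending character before conversion. The snippet below is only a sketch: it assumes the stray character in the example above is a soft hyphen (U+00AD), possibly also spelled as an HTML entity, and the function name `strip_soft_hyphens` is hypothetical. Whether this is enough to make the tested converters accept the files has not been verified.

```python
import re

# Matches the literal soft hyphen (U+00AD) and its common HTML entity spellings.
# Assumption: these are the only forms in which the character appears in the data.
_SOFT_HYPHEN_RE = re.compile(r"\u00ad|&shy;|&#0*173;|&#x0*ad;", re.IGNORECASE)


def strip_soft_hyphens(hocr_text: str) -> str:
    """Return the hOCR source with every soft hyphen removed."""
    return _SOFT_HYPHEN_RE.sub("", hocr_text)
```

Note that this removes soft hyphens everywhere, including inside words, which may or may not be desirable depending on how hyphenation should be handled downstream.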
The converters tested were the following:
Overall, since the hOCR syntax is relatively simple, it seems easier and more reliable to directly implement another ONB importer performing hOCR -> Canonical, especially given the slightly irregular data.
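To illustrate why reading hOCR directly is fairly simple, here is a rough sketch using only the Python standard library. It is not the actual importer: the names `Word`, `parse_title`, `HocrWordParser` and `parse_hocr_words` are hypothetical, it only collects `ocrx_word` spans with their `bbox` and `x_wconf` values (assumed to be integers, as in the sample), and the mapping to the canonical format is left out entirely.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List, Optional, Tuple


@dataclass
class Word:
    text: str
    bbox: Tuple[int, ...]  # (x0, y0, x1, y1) as given in the title attribute
    conf: Optional[int]    # x_wconf value, None if absent


def parse_title(title: str) -> dict:
    """Split an hOCR title such as 'bbox 851 1642 933 1682;x_wconf 28'."""
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props


class HocrWordParser(HTMLParser):
    """Collects every <span class='ocrx_word'> with its coordinates."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.words: List[Word] = []
        self._stack: List[str] = []          # class of each currently open <span>
        self._current: Optional[Word] = None  # word span being filled, if any

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attr_map = dict(attrs)
        cls = attr_map.get("class") or ""
        self._stack.append(cls)
        if cls == "ocrx_word":
            props = parse_title(attr_map.get("title") or "")
            bbox = tuple(int(v) for v in props.get("bbox", []))
            conf = props.get("x_wconf")
            self._current = Word("", bbox, int(conf[0]) if conf else None)

    def handle_endtag(self, tag):
        if tag != "span" or not self._stack:
            return
        if self._stack.pop() == "ocrx_word" and self._current is not None:
            self.words.append(self._current)
            self._current = None

    def handle_data(self, data):
        # Text outside a word span (e.g. a stray soft hyphen between two
        # closing tags) is ignored; soft hyphens inside a word are dropped.
        if self._current is not None:
            self._current.text += data.replace("\u00ad", "")


def parse_hocr_words(hocr_text: str) -> List[Word]:
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words
```

Because text nodes outside word spans are simply skipped, a soft hyphen sitting between two closing `</span>` tags, as in the sample above, does not need any special handling in this approach.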
High-level issue on ONB hOCR file conversion to canonical.
Partially depends on milestone :triangular_flag_on_post:: ONB Acquisition.
[x] Exploration and decision on approach
It will have to be decided which is better: converting hOCR -> Canonical directly, or going through ALTO to reuse code that has already been written (hOCR -> ALTO -> Canonical).
Just in case, a few links that may be useful:
[ ] First implementation on samples
[ ] Full importer test on ONB complete data