ONB importer - Githubissues

Update on the progress for the ONB importer.

A first version of the ONB importer (ALTO -> Canonical) has been implemented and handles all the ANNO data.

To get a better idea of the options for the ANNOP data, which is in hOCR format, a few hOCR -> ALTO converters were tested on a small sample of data. Note that the source data (sample esj/1772/0057) seems to have some irregularities in its formatting or contents. In particular, none of the converters tried worked whenever the source file contained the characters  between two  separators. E.g.: ’110/2

The converters tested were the following:

ocr-fileformat
- Unsatisfactory results, with the values for height systematically missing or NaN.
- Text-style and language information is lost in the process.
- Runs using Docker, with either a web interface or a CLI.
hOCR-to-ALTO
- Similar results to ocr-fileformat. Coordinates are also not parsed correctly.
- Text-style and language information is lost in the process.
hOCRTools
- Yields the best results of all the tested converters. The coordinates are parsed correctly.
- Text-style and language information is lost in the process.
- Could theoretically be used, but it failed to run on many of the sample pages tried, so it cannot be used at the relatively large scale we would need.
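
For reference, the "height missing or NaN" problem can be checked automatically on converter output. The snippet below is only a rough sketch, assuming the converted files are ALTO XML whose words are `String` elements carrying a `HEIGHT` attribute; the function name `count_bad_heights` is made up for illustration and is not part of any of the tools above.

```python
import math
import xml.etree.ElementTree as ET


def count_bad_heights(alto_xml):
    """Return (total, bad): number of String elements, and how many of them
    have a HEIGHT attribute that is missing or not a finite number."""
    root = ET.fromstring(alto_xml)
    total = 0
    bad = 0
    for elem in root.iter():
        # Compare the local tag name only, since the ALTO namespace
        # differs between schema versions.
        if elem.tag.rsplit("}", 1)[-1] != "String":
            continue
        total += 1
        raw = elem.get("HEIGHT")
        try:
            ok = raw is not None and math.isfinite(float(raw))
        except ValueError:
            ok = False  # e.g. an empty or non-numeric value
        if not ok:
            bad += 1
    return total, bad
```

Running this over each converted page gives a quick count of unusable words per converter, which is easier to compare than eyeballing the XML.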

Overall, especially given the slightly irregular data, it seems easier and more reliable to directly implement another ONB importer performing hOCR -> Canonical, as the hOCR syntax is relatively simple.
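
To illustrate how little is needed to read hOCR directly, here is a minimal sketch, not the actual importer. It assumes words are marked with the standard hOCR class `ocrx_word` and carry a `bbox x0 y0 x1 y1` entry in their `title` attribute. The output layout (`text` plus `[x, y, width, height]` coords) and the names `HocrWordParser` and `parse_hocr_words` are placeholders chosen for this example, not the Canonical schema. It uses the lenient `html.parser` from the standard library rather than a strict XML parser, so that slightly malformed pages do not abort the run.

```python
import re
from html.parser import HTMLParser

# hOCR stores coordinates in the title attribute, e.g. "bbox 10 20 110 60; x_wconf 90"
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")
# Tags that never have a closing tag and must not affect nesting depth
VOID_TAGS = {"br", "hr", "img"}


class HocrWordParser(HTMLParser):
    """Collects the text and bounding box of every ocrx_word element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.words = []
        self._current = None  # word currently being collected, if any
        self._depth = 0       # nesting depth inside the current word

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._current is not None:
            self._depth += 1  # nested markup inside a word, e.g. <strong>
            return
        attrs = dict(attrs)
        if "ocrx_word" not in (attrs.get("class") or "").split():
            return
        match = BBOX_RE.search(attrs.get("title") or "")
        if match is None:
            return  # skip words without a usable bounding box
        x0, y0, x1, y1 = map(int, match.groups())
        self._current = {"text": "", "coords": [x0, y0, x1 - x0, y1 - y0]}
        self._depth = 1

    def handle_endtag(self, tag):
        if self._current is None or tag in VOID_TAGS:
            return
        self._depth -= 1
        if self._depth == 0:
            self._current["text"] = self._current["text"].strip()
            self.words.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data


def parse_hocr_words(hocr):
    """Return a list of {"text": ..., "coords": [x, y, w, h]} dicts."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

Words with an unparsable `title` are skipped rather than raising, which seems the safer default for irregular pages; a real importer would probably want to log them instead.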

impresso / impresso-text-acquisition

ONB importer #120