impresso / impresso-text-acquisition

Python library to import OCR data in various formats into the canonical JSON format defined by the Impresso project.
https://impresso.github.io/impresso-text-acquisition/
GNU Affero General Public License v3.0

KB importer #123

Open piconti opened 7 months ago

piconti commented 7 months ago

Implement the importer for KB data, which is in DIDL-ALTO format, based on the sample data provided.
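
For orientation, here is a minimal sketch of what reading such a package could look like. It assumes the DIDL file is an MPEG-21 DIDL document whose `Resource` elements point to per-page ALTO files through a `ref` attribute. The sample XML, the file names, and the `list_resources` helper are hypothetical; they are not taken from the KB samples or from the importer code.

```python
import xml.etree.ElementTree as ET

# Namespace used by MPEG-21 DIDL documents.
DIDL_NS = {"didl": "urn:mpeg:mpeg21:2002:02-DIDL-NS"}

# Hypothetical, hand-written sample; real KB files may be structured differently.
SAMPLE_DIDL = """<didl:DIDL xmlns:didl="urn:mpeg:mpeg21:2002:02-DIDL-NS">
  <didl:Item>
    <didl:Item>
      <didl:Component>
        <didl:Resource mimeType="text/xml" ref="page_0001_alto.xml"/>
      </didl:Component>
      <didl:Component>
        <didl:Resource mimeType="image/jp2" ref="page_0001.jp2"/>
      </didl:Component>
    </didl:Item>
  </didl:Item>
</didl:DIDL>
"""


def list_resources(didl_xml, mime_type="text/xml"):
    """Return the `ref` of every DIDL Resource with the given MIME type."""
    root = ET.fromstring(didl_xml)
    refs = []
    # Walk all Resource elements, at any nesting depth.
    for resource in root.iterfind(".//didl:Resource", DIDL_NS):
        ref = resource.get("ref")
        if resource.get("mimeType") == mime_type and ref:
            refs.append(ref)
    return refs


if __name__ == "__main__":
    # List the (assumed) ALTO files referenced by the sample.
    print(list_resources(SAMPLE_DIDL))
```

Filtering on the MIME type alone is a simplification: other XML resources in a real package would also match, so the actual importer may need a stricter criterion.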

piconti commented 7 months ago

Update after the first implementation of the KB importer.

The main functions in kb.detect.py and kb.classes.py have been implemented and work on the provided samples. However, the implementation surfaced several specificities of KB's format (in particular the DIDL format). Some of them may call for further questions to KB, to ensure the importer is ready and robust enough for larger-scale data. Others will require adjustments once more information is available, and how we should handle them is open to discussion.

These specificities are the following:

piconti commented 5 months ago

We have a response from KB.

TODO as a result: