KB importer - Githubissues

piconti commented 7 months ago

Implement the KB importer for data in DIDL-ALTO format, based on the sample data provided.

piconti commented 7 months ago

Update after the first implementation of the KB importer.

The main functions in kb.detect.py and kb.classes.py have been implemented and work on the provided samples. However, during the implementation, some specificities of KB's format (in particular the DIDL format) were identified. Some of them may call for further questions to KB, to ensure the importer is ready and robust enough for larger-scale data. Others will require adjustments once more information is available, and how we should handle them is open to discussion.

These specificities are the following:

File structure:
- In the provided sample, the files were not separated by journal (only by year > month > day > issue_identifier). An index .tsv file was provided to link each journal to the paths of its issues and their publication dates.
- While a top-level directory separating the data of each journal would be ideal (journal > year > month > day > issue_identifier), the present file structure can work at a larger scale, as long as a similar index .csv or .tsv file is provided. Indeed, the current detect_issues function uses this index to identify which issues belong to which journal and to filter the issues to import.
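
As an illustration of the index-based lookup described above, here is a minimal sketch. It is not the actual detect_issues code: the function name `load_issue_index`, the column names `journal`, `path` and `date`, and the sample rows are all assumptions made for the example.

```python
import csv
import io
from collections import defaultdict
from datetime import date


def load_issue_index(tsv_file, journals=None):
    """Group issue paths by journal from a tab-separated index.

    Hypothetical layout: one row per issue, with the columns
    `journal`, `path` and `date` (ISO format). The real index
    provided by KB may use different columns.
    """
    issues = defaultdict(list)
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        # Optionally keep only the journals selected for import.
        if journals is not None and row["journal"] not in journals:
            continue
        issues[row["journal"]].append(
            (date.fromisoformat(row["date"]), row["path"])
        )
    # Sort each journal's issues chronologically.
    return {journal: sorted(entries) for journal, entries in issues.items()}


# Made-up sample rows following year/month/day/issue_identifier.
sample = io.StringIO(
    "journal\tpath\tdate\n"
    "Journal A\t1900/01/03/issue_b\t1900-01-03\n"
    "Journal B\t1900/01/02/issue_c\t1900-01-02\n"
    "Journal A\t1900/01/02/issue_a\t1900-01-02\n"
)
index = load_issue_index(sample, journals={"Journal A"})
for issue_date, path in index["Journal A"]:
    print(issue_date, path)
```

With such an index, the directory layout itself does not need to encode the journal, which is why the flat year > month > day structure remains workable.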
Journal/Title Aliases or IDs
- Most of the data providers use some sort of aliases or human-readable IDs for their various titles, but none have been communicated or found for KB yet.
- Currently, a function mapping each journal to an 8-digit ID has been implemented. However, since these IDs do not convey any information about the journal, and since KB's collection comprises around 1000 journals, this cannot be a viable long-term solution.
- If KB has an internal human-readable alias system, we could use it. Otherwise, we could develop an alias system of our own, but the journal titles come in a large variety of formats, and some are very similar to each other. As a result, finding a systematic approach that generates unique aliases could prove tricky.
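
To make the uniqueness problem concrete, here is one possible scheme, sketched under assumptions: the function `make_aliases`, the abbreviation rule (first four letters of each word) and the example titles are all made up for illustration and are not an agreed convention.

```python
import re
import unicodedata


def make_aliases(titles, max_len=12):
    """Derive a unique, readable alias for each (distinct) title.

    Illustrative scheme only: abbreviate each word to four letters,
    join the parts, and append a counter when two titles collide.
    """
    aliases = {}
    used = set()
    for title in titles:
        # Strip accents and keep only ASCII letters and digits.
        ascii_title = (
            unicodedata.normalize("NFKD", title)
            .encode("ascii", "ignore")
            .decode()
        )
        words = re.findall(r"[A-Za-z0-9]+", ascii_title)
        base = "".join(w[:4].capitalize() for w in words)[:max_len] or "Untitled"
        # Disambiguate near-identical titles with a numeric suffix.
        alias, n = base, 2
        while alias in used:
            alias = f"{base}{n}"
            n += 1
        used.add(alias)
        aliases[title] = alias
    return aliases


# Two made-up, deliberately similar titles.
aliases = make_aliases(["Example Daily News", "Example Daily Newspaper"])
print(aliases)
```

The numeric suffix guarantees uniqueness, but the resulting alias depends on the order in which titles are processed, so a fixed, curated mapping would probably still be preferable for a collection of this size.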
Segmented Images
- After multiple attempts, I was not able to find any segmented area corresponding to illustrations or images in the current samples. A few illustrations do appear on the pages, but no corresponding item was found in the DIDL or ALTO files.
- We will need to ask KB whether their OLR also segments images. If so, we might also ask for an example of a more heavily illustrated issue in order to write the code handling images.
Content-item ordering
- In KB's OLR, all the segmented items or articles are numbered at the issue level. This numbering has been used for now to number the created content items, but it appears that the items can be shuffled and do not necessarily follow a logical page-to-page ordering.
- Another issue (#74) already covers this subject. Adding a reading order to all canonical data could be a solution, but this is still to be worked on.
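
As a rough idea of what a page-based reading order could look like, here is a small sketch. It is only one possible heuristic, not the approach decided in #74; the function `assign_reading_order` and the item fields `id`, `olr_number` and `pages` are hypothetical names.

```python
def assign_reading_order(items):
    """Return {item_id: position}, ordered by first page, then by OLR number.

    Each item is a dict with the hypothetical keys `id`, `olr_number`
    (the issue-level number from the OLR) and `pages` (page numbers
    on which the item appears).
    """
    ordered = sorted(items, key=lambda it: (min(it["pages"]), it["olr_number"]))
    return {it["id"]: pos for pos, it in enumerate(ordered, start=1)}


# Made-up items whose OLR numbering does not follow the page order.
items = [
    {"id": "i0001", "olr_number": 1, "pages": [3]},
    {"id": "i0002", "olr_number": 2, "pages": [1, 2]},
    {"id": "i0003", "olr_number": 3, "pages": [1]},
]
reading_order = assign_reading_order(items)
print(reading_order)
```

This keeps the OLR number as a tie-breaker within a page, so it does not recover the true order of items on the same page; that would require layout coordinates.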
Content item types
- We currently have a list of content_item_types that we use in the canonical format. In KB's data some items are classified as "Familial message" (announcements of weddings, wedding anniversaries, or parties).
- This type could be added to the current list of content_item_types if it's found to be relevant.

piconti commented 5 months ago

We have a response from KB.

- They don't use aliases, so we should create a list of aliases ourselves.
- There are very few images in the data. Once we have a larger dump, I can look for them in the data.
- They have agreed to provide the new dump in the provided file structure.

TODO as a result:

[ ] Create mapping of aliases for KB
[ ] Adapt code to new file structure of data
[ ] Find images/illustrations in the data and implement specific code to handle them
[ ] Finalize importer code based on pilot
[ ] Comment & document
[ ] Merge into master

impresso / impresso-text-acquisition

KB importer #123