Open piconti opened 7 months ago
Update after the first implementation of the KB importer.
The main functions in kb.detect.py and kb.classes.py have been implemented and work on the provided samples.
However, during the implementation, some specificities of KB's format (in particular the DIDL format) were identified.
Some of them may warrant further questions to KB, to ensure the importer is ready and robust enough for larger-scale data.
Others will require adjustments once more information is available, and how we should handle them is open to discussion.
These specificities are the following:
- [...] (year > month > day > issue_identifier). An index .tsv file was provided to link each journal to the paths of the issues present and their publication date.
- [...] (journal > year > month > day > issue_identifier), the present file structure can work at larger scale, as long as a similar index .csv or .tsv file is provided. Indeed, the current detect_issues function uses this index to identify which issues belong to which journal and to filter the issues to import.
- [...] content_item_types that we use in the canonical format. In KB's data some items are classified as "Familial message" (announcements of weddings, wedding anniversaries, or parties).
- [...] content_item_types if it's found to be relevant.

We have a response from KB.
TODO as a result:
Implement the importer for KB, whose data is in DIDL-ALTO format, given the sample data provided.
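
For discussion, here is a minimal sketch of the index-based detection described above: read the index file, group issues by journal, then filter the ones to import. It is not the actual detect_issues code. The column names (journal, date, path), the ISO date format, the sample rows, and the helper names read_index and select_issues are all assumptions made for illustration; the real index provided by KB may be laid out differently.

```python
import csv
import io
from collections import defaultdict
from datetime import date

# Hypothetical index content: the column names and rows are assumptions,
# not the actual layout of the index file provided by KB.
SAMPLE_INDEX = (
    "journal\tdate\tpath\n"
    "journal_a\t1900-01-02\t1900/01/02/issue_001\n"
    "journal_b\t1900-01-02\t1900/01/02/issue_002\n"
    "journal_a\t1901-03-04\t1901/03/04/issue_003\n"
)


def read_index(handle, delimiter="\t"):
    """Group the rows of an index file by journal.

    Returns a dict mapping each journal to a list of (publication date, path).
    Pass delimiter="," for a .csv index.
    """
    issues = defaultdict(list)
    for row in csv.DictReader(handle, delimiter=delimiter):
        pub_date = date.fromisoformat(row["date"])
        issues[row["journal"]].append((pub_date, row["path"]))
    return dict(issues)


def select_issues(index, journals=None, start=None, end=None):
    """Return (journal, date, path) tuples matching the given filters.

    A filter left as None is not applied. Date bounds are inclusive.
    """
    selected = []
    for journal in sorted(index):
        if journals is not None and journal not in journals:
            continue
        for pub_date, path in sorted(index[journal]):
            if start is not None and pub_date < start:
                continue
            if end is not None and pub_date > end:
                continue
            selected.append((journal, pub_date, path))
    return selected


if __name__ == "__main__":
    index = read_index(io.StringIO(SAMPLE_INDEX))
    for item in select_issues(index, journals={"journal_a"}):
        print(item)
```

Because the journal comes from the index rather than from the directory layout, this approach does not need a journal level in the paths, which is why a similar index would be needed for larger deliveries.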