Integrate data from a translation correlation list, excluded from the automatic integration, but enriched with local data

SvenLieber commented 11 months ago

For some translations we want to prioritize manually curated data, similar to what we do for contributors (https://github.com/kbrbe/beltrans-data-integration/issues/176).

However, for translations we want a more inclusive approach than for contributors. This means that we only provide some basic information in the correlation list for translations and add more information from the data sources via the schema:sameAs link as necessary. For contributors the approach is exclusive: only the values provided in the correlation list are used. But in both cases the correlation list entries should be excluded from the automatic integration!
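
The difference between the two approaches can be illustrated with a minimal Python sketch. It is not the project's implementation (the actual integration works on RDF data via SPARQL queries); the function names `merge_translation` and `merge_contributor` and the representation of records as plain dictionaries are assumptions made for illustration only.

```python
def merge_translation(curated, linked_sources):
    """Inclusive approach (hypothetical helper): curated values win,
    gaps are filled from records linked via schema:sameAs."""
    merged = dict(curated)
    for record in linked_sources:
        for key, value in record.items():
            # Only fill in fields that the correlation list left empty.
            if merged.get(key) in (None, ""):
                merged[key] = value
    return merged


def merge_contributor(curated, linked_sources):
    """Exclusive approach (hypothetical helper): only values from the
    correlation list are used, linked source records are ignored."""
    return dict(curated)


# Placeholder values, not real data.
curated = {"title": "Curated title", "isbn": ""}
linked = [{"title": "Source title", "isbn": "source-isbn",
           "publisher": "Source publisher"}]

print(merge_translation(curated, linked))
print(merge_contributor(curated, linked))
```

In both cases the curated record is the starting point; the only difference is whether linked source records may contribute additional values.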

[x] Implement an ETL process of a translation correlation list
[x] Adapt the data integration SPARQL queries if necessary
[x] Write a software integration test to ensure correct results without side effects

SvenLieber commented 10 months ago

The software integration test added with the commit above verifies that the translation correlation list approach works if data is correctly added.

However, in the most recent data integration the correlation list entries were added to the target graph, but the automatic data integration did not take them into account.

Some data is corrupted, as seen in the screenshot below: the KBR identifier is used for all bf:identifiedBy relationships, and the rdf:value of the linked identifier entities for BnF, KB and Unesco also contains the KBR identifier.

This could explain why the automatic integration still added local records even though the correlation list entries had been added: there was no link, and hence it could not be detected that a record from the correlation list already existed.

There are several issues:

The mapping file contains wrong URI patterns (authority instead of manifestation)
Wrong extraction of identifiers from the correlation list (see screenshot)
Multiple columns with the same name in the correlation list

https://github.com/kbrbe/beltrans-data-integration/blob/a42daf6594a68106fc8a93e4b786ba41ec1010a2/data-integration/integrate-data.sh#L2286-L2294

The last four lines of this snippet all extract the column targetKBRIdentifier, but they should also extract targetBnFIdentifier, targetKBIdentifier and targetUnescoIdentifier. As a result, the files that should contain the related BnF, KB and Unesco identifiers all contain the KBR identifier.
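
As a rough illustration of the intended behavior, the following Python sketch reads each source's identifiers from its own column. It is not the code from integrate-data.sh; the function name `extract_identifiers`, the `ID_COLUMNS` mapping and the sample values are assumptions, only the four column names come from the correlation list.

```python
import csv
import io

# One column per data source, so no source silently falls back to the KBR column.
ID_COLUMNS = {
    "KBR": "targetKBRIdentifier",
    "BnF": "targetBnFIdentifier",
    "KB": "targetKBIdentifier",
    "Unesco": "targetUnescoIdentifier",
}


def extract_identifiers(rows, id_columns=ID_COLUMNS):
    """Collect the non-empty identifiers of each source (hypothetical helper)."""
    result = {source: [] for source in id_columns}
    for row in rows:
        for source, column in id_columns.items():
            value = (row.get(column) or "").strip()
            if value:
                result[source].append(value)
    return result


# Placeholder values, not real identifiers.
sample = io.StringIO(
    "targetKBRIdentifier,targetBnFIdentifier,targetKBIdentifier,targetUnescoIdentifier\n"
    "kbr1,bnf1,,unesco1\n"
    "kbr2,,kb2,\n"
)
print(extract_identifiers(csv.DictReader(sample)))
```

Keeping the source-to-column mapping in one place makes a copy-and-paste error like the one described above easier to spot.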

Additionally, the translation correlation list contained some columns multiple times. For example, the column targetKBIdentifier existed once with values and once without values; the latter was taken for the mapping and hence there were no KB identifiers.
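
A check for duplicate column names before the mapping step could catch this early. The following is a minimal sketch under the assumption that the correlation list is available as a CSV file; `find_duplicate_columns` is a hypothetical helper, not part of the repository.

```python
import csv
import io
from collections import Counter


def find_duplicate_columns(csv_file):
    """Return the sorted header names that occur more than once."""
    header = next(csv.reader(csv_file), [])
    counts = Counter(name.strip() for name in header)
    return sorted(name for name, count in counts.items() if count > 1)


# Placeholder data: targetKBIdentifier appears twice, the second time empty.
sample = "targetKBRIdentifier,targetKBIdentifier,targetKBIdentifier\nkbr1,kb1,\n"

print(find_duplicate_columns(io.StringIO(sample)))

# Python's csv.DictReader keeps the value of the last same-named column,
# which here is the empty one.
row = next(csv.DictReader(io.StringIO(sample)))
print(repr(row["targetKBIdentifier"]))
```

Failing the ETL run when this list is non-empty would be one way to prevent an empty duplicate column from shadowing the populated one.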

SvenLieber commented 10 months ago

There are still two issues:

[x] Source and target language were missing; after adding them, there should also be mapping rules for them
[x] If there is a source KBR identifier, the source data should be fetched via API and afterwards put through the KBR ETL pipeline such that we also have the source information in the BELTRANS KG

kbrbe / beltrans-data-integration

Integrate data from a translation correlation list, excluded from the automatic integration, but enriched with local data #190