kbrbe / beltrans-data-integration

Creating a FAIR Linked Data corpus for the BELTRANS research project about Belgian book translations NL-FR and FR-NL between 1970 and 2020
https://www.kbr.be/en/projects/beltrans/
MIT License

Integrate data from a translation correlation list, excluded from the automatic integration, but enriched with local data #190

Closed SvenLieber closed 9 months ago

SvenLieber commented 11 months ago

For some translations we want to prioritize manually curated data, similar to what we do for contributors (https://github.com/kbrbe/beltrans-data-integration/issues/176).

However, for translations we want a more inclusive approach than for contributors: we only provide some basic information in the correlation list for translations and add more information from the data sources via the schema:sameAs link as necessary. For contributors the approach is exclusive: only the values provided in the correlation list are used. But in both cases the correlation list entries should be excluded from the automatic integration!
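The difference between the two approaches can be sketched as follows. This is only an illustration, not the project's implementation: the function names, the dictionary-based record layout and the field names (`title`, `isbn`, `publisher`) are assumptions made for the example.

```python
from typing import Dict, List, Optional

# Hypothetical record layout: field name -> value (None if unknown).
Record = Dict[str, Optional[str]]


def merge_inclusive(curated: Record, linked_sources: List[Record]) -> Record:
    """Translations: curated values win, gaps are filled from linked sources.

    `linked_sources` stands for the records reachable via schema:sameAs.
    """
    merged = dict(curated)
    for source in linked_sources:
        for key, value in source.items():
            # Only fill a field the correlation list left empty.
            if not merged.get(key) and value:
                merged[key] = value
    return merged


def merge_exclusive(curated: Record, linked_sources: List[Record]) -> Record:
    """Contributors: only the values from the correlation list are used."""
    return dict(curated)


curated = {"title": "De titel", "isbn": None}
kbr_record = {"title": "Other title", "isbn": "978-0-00-000000-2", "publisher": "Example"}

inclusive = merge_inclusive(curated, [kbr_record])
exclusive = merge_exclusive(curated, [kbr_record])
```

In the inclusive case the curated title is kept while the ISBN and publisher come from the linked record; in the exclusive case the linked record is ignored entirely.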

SvenLieber commented 10 months ago

The software integration test added with the commit above verifies that the translation correlation list approach works if data is correctly added.

However, in the most recent data integration run the correlation list entries were added to the target graph, but the automatic data integration did not take them into account.

Some of the data is corrupted, as seen in the screenshot below: the KBR identifier is used for all bf:identifiedBy relationships, and the rdf:value of the linked entity also holds the KBR identifier for BnF, KB and Unesco.

image

This could explain why the automatic integration still added local records even though the correlation list entries had been added: there was no link, and hence it could not be detected that a record from the correlation list already exists.

There are several issues:

https://github.com/kbrbe/beltrans-data-integration/blob/a42daf6594a68106fc8a93e4b786ba41ec1010a2/data-integration/integrate-data.sh#L2286-L2294

The last 4 rows of this snippet each extract the column targetKBRIdentifier, but they should also extract targetBnFIdentifier, targetKBIdentifier and targetUnescoIdentifier. Hence the files that should contain the related BnF, KB and Unesco identifiers all contain the KBR identifier.
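One way to avoid this kind of copy-paste error is to drive the extraction from a single source-to-column mapping. The sketch below is a Python illustration and not the code in integrate-data.sh: the four identifier column names come from the description above, while the row key column `targetIdentifier`, the function name and the sample data are assumptions.

```python
import csv
import io

# Column names as mentioned in this issue; one entry per data source.
SOURCE_COLUMNS = {
    "KBR": "targetKBRIdentifier",
    "BnF": "targetBnFIdentifier",
    "KB": "targetKBIdentifier",
    "Unesco": "targetUnescoIdentifier",
}


def extract_identifiers(csv_text, id_column="targetIdentifier"):
    """Return {source: [(row id, source identifier), ...]} for non-empty cells.

    `id_column` is a hypothetical name for the column holding the row key.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    result = {source: [] for source in SOURCE_COLUMNS}
    for row in reader:
        for source, column in SOURCE_COLUMNS.items():
            # Each source reads its own column, never the KBR one by accident.
            value = (row.get(column) or "").strip()
            if value:
                result[source].append((row[id_column], value))
    return result


sample = (
    "targetIdentifier,targetKBRIdentifier,targetBnFIdentifier,"
    "targetKBIdentifier,targetUnescoIdentifier\n"
    "t1,kbr-1,bnf-1,,unesco-1\n"
)
identifiers = extract_identifiers(sample)
```

Because every source is looked up through the same mapping, a wrong column name shows up in one place instead of being repeated across near-identical lines.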

Additionally, the translation correlation list contained some columns multiple times. For example, the column targetKBIdentifier existed once with values and once without values; the latter was taken for the mapping, and hence there were no KB identifiers.
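Duplicate column names like this can be caught before the mapping runs. The following is a minimal sketch of such a check, assuming the correlation list is a CSV file with a header row; the function name and the sample header are hypothetical.

```python
import csv
import io
from collections import Counter


def find_duplicate_columns(csv_text):
    """Return the sorted list of header names that occur more than once."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty input yields an empty header
    counts = Counter(name.strip() for name in header)
    return sorted(name for name, count in counts.items() if count > 1)


sample = (
    "targetIdentifier,targetKBIdentifier,targetKBRIdentifier,targetKBIdentifier\n"
    "t1,kb-1,kbr-1,\n"
)
duplicates = find_duplicate_columns(sample)
```

Failing the integration early when this list is non-empty would make the problem visible instead of silently mapping the empty column.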

SvenLieber commented 10 months ago

There are still two issues: