kbrbe / beltrans-data-integration

Creating a FAIR Linked Data corpus for the BELTRANS research project about Belgian book translations NL-FR and FR-NL between 1970 and 2020
https://www.kbr.be/en/projects/beltrans/
MIT License
5 stars 0 forks

The column targetYearOfPublication no longer shows conflicting date strings such as "2013 or 2014" #255

Closed SvenLieber closed 3 months ago

SvenLieber commented 5 months ago

We integrate data about translations from different sources. In certain cases, translations from different sources that are identified as the same via a common ISBN have conflicting dates. We query the date from each data source with SPARQL and show it in separate columns (targetYearOfPublicationKBR, targetYearOfPublicationBnF, etc.). A Python postprocessing step then creates a general column, targetYearOfPublication, which contains a string such as "2013 or 2014" if there is a date mismatch.
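A minimal sketch of that consolidation step, to make the intended behaviour concrete. The function name `consolidate_years` and the dictionary-based row are hypothetical; the actual postprocessing script in the repository may be structured differently.

```python
from typing import Iterable, Optional


def consolidate_years(years: Iterable[Optional[str]]) -> str:
    """Merge the per-source years of one translation into a single display value."""
    # Drop missing values and surrounding whitespace, and keep distinct years only.
    distinct = sorted({y.strip() for y in years if y and y.strip()})
    # One distinct year yields that year; a mismatch yields e.g. "2013 or 2014".
    return " or ".join(distinct)


# Hypothetical row with one year column per data source.
row = {
    "targetYearOfPublicationKBR": "2013",
    "targetYearOfPublicationBnF": "2014",
}
row["targetYearOfPublication"] = consolidate_years(row.values())
print(row["targetYearOfPublication"])  # 2013 or 2014
```

If all sources agree, the general column simply contains that single year.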

For cases where the translation editions really are the same and one of the dates is simply wrong, we added the possibility to signal the correct year via our translation correlation list (https://github.com/kbrbe/beltrans-data-integration/issues/201). In short: for each data source, the SPARQL query that creates the Excel file takes the year of the integrated RDF record if there is one, and otherwise the year from the individual source.

However, for analytical purposes we also wanted to add the year of publication to all integrated RDF data records, and thus not only to the Excel file via the postprocessing step or the correlation list; see https://github.com/kbrbe/beltrans-data-integration/issues/245

This resulted in double postprocessing: we first add years to the integrated records (including conflicting strings such as "2013 or 2014") and then use those years instead of the individual years. The second postprocessing step is then confronted with invalid date values (containing "or"), and hence no date is shown at all!
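The failure can be reproduced with a small sketch. It assumes that the second step only accepts plain four-digit years and otherwise shows nothing. The helper `pick_year` and that validation rule are hypothetical stand-ins for the SPARQL query and the Python postprocessing, not the code from the repository.

```python
import re
from typing import Optional

# Assumed validation: a valid year is exactly four digits.
YEAR_PATTERN = re.compile(r"\d{4}")


def pick_year(integrated_year: Optional[str], source_year: Optional[str]) -> str:
    """Prefer the year of the integrated record, else the source's own year."""
    chosen = integrated_year if integrated_year else source_year
    # Anything that is not a plain four-digit year is treated as invalid and dropped.
    if chosen and YEAR_PATTERN.fullmatch(chosen.strip()):
        return chosen.strip()
    return ""


# Manually curated year on the integrated record: works as intended.
print(repr(pick_year("2013", "2014")))
# Bulk-added conflict string on the integrated record: the date disappears.
print(repr(pick_year("2013 or 2014", "2014")))
# No integrated year: the individual year is used.
print(repr(pick_year(None, "2014")))
```

The second call returns an empty string even though the source itself has a valid year, which matches the empty column we observed.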

The wrong assumption was that a year in the integrated data record must come from the manual curation list. Unfortunately, years were also added in bulk to all records by the new integration step that adds years of publication to all integrated records.

To fix this, we have to adapt the postprocessing step to ignore the dates, as they are already processed and consolidated during integration.
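One possible reading of this fix, as a sketch only (the actual change is in the commit linked below): the postprocessing passes an already consolidated targetYearOfPublication through untouched. The function `postprocess_row`, the column tuple, and the fallback for rows without an integrated value are assumptions for illustration.

```python
from typing import Dict

# Hypothetical subset of the per-source year columns.
SOURCE_COLUMNS = ("targetYearOfPublicationKBR", "targetYearOfPublicationBnF")


def postprocess_row(row: Dict[str, str]) -> Dict[str, str]:
    """Leave consolidated years alone instead of processing them a second time."""
    result = dict(row)
    if result.get("targetYearOfPublication"):
        # Already consolidated during integration: no parsing, no validation.
        return result
    # Assumed fallback: build the value from the per-source columns.
    distinct = sorted(
        {result[c].strip() for c in SOURCE_COLUMNS if result.get(c, "").strip()}
    )
    result["targetYearOfPublication"] = " or ".join(distinct)
    return result


integrated = {
    "targetYearOfPublication": "2013 or 2014",
    "targetYearOfPublicationKBR": "2013",
    "targetYearOfPublicationBnF": "2014",
}
print(postprocess_row(integrated)["targetYearOfPublication"])  # 2013 or 2014
```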

SvenLieber commented 3 months ago

fixed in https://github.com/kbrbe/beltrans-data-integration/commit/a1a79e799b07a2e0566ec9f268d1eab5810a2c61