kbrbe / beltrans-data-integration

Creating a FAIR Linked Data corpus for the BELTRANS research project about Belgian book translations NL-FR and FR-NL between 1970 and 2020
https://www.kbr.be/en/projects/beltrans/
MIT License

Systematically identify original works #129

Open SvenLieber opened 2 years ago

SvenLieber commented 2 years ago

Different data sources indicate the original title of a translation as text: for example, KBR uses the MARC field 765$t and KB uses the RDF property path schema:translationOfWork/rdfs:label. However, to analyse the translation context we need structured data about the original, not just its title, so we want the catalogue identifier of the original. (Please note that an explicit link to the source identifier is sometimes provided, but this is rare.)
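To illustrate the starting point, here is a minimal sketch of reading the original title from 765$t, assuming the KBR export is available as MARCXML (MARC21 slim namespace). The sample record, its identifier and the function name `original_titles` are made up for illustration and are not taken from the project scripts.

```python
import xml.etree.ElementTree as ET

# Standard MARC21 slim namespace used by MARCXML
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hypothetical record for illustration only (identifier is invented)
SAMPLE_RECORD = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">12345</controlfield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Le chagrin des Belges</subfield>
  </datafield>
  <datafield tag="765" ind1="0" ind2=" ">
    <subfield code="t">Het verdriet van België</subfield>
  </datafield>
</record>"""


def original_titles(record_xml):
    """Return all 765$t values (original titles as plain text) of one record."""
    record = ET.fromstring(record_xml)
    path = "marc:datafield[@tag='765']/marc:subfield[@code='t']"
    return [sub.text.strip() for sub in record.findall(path, MARC_NS) if sub.text]


if __name__ == "__main__":
    print(original_titles(SAMPLE_RECORD))
```

The result is only a title string, which is exactly the limitation described above: it still has to be resolved to a catalogue identifier.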

We have already started identifying source titles using different techniques, but our pipeline currently lacks a systematic step. For instance, the script find-originals.py by our intern Fabrizio Pascucci could be applied iteratively to different input data.

This issue will be updated with possible tasks to systematically integrate the identification of original works.

KBR

For KBR data we can export the metadata of all books by the corpus authors and perform a title comparison. In the ideal case only a single match is found. It is also possible that several candidates are found: same title but different KBR IDs. If the publication year of a candidate original is later than that of the translation (because it is a reprint of the original), we can filter it out.

However, if there are several matches to different editions, the available metadata offers no safe automatic way to determine which edition the translation is based on. This would need to be encoded in the translation record, but for the records considered the translation only mentions the title of the original.
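The KBR steps described above could be sketched roughly as follows. This is not the logic of find-originals.py; the `Book` class, the normalisation and the category names are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Book:
    kbr_id: str  # hypothetical identifiers in the examples below
    title: str
    year: int


def normalize(title):
    """Case-insensitive comparison with collapsed whitespace (an assumed normalisation)."""
    return " ".join(title.casefold().split())


def find_candidates(original_title, translation_year, author_books):
    """Books with the same title that were not published after the translation."""
    wanted = normalize(original_title)
    return [
        book
        for book in author_books
        if normalize(book.title) == wanted and book.year <= translation_year
    ]


def classify(candidates):
    """Single match is the ideal case; several editions cannot be resolved automatically."""
    if not candidates:
        return "no-match"
    if len(candidates) == 1:
        return "single-match"
    return "ambiguous"


if __name__ == "__main__":
    books = [
        Book("kbr-1", "Het verdriet van België", 1983),
        Book("kbr-2", "Het verdriet van België", 1999),  # later reprint
    ]
    found = find_candidates("Het verdriet van België", 1985, books)
    print(classify(found), [b.kbr_id for b in found])
```

The year filter only removes reprints published after the translation. If two editions both predate the translation, the result stays "ambiguous" and would need manual review.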

BnF

KB

SvenLieber commented 2 years ago

Status translation-original matching KBR

The original version of the matching script created a lookup table with the original title as key and the KBR ID as value. However, several books (and thus several identifiers) can share the same title, so the script was adapted to list all possible identifiers. Additionally, the output was split into several files to take different levels of matching into account:

UPDATE: when applying a filter based on publication year, i.e. filtering out candidates whose publication year is later than the year of the translation
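A minimal sketch of such a lookup table with several identifiers per title, and of splitting the output by match level. The level names, file names and column headers here are hypothetical, as the actual levels used by the script are not listed in this thread.

```python
import csv
from collections import defaultdict
from pathlib import Path


def build_lookup(books):
    """books: iterable of (kbr_id, title). Maps each title to ALL matching KBR IDs."""
    lookup = defaultdict(list)
    for kbr_id, title in books:
        lookup[title.strip().casefold()].append(kbr_id)
    return lookup


def split_by_level(translations, lookup):
    """translations: iterable of (translation_id, original_title)."""
    levels = {"single-match": [], "multiple-matches": [], "no-match": []}
    for translation_id, original_title in translations:
        ids = lookup.get(original_title.strip().casefold(), [])
        if len(ids) == 1:
            level = "single-match"
        elif ids:
            level = "multiple-matches"
        else:
            level = "no-match"
        levels[level].append((translation_id, original_title, ";".join(ids)))
    return levels


def write_levels(levels, out_dir):
    """Write one CSV file per match level (file names are an assumption)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for level, rows in levels.items():
        with open(out_dir / f"{level}.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["translationID", "originalTitle", "candidateKBRIDs"])
            writer.writerows(rows)
```

Keeping a list per title instead of a single value avoids silently overwriting one identifier with another when two books share a title.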

SvenLieber commented 2 years ago

Status translation-original matching KBR

After processing single exact title matches and title matches with a similarity greater than 0.9 (adding schema:translationOfWork links), we have the following changes:

image
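For reference, the similarity-based step could look roughly like the sketch below. The similarity measure actually used by the script is not stated in this thread; `difflib.SequenceMatcher` from the Python standard library is swapped in here as an assumption. The IRIs and the `https://schema.org/` namespace form are likewise illustrative.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.9  # threshold mentioned in the comment above


def similarity(a, b):
    """Ratio in [0, 1]; 1.0 means identical after case folding."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def match_original(original_title, candidates):
    """candidates: list of (identifier, title).

    Returns the identifier only if exactly one candidate is above the
    threshold; exact matches score 1.0 and are therefore included.
    """
    hits = [
        identifier
        for identifier, title in candidates
        if similarity(original_title, title) > SIMILARITY_THRESHOLD
    ]
    return hits[0] if len(hits) == 1 else None


def translation_link(translation_iri, original_iri):
    """One N-Triples line linking a translation to its original."""
    return f"<{translation_iri}> <https://schema.org/translationOfWork> <{original_iri}> ."


if __name__ == "__main__":
    candidates = [
        ("kbr-1", "Het verdriet van België"),
        ("kbr-2", "De kapellekensbaan"),
    ]
    match = match_original("Het verdriet van Belgie", candidates)
    if match is not None:
        # example.org IRIs are placeholders, not the project's real IRIs
        print(translation_link("http://example.org/t1", f"http://example.org/{match}"))
```

Linking only when exactly one candidate passes the threshold keeps ambiguous cases out of the automatically generated links.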