Systematically identify original works

kbrbe / beltrans-data-integration

Creating a FAIR Linked Data corpus for the BELTRANS research project about Belgian book translations NL-FR and FR-NL between 1970 and 2020

MIT License

Several data sources record the original title of a translation only as text: KBR uses the MARC field 765$t, for example, and KB uses the RDF property path schema:translationOfWork/rdfs:label. However, to analyse the translation context we need structured data about the original, not just its title, so we want the catalogue identifier of the original. (Please note that an explicit link to the source identifier is sometimes provided, but this is rare.)
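To illustrate where the textual original title comes from on the KBR side, here is a minimal sketch that reads 765$t from a MARCXML record. It assumes the data is available as MARCXML in the standard MARC21 slim namespace; the function name `original_titles` and the sample record are made up for illustration and are not taken from the repository.

```python
import xml.etree.ElementTree as ET

# Standard MARCXML ("slim") namespace; assumes the export uses it.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def original_titles(marcxml: str) -> list:
    """Return the text of every 765$t subfield found in a MARCXML record."""
    root = ET.fromstring(marcxml)
    titles = []
    for field in root.iterfind(".//marc:datafield[@tag='765']", MARC_NS):
        for sub in field.iterfind("marc:subfield[@code='t']", MARC_NS):
            if sub.text and sub.text.strip():
                titles.append(sub.text.strip())
    return titles


# Hypothetical record, only for demonstration.
SAMPLE_RECORD = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">example-1</controlfield>
  <datafield tag="765" ind1=" " ind2=" ">
    <subfield code="t">Een voorbeeldtitel</subfield>
  </datafield>
</record>"""

print(original_titles(SAMPLE_RECORD))
```

The result is only a title string, which is exactly the limitation described above: it still has to be resolved to a catalogue identifier.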

We have already started to identify source titles using different techniques, but our pipeline is still missing a systematic step. For instance, the script find-originals.py from our intern Fabrizio Pascucci could be used iteratively on different input data.

This issue will be updated with possible tasks to systematically integrate the identification of original works.

KBR

For KBR data we can export the metadata of all books by the corpus authors and perform a title comparison. In the ideal case only a single match is found. However, several candidates may be found: the same title but different KBR IDs. If the publication year of a candidate original is later than that of the translation (because the candidate is a reprint of the original), we can filter that candidate out.

However, if there are several matches to various editions, there is no safe automatic way to determine, based on the available metadata, which edition the translation is based on. This would need to be encoded in the translation record, but for the records considered the translation only mentions the title of the original.
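The publication-year filter described above could look roughly like the following sketch. The `Candidate` class and `filter_by_year` function are hypothetical names, not the ones used in the pipeline, and keeping candidates with an unknown year is an assumption made here, not something stated in this issue.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    """A possible original: hypothetical structure for illustration."""
    kbr_id: str
    year: Optional[int]  # publication year, None if unknown


def filter_by_year(candidates: List[Candidate],
                   translation_year: Optional[int]) -> List[Candidate]:
    """Drop candidates published after the translation.

    Assumption: candidates without a year are kept, because they
    cannot be ruled out on the basis of the year alone.
    """
    if translation_year is None:
        return list(candidates)
    return [c for c in candidates
            if c.year is None or c.year <= translation_year]


candidates = [
    Candidate("id-a", 1975),
    Candidate("id-b", 1990),  # reprint, newer than the translation
    Candidate("id-c", None),
]
remaining = filter_by_year(candidates, translation_year=1980)
print([c.kbr_id for c in remaining])
```

If more than one candidate remains after the filter, the edition ambiguity described above is still unresolved and the record would need manual review.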

[x] extract books of corpus authors from the KBR catalogue backend
[x] adapt and test the matching script
[x] add pipeline step to identify the record of original works
[x] add pipeline step to ETL relevant translations such that we have their metadata in RDF
[x] automatically filter original matching candidates based on publication year
[ ] add an "originals" sheet to the Excel output containing data profile information about the originals

BnF

[x] ETL BnF source titles which we extracted from the BnF catalogue
[x] add pipeline step to include BnF source titles and links from target to source
[ ] extract books of corpus authors from the BnF data dumps

KB

[ ] extract books of the corpus authors from KB

Status translation-original matching KBR

The original version of the matching script created a lookup table with the original title as key and a KBR ID as value. However, several identifiers can exist for books with the same title, so the script was adapted to report all possible identifiers. Additionally, the output was split into several files to take different levels of matching into account; the counts per level are listed below.

UPDATE: the lists below also show the effect of a filter based on publication year, i.e. discarding candidates whose publication year is after the year of the translation.

NL-FR (comparing 5,410 translations (1,450 with original title but missing original identifier) with 38,596 originals)

474 exact title matches (1 match with 1 KBR identifier)
464 exact title matches (1 match with multiple KBR identifiers)
- of which, with the year filter applied: 140 with a single match, 10 without a match and 177 where the number of candidates is reduced
46 with string similarity (1 match with 1 KBR identifier)
14 with string similarity (1 match with multiple KBR identifiers)
- of which, with the year filter applied: 3 with a single match, 1 without a match and 4 where the number of candidates is reduced
8 with string similarity (multiple matches with at least one with multiple KBR identifiers)

FR-NL (comparing 16,318 translations (8,392 with original title but missing original identifier) with 49,436 originals)

1,251 exact title matches (1 match with 1 KBR identifier)
1,160 exact title matches (1 match with multiple KBR identifiers)
- of which, with the year filter applied: 387 with a single match, 86 without a match and 369 where the number of candidates is reduced
239 with string similarity (1 match with 1 KBR identifier)
102 with string similarity (1 match with multiple KBR identifiers)
- of which, with the year filter applied: 40 with a single match, 10 without a match and 26 where the number of candidates is reduced
15 with string similarity (multiple matches with at least one with multiple KBR identifiers)
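The split between "1 KBR identifier" and "multiple KBR identifiers" for exact title matches can be sketched as a lookup table that maps each title to a set of identifiers. This is not the actual matching script: the function names, the category labels and the normalization (case folding plus whitespace collapsing) are assumptions for illustration.

```python
from collections import defaultdict


def normalize(title: str) -> str:
    """Assumed normalization: case-fold and collapse whitespace."""
    return " ".join(title.casefold().split())


def build_lookup(originals):
    """Map each normalized title to the set of all KBR IDs that carry it."""
    lookup = defaultdict(set)
    for kbr_id, title in originals:
        lookup[normalize(title)].add(kbr_id)
    return lookup


def classify_exact(title: str, lookup):
    """Return a (hypothetical) category label and the sorted candidate IDs."""
    ids = lookup.get(normalize(title), set())
    if not ids:
        return "no-exact-match", []
    if len(ids) == 1:
        return "exact-single-id", sorted(ids)
    return "exact-multiple-ids", sorted(ids)


# Made-up sample data.
originals = [
    ("id-1", "Een voorbeeldtitel"),
    ("id-2", "Een  Voorbeeldtitel"),  # another edition, same title
    ("id-3", "Een andere titel"),
]
lookup = build_lookup(originals)
print(classify_exact("een voorbeeldtitel", lookup))
print(classify_exact("Een andere titel", lookup))
print(classify_exact("Onbekend", lookup))
```

Titles in the "exact-multiple-ids" category are the ones for which the publication-year filter can reduce the number of candidates.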

Status translation-original matching KBR

After processing single exact title matches and title matches with a similarity greater than 0.9 (adding schema:translationOfWork links), we have the following changes:

FR-NL: from 8458 translations (7553 with KBR identifier) we have 189 identified sources (40 more than before)

NL-FR: from 4019 translations (3535 with KBR identifier) we have 212 identified sources (18 more than before)
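For illustration, the similarity threshold of 0.9 mentioned above could be applied as in the sketch below. The similarity measure used by the actual script is not stated in this issue; `difflib.SequenceMatcher` from the Python standard library is swapped in here as an assumption, and `similar_titles` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def similar_titles(title, candidate_titles, threshold=0.9):
    """Return (candidate, score) pairs scoring above the threshold, best first."""
    scored = [(c, similarity(title, c)) for c in candidate_titles]
    matches = [(c, s) for c, s in scored if s > threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Made-up titles: the first differs only by a trailing full stop.
matches = similar_titles(
    "Een voorbeeldtitel",
    ["Een voorbeeldtitel.", "Iets heel anders"],
)
for candidate, score in matches:
    print(f"{candidate}: {score:.3f}")
```

A different measure, such as a normalized edit distance, would give different scores, so the threshold only has meaning relative to the measure that is actually used.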

Systematically identify original works #129
