Open SvenLieber opened 2 years ago
The original version of the matching script created a lookup table with the original title as key and the KBR ID as value. However, there can be several identifiers for a book with the same title; therefore, the script was adapted to show several possible identifiers. Additionally, the output was split into several files to take different levels of matching into account:
UPDATE: when applying a filter based on publication year, i.e. filtering out candidates for which the publication year of the original is after the year of the translation
After processing single exact title matches and title matches with a similarity greater than 0.9 (adding schema:translationOfWork links), we have the following changes:
- Of 8458 translations (7553 with KBR identifier), we have 189 identified sources (40 more than before).
- Of 4019 translations (3535 with KBR identifier), we have 212 identified sources (18 more than before).
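As an illustration of the two matching levels mentioned above (exact title match and similarity above 0.9), here is a minimal Python sketch. The issue does not state which similarity measure the script uses, so the standard library's `difflib.SequenceMatcher` is an assumption here, and the function names `normalize`, `title_similarity` and `classify_match` are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase a title and collapse runs of whitespace (assumed normalization)."""
    return " ".join(title.lower().split())


def title_similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 for two normalized titles."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def classify_match(source_title: str, candidate_title: str, threshold: float = 0.9) -> str:
    """Classify a title pair as 'exact', 'similar' (ratio above threshold) or 'none'."""
    if normalize(source_title) == normalize(candidate_title):
        return "exact"
    if title_similarity(source_title, candidate_title) > threshold:
        return "similar"
    return "none"
```

A call such as `classify_match("Le petit prince", "Le petit prince.")` would fall into the "similar" level, since the two titles differ only by the trailing full stop.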
Different data sources indicate the original title of a translation as text. For example, KBR uses the MARC field 765$t and KB uses the RDF property path schema:translationOfWork/rdfs:label. However, for actual analysis of the translation context we need structured data about the original, not just its title; thus we want the catalogue identifier of the original. (Please note that sometimes an explicit link to the source identifier is provided, but this does not occur very often.)
We have already started to identify source titles using different techniques, but we are currently missing a systematic step in our pipeline. For instance, the script find-originals.py from our intern Fabrizio Pascucci could be used iteratively for different input data.
This issue will be updated with possible tasks to systematically integrate the identification of original works.
KBR
For KBR data we can export the metadata of all books by the corpus authors and perform a title comparison. Sometimes only a single match is found, which is the ideal case. It could also be that several candidates are found: same title but different KBR IDs. If the publication year of one of the found originals is later than that of the translation (because it is a reprint of the original), we can filter that candidate out.
However, if there are several matches to various editions, the available metadata offers no safe automatic way to determine which edition the translation is based on. This would need to be encoded in the translation, but for the considered records the translation only mentions the title of the original.
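The candidate lookup and publication-year filter described above could be sketched roughly as follows. This is not the code of find-originals.py; the record layout (dictionaries with the keys `kbr_id`, `title` and `year`) and all function names are assumptions made for illustration, as is the choice to keep candidates whose year is unknown.

```python
from collections import defaultdict


def normalize(title):
    """Lowercase a title and collapse runs of whitespace (assumed normalization)."""
    return " ".join(title.lower().split())


def build_lookup(records):
    """Map each normalized title to the list of all records sharing that title."""
    lookup = defaultdict(list)
    for record in records:
        lookup[normalize(record["title"])].append(record)
    return lookup


def find_candidates(lookup, original_title, translation_year):
    """Return KBR IDs of same-title records not published after the translation.

    Records without a year are kept, because they cannot be ruled out (assumption).
    """
    candidates = lookup.get(normalize(original_title), [])
    return [
        c["kbr_id"]
        for c in candidates
        if c["year"] is None or c["year"] <= translation_year
    ]


def match_level(candidate_ids):
    """Label the outcome so that results can be written to separate files."""
    if not candidate_ids:
        return "none"
    if len(candidate_ids) == 1:
        return "single"
    return "multiple"
```

With two same-title records from 1947 and 1995 and a translation from 1970, only the 1947 record survives the filter and the result is a "single" match. For a translation from 2000 both remain, which is the "multiple" case that cannot be resolved automatically.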
BnF
KB