Handle different "variants" of BnF identifiers

SvenLieber commented 2 years ago

URIs in the BnF data contain an ARK identifier. We extract that identifier and store it using dcterms:identifier, for example cb11896963c for the Belgian author Hugo Claus. However, other data sources, such as data obtained from the ISNI SRU API, refer to the "regular" BnF identifier, which for the same example is 11896963.

The ARK-based identifier often has an added trailing letter or digit. Currently we only store the ARK-based identifier, so that we can build the BnF URIs using the pattern http://data.bnf.fr/ark:/12148/ + identifier if needed. However, we need to store both identifiers to be able to take links to other data sources into account, such as ISNI SRU data. Therefore we likely need two different attributes, or we need to know how to "convert" between the different variants of identifiers. We should check the documentation.

Two examples from Belgian writers:

Stefan Hertmans

https://data.bnf.fr/en/12075075/stefan_hertmans/ => 12075075
https://catalogue.bnf.fr/ark:/12148/cb120750750 => cb120750750
https://data.bnf.fr/ark:/12148/cb120750750 => cb120750750

Hugo Claus

https://data.bnf.fr/en/11896963/hugo_claus/ => 11896963
https://catalogue.bnf.fr/ark:/12148/cb11896963c => cb11896963c
https://data.bnf.fr/ark:/12148/cb11896963c => cb11896963c
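
To make the two variants concrete, here is a minimal Python sketch that pulls both identifiers out of the URL shapes listed above. The function names and regular expressions are my own for illustration, not part of the project. It assumes the ARK name always starts with "cb" and ends with exactly one extra character, which holds for the two examples here but should be checked against the documentation.

```python
import re

# URI pattern quoted in this issue.
ARK_BASE = "http://data.bnf.fr/ark:/12148/"

# ARK name as it appears in catalogue.bnf.fr and data.bnf.fr URLs.
ARK_RE = re.compile(r"ark:/12148/(cb[0-9bcdfghjkmnpqrstvwxz]+)")
# "Regular" numeric identifier as it appears in data.bnf.fr page URLs.
PAGE_RE = re.compile(r"data\.bnf\.fr/[a-z]{2}/(\d+)/")


def split_bnf_url(url):
    """Return (ark_id, record_id) for a BnF URL; ark_id is None for page URLs."""
    ark = ARK_RE.search(url)
    if ark:
        ark_id = ark.group(1)
        # Assumption: drop the "cb" prefix and the single trailing character.
        return ark_id, ark_id[2:-1]
    page = PAGE_RE.search(url)
    if page:
        return None, page.group(1)
    return None, None


def build_bnf_uri(ark_id):
    """Build the data.bnf.fr URI from a stored ARK-based identifier."""
    return ARK_BASE + ark_id


print(split_bnf_url("https://catalogue.bnf.fr/ark:/12148/cb11896963c"))
print(split_bnf_url("https://data.bnf.fr/en/12075075/stefan_hertmans/"))
```

Going from the ARK-based identifier to the regular one is a simple slice under these assumptions. The other direction needs the extra trailing character, which is the open question of this issue.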

SvenLieber commented 2 years ago

Control character info

According to the (translated) BnF policy the last character is a control character:

The ARK identifiers assigned by the BnF contain a check character which guarantees them against isolated character errors and transposition errors.

See also the general explanation here: https://www.bnf.fr/fr/lidentifiant-ark-archival-resource-key

Control character computation

A (translated) pdf contains the following explanation regarding the control character:

Calculation of the check character is the responsibility of each addressing authority for the ARKs it is able to resolve. It is strongly recommended that each of them implements the calculation of the control character as described below when an ARK of its perimeter is provided to it.

The calculation of the control character relates to the ARK name (unqualified ARK).

Base10 / base29 correspondence table:

xdigit: 0 1 2 3 4 5 6 7 8 9 b  c  d  f  g
value:  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

xdigit: h  j  k  m  n  p  q  r  s  t  v  w  x  z
value:  15 16 17 18 19 20 21 22 23 24 25 26 27 28

Algorithm:

1. Check that the string matches the pattern “[prefix][0-9bcdfghjkmnpqrstvwxz]*”.
2. For each character, multiply its value in base 10 by its position in the string, then sum the products.
3. Calculate the previously obtained value modulo 29. The control character corresponds to this remainder expressed in base 29.

Case 1: the last character of the ARK name provided as input to the addressing authority corresponds to the result of the algorithm applied to the previous characters of the ARK name => move to the “ARK name processing” step.

Case 2: the last character of the ARK name does not correspond to the result of the algorithm => "erroneous request" type error with the explanatory text "erroneous ARK: the ARK you entered ([ARK provided]) does not match a valid ARK, please check its structure".
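
A minimal Python sketch of how I read the algorithm and the two cases above. The function names are hypothetical. Two assumptions are mine and not stated explicitly in the quoted text: the "cb" prefix is part of the string and is counted, and positions start at 1. With those choices the computation reproduces both examples from the first comment.

```python
import re

# Base-29 alphabet from the correspondence table: index in the string == value.
XDIGITS = "0123456789bcdfghjkmnpqrstvwxz"
# Assumption: "cb" is the prefix meant by [prefix] in the quoted pattern.
NAME_RE = re.compile(r"^cb[0-9bcdfghjkmnpqrstvwxz]*$")


def check_character(name):
    """Compute the check character for an ARK name given without it."""
    if not NAME_RE.match(name):
        raise ValueError(f"not a BnF ARK name: {name!r}")
    # Multiply each character's value by its 1-based position and sum up.
    total = sum(XDIGITS.index(ch) * pos for pos, ch in enumerate(name, start=1))
    # The remainder modulo 29, expressed in base 29, is the check character.
    return XDIGITS[total % 29]


def is_valid_ark_name(ark_name):
    """Case 1 returns True, Case 2 (or a malformed name) returns False."""
    if len(ark_name) < 3 or not NAME_RE.match(ark_name):
        return False
    return check_character(ark_name[:-1]) == ark_name[-1]


print(check_character("cb11896963"))  # c  (Hugo Claus)
print(check_character("cb12075075"))  # 0  (Stefan Hertmans)
```

For Hugo Claus the weighted sum is 330 and 330 mod 29 = 11, which maps to "c". For Stefan Hertmans the sum is 232 and 232 mod 29 = 0, which maps to "0".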

Todo: test this computation to ensure I understood it correctly

SvenLieber commented 1 year ago

This issue can be closed; there are two possible solutions:

1. Compute the control character ourselves, using the function from the commit above or from the following library: https://github.com/kbrbe/enrich-authority-csv-via-isni/blob/e0d0a6ac38646697c161d25eec7352d20be8b87e/enrich_authority_csv_via_isni/lib.py#L10-L70
2. Use the public SRU API of BnF with the search key aut.recordid and the BnF identifier without control character as the value. See https://www.bnf.fr/fr/service-sru-catalogue-general-de-la-bnf
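
For the SRU option, a minimal Python sketch that only builds the request URL and sends nothing. The endpoint, the protocol version and the CQL relation "any" are assumptions on my side and should be verified against the SRU documentation linked above. Only the search key aut.recordid comes from this thread, and the function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint of the BnF catalogue SRU service; verify in the documentation.
SRU_ENDPOINT = "https://catalogue.bnf.fr/api/SRU"


def build_sru_url(record_id):
    """Build a searchRetrieve URL for a BnF identifier without control character."""
    params = {
        "version": "1.2",               # assumed SRU version
        "operation": "searchRetrieve",  # standard SRU search operation
        # Search key from this issue; the relation "any" is an assumption.
        "query": f'aut.recordid any "{record_id}"',
    }
    return SRU_ENDPOINT + "?" + urlencode(params)


print(build_sru_url("11896963"))
```

The response should then contain the authority record, from which the full ARK including the control character can be read.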

kbrbe / beltrans-data-integration

Handle different "variants" of BnF identifiers #99
