pkiraly / qa-catalogue

QA catalogue – a metadata quality assessment tool for library catalogue records (MARC, PICA)
GNU General Public License v3.0

Unimarc authority #440

Closed: gegic closed this pull request 6 months ago

gegic commented 6 months ago

This pull request consists mainly of changes related to the integration of UNIMARC authority name analysis.


There is one part that I am not completely sure about and would like to have reviewed more closely:

Due to the nature of the UNIMARC format, there is no equivalent of the Source of heading or term subfield ($2) that the MARC21 analysis relies on. I analyzed the available catalogues and records, as well as some additional records online, such as those on the portals of the Bibliothèque nationale de France and catalogues with modified formats such as COBISS+ from IZUM, Slovenia. In none of the checked records or formats was I able to find anything that indicates a schema source.

Therefore, all fields from the UNIMARC authority analysis are handled as:

Schema unhandledSchema = new Schema(field.getTag(), "$2", "UNSPECIFIED", "UNSPECIFIED");

which in turn renders the authorities-by-schema output quite useless. Some of the fields used do have a $2 Source subfield, but in most cases it is defined as:

An identification in coded form for the relator code schema from which the code in $4 is derived, when the code is not from UNIMARC Relator Codes. Not repeatable.

Any advice on this would be greatly appreciated.

The fields used for the UNIMARC authority analysis (groups suggested by @pkiraly) are:

pkiraly commented 6 months ago

Dear @gegic

I just checked 700, 701, 702, 710, 711 and 712, and they have a $2 that could be used:

        "2": {
          "label": "Source",
          "repeatable": false
        },

In some other fields (5xx) there is something else, which needs investigation:

        "2": {
          "label": "System Code",
          "repeatable": false
        },

In several cases it contains "SIPOR", which is a Portuguese classification/authority dictionary: https://www.bnportugal.gov.pt/index.php?option=com_content&view=article&id=484&Itemid=531&lang=en.

So use $2 as the default setting; we can adjust it later after consulting the UNIMARC experts.

gegic commented 6 months ago

Dear @pkiraly, thank you for the answer :D

I had already checked those points, but I wasn't sure whether to take them into account, for the following reasons:

  1. For the 7-- block, the description of the $2 subfield is: "An identification in coded form for the relator code schema from which the code in $4 is derived, when the code is not from UNIMARC Relator Codes. Not repeatable." So that seems to describe something else.
  2. For the 5-- block, the description is: "An identification in coded form of the system from which the subject heading is derived. This subfield is used only when the 500 field is embedded in a 604 field. For examples see field 604. Not repeatable." This seems to refer to a construction like 604 $1500 $2...., where $1 embeds field 500 into 604.
  3. I also checked the Portuguese catalogue thoroughly (programmatically), and SIPOR is used as a value of subfield $2 only in the 6-- block, never in 5-- or 7--.
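
A check like the one in point 3 can be sketched as a tally of $2 values per field block. This is only an illustration, not the script actually used here: `SourceByBlockSketch` and `countByBlock` are hypothetical names, and the input is assumed to be already reduced to (tag, $2 value) pairs rather than read from a real catalogue.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper, not part of QA catalogue. */
final class SourceByBlockSketch {

    /**
     * Counts $2 values per field block.
     * Each input entry is assumed to be {tag, value of $2}, e.g. {"606", "SIPOR"}.
     * Returns block (e.g. "6--") -> $2 value -> number of occurrences.
     */
    static Map<String, Map<String, Integer>> countByBlock(List<String[]> tagAndSource) {
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (String[] pair : tagAndSource) {
            // The block is identified by the first character of the tag: "606" -> "6--"
            String block = pair[0].charAt(0) + "--";
            counts.computeIfAbsent(block, k -> new TreeMap<>())
                  .merge(pair[1], 1, Integer::sum);
        }
        return counts;
    }
}
```

If a value such as SIPOR shows up only under the "6--" key, that supports the observation that it is not used in the 5-- or 7-- blocks.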

Should I nevertheless use the subfield $2 to parse the scheme?

pkiraly commented 6 months ago

Dear @gegic,

you are right that $2 is not really what we are looking for, but for the time being please use it where it is available (usually it is not).
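
The fallback described here could look roughly like the sketch below. It is a minimal illustration under assumptions: `SimpleField`, `SchemaSourceSketch` and `schemaSource` are hypothetical stand-ins, not the actual QA catalogue classes, and only the "UNSPECIFIED" label is taken from the snippet quoted earlier in this thread.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in for a parsed field; not the QA catalogue field class. */
final class SimpleField {
    final String tag;
    final Map<String, List<String>> subfields; // subfield code -> values

    SimpleField(String tag, Map<String, List<String>> subfields) {
        this.tag = tag;
        this.subfields = subfields;
    }

    /** Returns the first non-blank value of the given subfield, if any. */
    Optional<String> firstValue(String code) {
        List<String> values = subfields.get(code);
        if (values == null) {
            return Optional.empty();
        }
        return values.stream().filter(v -> v != null && !v.isBlank()).findFirst();
    }
}

/** Hypothetical helper showing the "use $2 if available" rule. */
final class SchemaSourceSketch {
    static final String UNSPECIFIED = "UNSPECIFIED";

    /** Uses $2 as the schema source when present, otherwise falls back to UNSPECIFIED. */
    static String schemaSource(SimpleField field) {
        return field.firstValue("2").map(String::trim).orElse(UNSPECIFIED);
    }
}
```

With this rule, a field carrying $2 SIPOR would be grouped under SIPOR, while fields without a usable $2 would still end up under UNSPECIFIED, as they do now.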

gegic commented 6 months ago

Dear @pkiraly,

Thank you for the answers.

I have now added that part, modeled on the code that extracts the source in the MARC21 analyzer. In addition, I refactored a few more things.

I suppose the PR is reviewable now :)