pkiraly / qa-catalogue

QA catalogue – a metadata quality assessment tool for library catalogue records (MARC, PICA)
GNU General Public License v3.0
76 stars 18 forks source link

Unimarc validation #390

Closed gegic closed 6 months ago

gegic commented 6 months ago

This pull request contains changes made to support validation of the Unimarc format files.

Other than that, some changes were made to simplify certain portions of the code, mostly with the intention of reducing the amount of SonarLint warnings.

A couple of comments are left throughout the code in order to mark some questions or confusions that occurred to me while going through the code.

There are numerous refactoring in terms of:

The ValidatorCLI test is currently written only for the file unimarc.mrctxt. There are other files which were provided in a separate email, two of which are also present in this pull request (please note that the firenze.1997.mrc file is not a UNIMARC file but rather something different, most likely MARC21).

I have tried to thoroughly go through all the changes a couple of times, and I am confident that no issues will be introduced with the changes. However, I still cannot guarantee that I didn't introduce any flaws.

EDIT: I forgot to mention the following as well:

The fields in the 5-- RELATED TITLE BLOCK got a makeover in the 2023 version of the UNIMARC manual and they were quite different in the 2008 version. I added the ones that were missing in the 2008 version, as there were no conflicts in that case, but there are some fields which got changed (names, subfields) which makes it difficult to discern which of the fields to include in the schema (probably both versions somehow). I would be thankful if I could get guidance in that regard.

nichtich commented 6 months ago

I only looked at avram-unimarc.json (by the way avram-unimarc-no-remarks.json should be removed) how it conforms to Avram specification. Four things to fix:

  1. What do we do with the // remark comments? Shall I add a comment key to Avram (see issue https://github.com/gbv/avram/issues/49)?
  2. The # is no actual code but a placeholder for the space (0x20) for human readers. The schema should not use # but a plain space instead.
  3. There is a position overlap in subfield 115$b: 09-12 overlaps with position 09-14
  4. Subfield 146$h and 146$i position 00-02 have wrong position value end (should be 2 instead of 1)
gegic commented 6 months ago

Thank you for taking a look. I apologize I haven't noticed the updated version of the avram-js package.

  1. The idea behind the // remark keys in the avram-unimarc.json file is to gradually remove them as each of those gets addressed and discussed, so I'm not sure if it would be necessary to add a separate comment key in this case, given that they're supposed to be temporary.
  2. I can replace the # placeholder with a space, but that still has to be addressed, as different characters are used. I don't know if ' ' should be the only allowed one, but there's a comment regarding that here. Nevertheless, I'll be replacing the # symbol given your instruction.
  3. In 115$b, there's also an overlap between the positions 09-14 and 13-14. That's why I left a remark there with the hope all the remarks would be noticed and addressed. In the older version of UNIMARC manual, all three positions were specified, whereas in the 2023 one, only 09-14 is there. I'm not sure which ones to use, as the separated ones would provide a bit more context, but the combined one is used in the most recent version of the manual.
  4. Changed. The changes will be committed a bit later, as there will likely be some more change requests related to the code.

In addition to this, something that I forgot to mention in the pull request description is the following:

The fields in the 5-- RELATED TITLE BLOCK got a makeover in the 2023 version of the UNIMARC manual and they were quite different in the 2008 version. I added the ones that were missing in the 2008 version, as there were no conflicts in that case, but there are some fields which got changed (names, subfields) which makes it difficult to discern which of the fields to include in the schema (probably both versions somehow). I would be thankful if I could get guidance in that regard.

nichtich commented 6 months ago

Thanks for the quick reply.

I added the ones that were missing in the 2008 version, as there were no conflicts in that case, but there are some fields which got changed (names, subfields) which makes it difficult to discern which of the fields to include in the schema (probably both versions somehow). I would be thankful if I could get guidance in that regard.

What's the benefit of having two conflicting definitions of the same field in one schema? I'd prefer one current schema and additional historical schema(s) but opinions and use cases may differ.

pkiraly commented 6 months ago

1) Resolving a comment might take a lot of time (e.g. we need to ask help from metadata experts, or check additional documents). Commenting the schema element would be quite useful in several cases:

pkiraly commented 6 months ago

@gegic Please remove the current src/test/resources/unimarc/bnr.1993.mrc and src/test/resources/unimarc/firenze.1977.mrc. If you can please select 10 records from each and remove the rest of the records from the files.

gegic commented 6 months ago

In accordance with the suggestions provided by @nichtich, I've now removed the // remark key. After having a closer look at the 2023 version of the manual, some of the comments are resolved. For the rest, I appended a // comment phrase to the corresponding labels. This way, the avram schema is valid and there's no need for introduction of a separate key (at least for now).

I removed the blank # code as indicated, but the issue here is still present. In this case, for the file unimarc.mrctxt it produces an additional invalid value issues such as: invalid code for 'Extent of Applicability': '#' at position 01 in 'i#'. It seems that #, |, ^ are arbitrarily used within different UNIMARC catalogues to indicate a blank value.

The bnr.1993.mrc and firenze.1977.mrc files are now trimmed so that they contain only ten records. However, the issue for firenze.1977.mrc is that it most likely isn't UNIMARC. I haven't taken such a close look, but it is more likely to contain MARC21 records rather than UNIMARC. For instance, this is one of the many issues when validating it as UNIMARC invalid code for 'Target Audience Code': 'De Ruggiero, Guido' position 17-19 out of range. Aside from containing the field structure more akin to MARC21, I also observed that the validation produces much fewer issues when validating it as MARC21 compared to when doing it for UNIMARC.