Unimarc validation - Githubissues

gegic commented 6 months ago

This pull request contains changes made to support validation of the Unimarc format files.

Other than that, some changes were made to simplify certain portions of the code, mostly with the intention of reducing the amount of SonarLint warnings.

A couple of comments are left throughout the code in order to mark some questions or confusions that occurred to me while going through the code.

There are numerous refactoring in terms of:

Introducing new methods in order to reduce complexity of the existing ones;
Introducing if guard statements from which a function is immediately exited in order to reduce nesting
Other minor changes which remove some SonarLint warnings

The ValidatorCLI test is currently written only for the file unimarc.mrctxt. There are other files which were provided in a separate email, two of which are also present in this pull request (please note that the firenze.1997.mrc file is not a UNIMARC file but rather something different, most likely MARC21).

I have tried to thoroughly go through all the changes a couple of times, and I am confident that no issues will be introduced with the changes. However, I still cannot guarantee that I didn't introduce any flaws.

EDIT: I forgot to mention the following as well:

The fields in the 5-- RELATED TITLE BLOCK got a makeover in the 2023 version of the UNIMARC manual and they were quite different in the 2008 version. I added the ones that were missing in the 2008 version, as there were no conflicts in that case, but there are some fields which got changed (names, subfields) which makes it difficult to discern which of the fields to include in the schema (probably both versions somehow). I would be thankful if I could get guidance in that regard.

nichtich commented 6 months ago

I only looked at avram-unimarc.json (by the way avram-unimarc-no-remarks.json should be removed) how it conforms to Avram specification. Four things to fix:

What do we do with the // remark comments? Shall I add a comment key to Avram (see issue https://github.com/gbv/avram/issues/49)?
The # is no actual code but a placeholder for the space (0x20) for human readers. The schema should not use # but a plain space instead.
There is a position overlap in subfield 115$b: 09-12 overlaps with position 09-14
Subfield 146$h and 146$i position 00-02 have wrong position value end (should be 2 instead of 1)

gegic commented 6 months ago

Thank you for taking a look. I apologize I haven't noticed the updated version of the avram-js package.

The idea behind the // remark keys in the avram-unimarc.json file is to gradually remove them as each of those gets addressed and discussed, so I'm not sure if it would be necessary to add a separate comment key in this case, given that they're supposed to be temporary.
I can replace the # placeholder with a space, but that still has to be addressed, as different characters are used. I don't know if ' ' should be the only allowed one, but there's a comment regarding that here. Nevertheless, I'll be replacing the # symbol given your instruction.
In 115$b, there's also an overlap between the positions 09-14 and 13-14. That's why I left a remark there with the hope all the remarks would be noticed and addressed. In the older version of UNIMARC manual, all three positions were specified, whereas in the 2023 one, only 09-14 is there. I'm not sure which ones to use, as the separated ones would provide a bit more context, but the combined one is used in the most recent version of the manual.
Changed. The changes will be committed a bit later, as there will likely be some more change requests related to the code.

In addition to this, something that I forgot to mention in the pull request description is the following:

The fields in the 5-- RELATED TITLE BLOCK got a makeover in the 2023 version of the UNIMARC manual and they were quite different in the 2008 version. I added the ones that were missing in the 2008 version, as there were no conflicts in that case, but there are some fields which got changed (names, subfields) which makes it difficult to discern which of the fields to include in the schema (probably both versions somehow). I would be thankful if I could get guidance in that regard.

nichtich commented 6 months ago

Thanks for the quick reply.

Either the schema is invalid or we have to remove the // remark keys. It depends on the expected duration of "temporary". A comment key may also be helpful for other schema authors. When do you expect to have resolved all comments and will there be additional in the future?
Indicators can only be a-z, 0-9 and space qa-catalogue internally uses space as well. Every other "special" character for blank indicators must be mapped to space as part of record parsing, so please change # to space in the Schema.
Which fields to include (2023 version or 2008 version or both)? There are three possibilities
- a current-only schema (2023 version) to check newly created records and a historical schema (2008 version)
- a current-only schema (2023 version) with historical fields marked as deprecated (except for conflicting fields, those must be removed). We are discussing deprecated elements at #391.
- a schema mixing both versions (2023 plus some fields/subfields/codes from 2008 version) as you did, for validating all existing records

I added the ones that were missing in the 2008 version, as there were no conflicts in that case, but there are some fields which got changed (names, subfields) which makes it difficult to discern which of the fields to include in the schema (probably both versions somehow). I would be thankful if I could get guidance in that regard.

What's the benefit of having two conflicting definitions of the same field in one schema? I'd prefer one current schema and additional historical schema(s) but opinions and use cases may differ.

pkiraly commented 6 months ago

1) Resolving a comment might take a lot of time (e.g. we need to ask help from metadata experts, or check additional documents). Commenting the schema element would be quite useful in several cases:

when the source or sources are not clear, and further investigation are needed to understand the situation
when the Avram schema does not support a given structure A comment would call the developer's attention that in the future more work is needed at that point. So I would vote for creating a comment element in Avram, since JSON itself (unlike XML) does not have a schema independent commenting mechanism. 2) the Avram reader class handles the '#' indicator and control field values, but I agree that it would be better to use spaces, because that would make the schema independent from QA catalogue, so other tools could make use of it.

pkiraly commented 6 months ago

@gegic Please remove the current src/test/resources/unimarc/bnr.1993.mrc and src/test/resources/unimarc/firenze.1977.mrc. If you can please select 10 records from each and remove the rest of the records from the files.

gegic commented 6 months ago

In accordance with the suggestions provided by @nichtich, I've now removed the // remark key. After having a closer look at the 2023 version of the manual, some of the comments are resolved. For the rest, I appended a // comment phrase to the corresponding labels. This way, the avram schema is valid and there's no need for introduction of a separate key (at least for now).

I removed the blank # code as indicated, but the issue here is still present. In this case, for the file unimarc.mrctxt it produces an additional invalid value issues such as: invalid code for 'Extent of Applicability': '#' at position 01 in 'i#'. It seems that #, |, ^ are arbitrarily used within different UNIMARC catalogues to indicate a blank value.

The bnr.1993.mrc and firenze.1977.mrc files are now trimmed so that they contain only ten records. However, the issue for firenze.1977.mrc is that it most likely isn't UNIMARC. I haven't taken such a close look, but it is more likely to contain MARC21 records rather than UNIMARC. For instance, this is one of the many issues when validating it as UNIMARC invalid code for 'Target Audience Code': 'De Ruggiero, Guido' position 17-19 out of range. Aside from containing the field structure more akin to MARC21, I also observed that the validation produces much fewer issues when validating it as MARC21 compared to when doing it for UNIMARC.

pkiraly / qa-catalogue

Unimarc validation #390