pkiraly / qa-catalogue

QA catalogue – a metadata quality assessment tool for library catalogue records (MARC, PICA)
GNU General Public License v3.0

parsing machine readable UNIMARC and creating a schema object #307

Open pkiraly opened 1 year ago

pkiraly commented 1 year ago

The component reads the machine-readable UNIMARC schema and creates a schema object, similar to the existing schema reader that reads the Avram representation of PICA.

Parent: #305

nichtich commented 1 year ago

I'd prefer to transform the UNIMARC schema to Avram and use Avram for both PICA and UNIMARC. I'll start with a transformation.

pkiraly commented 1 year ago

@nichtich Please do not start on that yet. 1) I would like to ask a student to do this as part of a thesis. 2) I am in discussion with a UNIMARC expert, because it seems that this machine-readable version contains information only about subfields, not about fields and indicators, so the process will apparently require manual work as well, i.e. reading UNIMARC's PDF documentation. At the moment all UNIMARC-related tickets are in a preparation state and not yet ready for coding work.

nichtich commented 1 year ago

Ok, I also found out that the machine-readable documentation is incomplete. Here is a shell loop and a jq script (jsonld2avram.jq) that extract Avram-compatible records, but post-processing is still required to merge fields, indicator codes and subfield schedules. The student can reuse, compare or ignore this piece of code.

for n in 0XX 1XX 2XX 3XX 41X 42X 43X 44X 45X 46X 47X 48X 5XX 60X 61X 62X 66X 67X 7XX 801 702 830 850 856 886; do
    curl -s http://iflastandards.info/ns/unimarc/unimarcb/elements/$n.jsonld | jq -f jsonld2avram.jq
done

# jsonld2avram.jq
def remove_nulls: del(..|nulls);
def parse_id: .["@id"]|split("/")[-1];

.["@graph"]
| map(
    select(.status.label=="Published") |    # only published elements
    select(parse_id|.!="")                  # omit element sets
  )
|
.[]
| parse_id as $id
| $id[1:4] as $tag  # field
| $id[4:5] as $ind1 # indicator1
| $id[5:6] as $ind2 # indicator2
| $id[6:7] as $code # subfield code
| {
  $tag,
  indicator1: (if $ind1!="_" then { codes: {($ind1):""} } else null end),
  indicator2: (if $ind2!="_" then { codes: {($ind2):""} } else null end),
  subfields: {
    ($code): ({
      $code,
      label: .label.en,
      description: .description.en[0],
      url: .url
      # .note is ignored as it has no counterpart in Avram
    } | remove_nulls)
  }
} | remove_nulls

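
The post-processing step mentioned above (merging the per-subfield records into one entry per field) could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the function names `merge_records` and `read_json_stream` are made up for this example, the output is assumed to be an object with a top-level `fields` key indexed by tag, and conflicting subfield properties are resolved by keeping the first value seen. None of this comes from the QA catalogue code base.

```python
import json


def read_json_stream(text):
    """Yield each JSON value from concatenated JSON, as printed by jq."""
    decoder = json.JSONDecoder()
    text = text.strip()
    pos = 0
    while pos < len(text):
        value, pos = decoder.raw_decode(text, pos)
        yield value
        # raw_decode does not skip whitespace between values, so do it here
        while pos < len(text) and text[pos].isspace():
            pos += 1


def merge_records(records):
    """Merge records sharing a tag into one field entry (assumed layout)."""
    fields = {}
    for rec in records:
        field = fields.setdefault(rec["tag"], {"tag": rec["tag"]})

        # Collect every indicator code seen for this tag.
        for key in ("indicator1", "indicator2"):
            codes = rec.get(key, {}).get("codes", {})
            if codes:
                field.setdefault(key, {"codes": {}})["codes"].update(codes)

        # Collect subfields; keep the first value seen for each property.
        for code, subfield in rec.get("subfields", {}).items():
            merged = field.setdefault("subfields", {}).setdefault(code, {})
            for name, value in subfield.items():
                merged.setdefault(name, value)

    return {"fields": fields}
```

To try it, the output of the shell loop could be saved to a file (say `records.json`, a hypothetical name) and passed through `merge_records(read_json_stream(open("records.json").read()))`. Field labels, repeatability and indicator descriptions would still have to be added by hand from the PDF documentation.
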
pkiraly commented 1 year ago

Great, thanks a lot!