viaacode / metadata-quality-assessment

Wrapper application for Peter Kiraly's metadata quality API to use meemoo data.
MIT License

Adding --recordAddress option to split XML or JSON files into smaller units (records) #2

Open pkiraly opened 2 years ago

pkiraly commented 2 years ago

@mielvds I have created an XML reader and a new --recordAddress option, which takes an XPath expression that addresses the individual records. I also created a reader and a writer package and moved the relevant classes there.

Here is a usage example.

Input file:

<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns:europeana="http://www.europeana.eu/schemas/ese/"
          xmlns:dcterms="http://purl.org/dc/terms/"
          xmlns="http://www.openarchives.org/OAI/2.0/"
          xmlns:doc="http://www.lyncode.com/xoai"
          xmlns:dc="http://purl.org/dc/elements/1.1/">
<record>
  <dc:format>application/pdf</dc:format>
  <dc:identifier type="providerId">99900556</dc:identifier>
  <dc:identifier type="providerItemId">M.ch.f.91</dc:identifier>
  <dc:identifier type="URN">urn:nbn:de:bvb:20-mchf91-3</dc:identifier>
  <dc:type type="document">Einfache Handschrift</dc:type>
  <dc:date xml:lang="de">1391</dc:date>
  <dc:date xml:lang="de">1410</dc:date>
  <dcterms:created xml:lang="de">1391-1410 (14./15. Jahrhundert)</dcterms:created>
  <dcterms:location resource="http://d-nb.info/gnd/4067037-5">Würzburg</dcterms:location>
  <dc:title>Lectura super quinto libro Decretalium</dc:title>
  ...
</record>
<record>
  <dc:format>application/pdf</dc:format>
  <dc:identifier type="providerId">99900556</dc:identifier>
  <dc:identifier type="providerItemId">I.t.f.CCLXVI</dc:identifier>
  <dc:identifier type="URN">urn:nbn:de:bvb:20-itfcclxvi-3</dc:identifier>
  <dc:type type="document" resource="http://d-nb.info/gnd/4027041-5">Inkunabel</dc:type>
  <dc:date xml:lang="de">1476</dc:date>
  ...
</record>
</metadata>

Command:

./mqa --schema dc-schema.yaml \
      --input sample.xml \
      --recordAddress '//oai:record' \
      --output result.csv \
      --measurements measurements.json \
      --outputFormat csv

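The XPath expression should use namespace-qualified element names, and every prefix it uses should be declared in the namespaces section of the schema. In the sample input the record elements are in the default namespace (http://www.openarchives.org/OAI/2.0/), which the schema binds to the oai prefix, hence //oai:record: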

format: xml
fields:
  ...

namespaces:
  doc: http://www.lyncode.com/xoai
  foaf: http://xmlns.com/foaf/0.1/
  europeana: http://www.europeana.eu/schemas/ese/
  dcterms: http://purl.org/dc/terms/
  dc: http://purl.org/dc/elements/1.1/
  oai: http://www.openarchives.org/OAI/2.0/
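
To see why the prefix mapping matters, here is a minimal sketch in Python. It uses only the standard library's ElementTree, not the tool's own reader, and the function name split_records is hypothetical. ElementTree supports only a subset of XPath and expects a relative path, so the sketch uses .//oai:record where the command line above uses //oai:record. The input is a trimmed copy of the sample file.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample input. The <record> elements carry no prefix,
# but they belong to the default OAI namespace declared on <metadata>.
SAMPLE = """<metadata xmlns="http://www.openarchives.org/OAI/2.0/"
          xmlns:dc="http://purl.org/dc/elements/1.1/">
<record>
  <dc:identifier type="providerItemId">M.ch.f.91</dc:identifier>
  <dc:title>Lectura super quinto libro Decretalium</dc:title>
</record>
<record>
  <dc:identifier type="providerItemId">I.t.f.CCLXVI</dc:identifier>
</record>
</metadata>
"""

# Same prefix-to-URI bindings as the namespaces section of the schema
# (only the two prefixes this sketch needs).
NAMESPACES = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def split_records(xml_text, record_address, namespaces):
    """Return the elements selected by record_address (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return root.findall(record_address, namespaces)


# The qualified path selects both records.
records = split_records(SAMPLE, ".//oai:record", NAMESPACES)

# An unqualified path selects nothing, because no <record> element exists
# outside a namespace in this document.
unqualified = split_records(SAMPLE, ".//record", NAMESPACES)

item_ids = [
    record.find("dc:identifier[@type='providerItemId']", NAMESPACES).text
    for record in records
]
print(len(records), len(unqualified), item_ids)
```

The prefix itself is arbitrary: what has to match is the URI it is bound to, which is why the schema can call the input's default namespace oai.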

mielvds commented 2 years ago

very nice!