qurator-spk / mods4pandas

Extract the MODS/ALTO metadata of a bunch of METS/ALTO files into pandas DataFrames for data analysis
Apache License 2.0

Missing subject/topic, genre #24

Open mikegerber opened 1 year ago

mikegerber commented 1 year ago

From the feedback by Maria Federbusch:

PPN813655765: In addition, the subject headings are still missing in this example (MONO), here from the ARMA project:

<mods:subject authority="getty">
<mods:topic valueURI="http://vocab.getty.edu/aat/300411614">reading culture</mods:topic>
<mods:topic valueURI="http://vocab.getty.edu/aat/300020756">Medieval (European)</mods:topic>
</mods:subject>
<mods:subject authority="wikidata">
<mods:topic valueURI="https://www.wikidata.org/wiki/Q107274053">Reading culture (medieval)</mods:topic>
</mods:subject>
<mods:genre valueURI="https://www.wikidata.org/wiki/Q1261026" type="class" authority="wikidata">
<mods:genre>printed matter</mods:genre>
</mods:genre>

mikegerber commented 1 year ago

The two things I am thinking about here are:

  1. How to encode both the IDs and the textual representation. I am considering two arrays (not sets).
  2. Whether the topics/genres are ordered.

If we use two arrays (nested CSV, basically), we also don't need to worry about whether there is an order, because we just keep the order from the file.
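
The two-array idea could look roughly like the sketch below. It is only an illustration, not the mods4pandas implementation: the function name `topics_as_parallel_lists` and the wrapping `mods:mods` element in the sample are assumptions, and it uses the standard library's `xml.etree.ElementTree` instead of whatever parser the project uses. The point is that both lists are filled in a single pass in document order, so position `i` in one list always corresponds to position `i` in the other.

```python
import xml.etree.ElementTree as ET

# Standard MODS v3 namespace.
MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Sample built from the snippet in this issue; the mods:mods wrapper is assumed.
SAMPLE = """\
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:subject authority="getty">
    <mods:topic valueURI="http://vocab.getty.edu/aat/300411614">reading culture</mods:topic>
    <mods:topic valueURI="http://vocab.getty.edu/aat/300020756">Medieval (European)</mods:topic>
  </mods:subject>
  <mods:subject authority="wikidata">
    <mods:topic valueURI="https://www.wikidata.org/wiki/Q107274053">Reading culture (medieval)</mods:topic>
  </mods:subject>
</mods:mods>
"""


def topics_as_parallel_lists(mods_root):
    """Return (uris, texts) for all mods:subject/mods:topic, in file order.

    Hypothetical helper: both lists are appended to in the same loop,
    so they always have the same length and the same ordering.
    """
    uris, texts = [], []
    for topic in mods_root.iterfind("mods:subject/mods:topic", MODS_NS):
        uris.append(topic.get("valueURI"))  # None if the attribute is missing
        texts.append((topic.text or "").strip())
    return uris, texts


uris, texts = topics_as_parallel_lists(ET.fromstring(SAMPLE))
for uri, text in zip(uris, texts):
    print(uri, text, sep="\t")
```

Keeping lists rather than sets preserves duplicates and the file order, so the question of whether the order is meaningful can be left to whoever analyses the data.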

mikegerber commented 1 year ago

I'm relatively sure that we're going to translate this twice: once for the IDs and once for the text, in the same order.

For mods:genre, I asked Maria again, because the example above looks incorrect: it should not contain the second (nested) mods:genre:

<mods:genre valueURI="https://www.wikidata.org/wiki/Q1261026" type="class" authority="wikidata">
printed matter
</mods:genre>

Section 2.3.1 of the DFG MODS application profile (DFG MODS Anwendungsprofil) also matches my corrected version. I haven't checked the XML Schema yet.

mikegerber commented 1 year ago

The nested mods:genre is definitely incorrect, but I'm considering correcting for it in the code for as long as the source files have this flaw.
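
One way such a correction might be sketched, assuming the only flaw is the extra inner mods:genre: collect all text below each top-level mods:genre, so the nested and the corrected form yield the same value. The names `genre_text` and `genres_as_parallel_lists` are hypothetical, the `mods:mods` wrapper in the samples is assumed, and this uses `xml.etree.ElementTree` from the standard library rather than the project's own parsing code.

```python
import xml.etree.ElementTree as ET

# Standard MODS v3 namespace.
MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

GENRE_ATTRS = (
    'valueURI="https://www.wikidata.org/wiki/Q1261026" '
    'type="class" authority="wikidata"'
)

# Flawed form as found in the source files (inner mods:genre).
NESTED = f"""\
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
<mods:genre {GENRE_ATTRS}>
<mods:genre>printed matter</mods:genre>
</mods:genre>
</mods:mods>
"""

# Corrected form from this issue.
FLAT = f"""\
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
<mods:genre {GENRE_ATTRS}>
printed matter
</mods:genre>
</mods:mods>
"""


def genre_text(genre):
    """Text of a mods:genre, including text of any nested child elements.

    itertext() walks the element and its descendants in document order;
    split/join collapses the surrounding whitespace and newlines.
    """
    return " ".join("".join(genre.itertext()).split())


def genres_as_parallel_lists(mods_root):
    """Return (uris, texts) for the top-level mods:genre elements only."""
    uris, texts = [], []
    # "mods:genre" matches direct children, so an inner mods:genre
    # is not counted as a second genre.
    for genre in mods_root.iterfind("mods:genre", MODS_NS):
        uris.append(genre.get("valueURI"))
        texts.append(genre_text(genre))
    return uris, texts


nested_result = genres_as_parallel_lists(ET.fromstring(NESTED))
flat_result = genres_as_parallel_lists(ET.fromstring(FLAT))
print(nested_result == flat_result)
```

With this approach no special case is needed once the source files are fixed, since the corrected form passes through the same code path.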