qurator-spk / mods4pandas

Extract the MODS/ALTO metadata of a bunch of METS/ALTO files into pandas DataFrames for data analysis
Apache License 2.0

Add missing information for "original" PPNs #21

Closed BibWiss closed 1 year ago

BibWiss commented 1 year ago

So far, only the content of <mods:identifier type="PPNanalog"> seems to be included in the output, but not the equivalent information from:

<mods:relatedItem type="original">
    <mods:recordInfo>
        <mods:recordIdentifier source="gbv-ppn">PPNxyz</mods:recordIdentifier>
    </mods:recordInfo>
</mods:relatedItem>

mikegerber commented 1 year ago

Yeah I believe I left that out because that looked like MODS in MODS :) Have to look at more data to see if that's actually true.

mikegerber commented 1 year ago

@BibWiss We should also probably ask our colleagues if one field is going to replace the other in the future, because these might be redundant. <mods:relatedItem type="original"> also looks "more modern" than <mods:identifier type="PPNanalog">.

BibWiss commented 1 year ago

I agree, the usage seems to be inconsistent and we should investigate it further. Ideally, these two fields would be merged into one.

mikegerber commented 1 year ago

Questions collected:

BibWiss commented 1 year ago

Some examples are:

E.g. within sbb-mets-PPN1041138024.xml, the corresponding identifier to its physical edition ("StaBiKat (ppn original): 310512263") is contained as:

<mods:relatedItem type="original">
    <mods:recordInfo>
        <mods:recordIdentifier source="gbv-ppn">PPN310512263</mods:recordIdentifier>
    </mods:recordInfo>
</mods:relatedItem>
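
For reference, a minimal sketch of how this identifier could be read with Python's standard library. This is not how mods4pandas does it internally; the function name `original_ppn` and the inline `SAMPLE` document are illustrative assumptions, and only the standard MODS v3 namespace is taken as given.

```python
import xml.etree.ElementTree as ET

# Standard MODS v3 namespace.
NS = {"mods": "http://www.loc.gov/mods/v3"}

# Hypothetical, stripped-down sample built around the snippet above;
# a real METS file wraps the MODS record in further elements.
SAMPLE = """\
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:relatedItem type="original">
    <mods:recordInfo>
      <mods:recordIdentifier source="gbv-ppn">PPN310512263</mods:recordIdentifier>
    </mods:recordInfo>
  </mods:relatedItem>
</mods:mods>
"""


def original_ppn(root, source="gbv-ppn"):
    """Return the record identifier of the type="original" relatedItem, or None."""
    path = (
        ".//mods:relatedItem[@type='original']"
        "/mods:recordInfo"
        f"/mods:recordIdentifier[@source='{source}']"
    )
    elem = root.find(path, NS)
    if elem is None or not elem.text:
        return None
    return elem.text.strip()


root = ET.fromstring(SAMPLE)
print(original_ppn(root))
```

Passing a different `source` value (e.g. "dnb-ppn") makes the same lookup usable for records that carry another identifier source.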

mikegerber commented 1 year ago

The newest mods4pandas now exports this info in new columns, e.g. relatedItem-original_recordInfo_recordIdentifier. For my test data:

PPN1678618276             None
PPN1727545451     PPN167755803
PPN1737752050             None
PPN1769395962    PPN1769388664
PPN3348760607             None
PPN773555676      PPN537331794
Name: relatedItem-original_recordInfo_recordIdentifier, dtype: object
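
A rough sketch of how the new column could be filtered in pandas. The DataFrame here is rebuilt by hand from the output above, on the assumption that the index holds the PPN of the digitized item; in practice you would load whatever mods4pandas wrote instead.

```python
import pandas as pd

COL = "relatedItem-original_recordInfo_recordIdentifier"

# Hand-built stand-in for the mods4pandas output shown above
# (assumption: the index is the PPN of the digitized item).
df = pd.DataFrame(
    {COL: [None, "PPN167755803", None, "PPN1769388664", None, "PPN537331794"]},
    index=[
        "PPN1678618276",
        "PPN1727545451",
        "PPN1737752050",
        "PPN1769395962",
        "PPN3348760607",
        "PPN773555676",
    ],
)

# Keep only the records that actually name an original PPN.
with_original = df[df[COL].notna()]
print(with_original[COL])
print(f"{len(with_original)} of {len(df)} records name an original PPN")
```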

mikegerber commented 1 year ago

See also #22: there are a number of items which have dnb-ppn (and we don't handle them correctly currently).