slub / mets-mods2tei

Convert bibliographic meta data in MODS format to TEI headers
Apache License 2.0

support mets:div/mets:fptr reference besides mets:structLink, or full ENMAP #65

Open stefanCCS opened 2 years ago

stefanCCS commented 2 years ago

It looks like the METS parser does not handle structures like this in METS:

                  <div ID="DIVL5" TYPE="TITLE_OF_WORK">
                     <fptr>
                        <area BETYPE="IDREF" FILEID="ALTO00001" BEGIN="P1_TB00002"/>
                     </fptr>

If I call mm2tei with this kind of METS I get an exception:

Traceback (most recent call last):
  File "/home/calamariadmin/tei_venv_3.7/bin/mm2tei", line 8, in <module>
    sys.exit(cli())
  File "/home/calamariadmin/tei_venv_3.7/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/calamariadmin/tei_venv_3.7/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/calamariadmin/tei_venv_3.7/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/calamariadmin/tei_venv_3.7/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/calamariadmin/tei_venv_3.7/lib/python3.7/site-packages/mets_mods2tei/scripts/mets_mods2tei.py", line 56, in cli
    tei.fill_from_mets(mets, ocr)
  File "/home/calamariadmin/tei_venv_3.7/lib/python3.7/site-packages/mets_mods2tei/api/tei.py", line 175, in fill_from_mets
    self.add_div_structure(div)
  File "/home/calamariadmin/tei_venv_3.7/lib/python3.7/site-packages/mets_mods2tei/api/tei.py", line 831, in add_div_structure
    div = div.get_div()[0]
IndexError: list index out of range

As a starting point, simply ignoring <fptr><area> inside a <div> would be good. In general, it would be even better if the OCR text were taken from the ALTO element referenced there.
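The "ignore" strategy could look roughly like the sketch below. Only `get_div()` is taken from the traceback; `FakeDiv` and `first_child_div` are hypothetical names used for illustration and are not part of mets_mods2tei.

```python
class FakeDiv:
    """Hypothetical stand-in for the parser's div object.

    Only get_div() mirrors the call seen in the traceback.
    """

    def __init__(self, children=None):
        self._children = list(children or [])

    def get_div(self):
        # Returns the nested divs; empty for a div that only holds fptr/area.
        return self._children


def first_child_div(div):
    """Return the first nested div, or None if there is none.

    This replaces the unguarded div.get_div()[0], which raises
    IndexError on a div without nested divs.
    """
    children = div.get_div()
    return children[0] if children else None


leaf = FakeDiv()            # e.g. a div containing only <fptr><area .../></fptr>
parent = FakeDiv([leaf])

print(first_child_div(parent) is leaf)
print(first_child_div(leaf))
```

A caller would then skip the descent when `None` comes back instead of crashing.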

bertsky commented 2 years ago

> div = div.get_div()[0]
> IndexError: list index out of range

That's actually #64 (fixing it would amount to the ignore strategy). The additional issue here is indeed that mets:fptr/mets:area references should be supported, not just mets:structLink matches for the mets:div/@ID.
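For illustration, resolving such references could be sketched with the standard library alone. This is not how mets_mods2tei works internally. The sample documents, the ALTO v3 namespace, and the helper names `area_refs` and `block_text` are assumptions, and the step that maps FILEID to an actual ALTO file via mets:fileSec is left out.

```python
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/"}
# Assumption: the ALTO files use the v3 namespace.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v3#"}

# Made-up sample modelled on the snippet in the issue.
METS_SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="LOGICAL">
    <mets:div ID="DIVL5" TYPE="TITLE_OF_WORK">
      <mets:fptr>
        <mets:area BETYPE="IDREF" FILEID="ALTO00001" BEGIN="P1_TB00002"/>
      </mets:fptr>
    </mets:div>
  </mets:structMap>
</mets:mets>"""

# Made-up ALTO page containing the referenced text block.
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page ID="P1"><PrintSpace>
    <TextBlock ID="P1_TB00002">
      <TextLine><String CONTENT="Example"/><SP/><String CONTENT="title"/></TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def area_refs(mets_root):
    """Map each div ID to the (FILEID, BEGIN) pairs of its fptr/area children."""
    refs = {}
    for div in mets_root.iterfind(".//mets:div", METS_NS):
        areas = div.findall("mets:fptr/mets:area", METS_NS)
        if areas:
            refs[div.get("ID")] = [(a.get("FILEID"), a.get("BEGIN")) for a in areas]
    return refs


def block_text(alto_root, block_id):
    """Return the text of the TextBlock with the given ID, or None if absent."""
    for block in alto_root.iterfind(".//alto:TextBlock", ALTO_NS):
        if block.get("ID") == block_id:
            lines = []
            for line in block.iterfind("alto:TextLine", ALTO_NS):
                words = [s.get("CONTENT", "") for s in line.iterfind("alto:String", ALTO_NS)]
                lines.append(" ".join(words))
            return "\n".join(lines)
    return None


refs = area_refs(ET.fromstring(METS_SAMPLE))
file_id, begin = refs["DIVL5"][0]
print(file_id, begin, repr(block_text(ET.fromstring(ALTO_SAMPLE), begin)))
```

Only the BETYPE="IDREF" case from the example is covered here; other BETYPE values would need their own handling.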

Ideally, the full ENMAP profile would be supported.

bertsky commented 2 years ago

Further reference: ENMAP examples