Open bertsky opened 2 years ago
Hi @bertsky, we also currently have the problem that ENMAP is not covered, so mm2tei does not work on those files. Do you know by any chance if there is a workaround that can be used with METS files like these?
@t-mayer I have been digging a little deeper, and found that ENMAP support would require a lot more besides flexible `mets:div` parsing (to identify the top content element encompassing all sections/paragraphs/...) in the `mets:structMap`:

- the types `content_section` and (any recursive number of) `content_unit` above the other types (`article`, `table` etc.);
- further types besides `content_unit` in use, too (e.g. `caption`, `advertisement`, `body`, `body_content`, `heading`, `title` ...);
- fileGrps are currently identified by `@USE` (in a predefined set of names), while ENMAP requires recursive fileGrps named by `@ID` in a certain way;
- the `mets:structLink` mechanism is currently used to map (ranges of) whole pages to divs, but ENMAP uses `mets:area` (with `@COORDS`, which would be hard to match with fulltext files, or `@BEGIN`, which references the ALTO elements directly);
- `mets:area/@FILEID` references to the `mets:file` entries instead of direct `mets:fptr/@FILEID` in the physical structMap.

There are probably more challenges, but this is already a lot of work. So I'm afraid there is no simple workaround. Sorry, I cannot promise any progress on this matter ATM. But PRs are always welcome of course!
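To make the file-reference difference from the list above a bit more concrete, here is a minimal sketch using only Python's standard library. It is not code from mets-mods2tei: the helper name `file_refs` and the toy METS snippet (including all IDs in it) are made up for illustration. It only assumes what the METS schema itself allows, namely that a `mets:fptr` either carries `@FILEID` directly or wraps one or more `mets:area` elements (possibly nested in `mets:seq`/`mets:par`) carrying `@FILEID`, `@BEGIN` and `@COORDS`.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
NS = {"mets": METS_NS}

# Toy input, invented for this sketch (not taken from a real ENMAP file).
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="PHYSICAL">
    <mets:div ID="phys1" TYPE="page">
      <mets:fptr FILEID="FILE_0001"/>
    </mets:div>
    <mets:div ID="phys2" TYPE="page">
      <mets:fptr>
        <mets:area FILEID="ALTO_0002" BEGIN="P2_TB00001" BETYPE="IDREF"/>
      </mets:fptr>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def file_refs(div):
    """Collect (FILEID, BEGIN, COORDS) for the file pointers directly under a mets:div.

    Handles both a direct mets:fptr/@FILEID and mets:area/@FILEID
    at any depth below the mets:fptr.
    """
    refs = []
    for fptr in div.findall("mets:fptr", NS):
        if fptr.get("FILEID"):
            # direct reference: no sub-file addressing available
            refs.append((fptr.get("FILEID"), None, None))
        for area in fptr.iter("{%s}area" % METS_NS):
            # indirect reference, optionally narrowed by BEGIN or COORDS
            refs.append((area.get("FILEID"), area.get("BEGIN"), area.get("COORDS")))
    return refs


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    for div in root.iterfind(".//mets:div", NS):
        print(div.get("ID"), file_refs(div))
```

This only normalizes the two reference styles into one shape. Resolving `@BEGIN` against the ALTO elements, or `@COORDS` against fulltext, would still be separate work.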
duplicate of #65
self-note: cf. MODS and METS parsing in ULB Halle's digital-flow
We currently rely on the assumption that the `mets:div` content element contains an `@ADMID` (which is mandatory in the METS DFG application profile, but optional in the ENMAP profile): https://github.com/slub/mets-mods2tei/blob/47f5bc283628438673cff5976b5af07b46790437/mets_mods2tei/api/tei.py#L828-L837

Since this is fragile and inflexible, the parser should probably search for `@TYPE` and `@ID` (perhaps cross-checking with `mets:structLink`) instead.
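A rough sketch of what such a lookup could look like, again with the standard library only. This is not the existing parser code: the function name `find_content_div`, the optional `accepted_types` filter and the toy document are assumptions made for illustration. It picks the first `mets:div` (in document order) of the logical structMap that has both `@TYPE` and `@ID`, and, if any `mets:smLink` entries exist, additionally requires the div's `@ID` to appear as an `xlink:from` target.

```python
import xml.etree.ElementTree as ET

NS = {"mets": "http://www.loc.gov/METS/"}
METS_DIV = "{http://www.loc.gov/METS/}div"
XLINK_FROM = "{http://www.w3.org/1999/xlink}from"

# Toy input, invented for this sketch.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                       xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:structMap TYPE="LOGICAL">
    <mets:div TYPE="periodical">
      <mets:div ID="LOG_0001" TYPE="monograph">
        <mets:div ID="LOG_0002" TYPE="article"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
  <mets:structLink>
    <mets:smLink xlink:from="LOG_0001" xlink:to="PHYS_0000"/>
    <mets:smLink xlink:from="LOG_0002" xlink:to="PHYS_0001"/>
  </mets:structLink>
</mets:mets>"""


def find_content_div(root, accepted_types=None):
    """Return the first logical mets:div with @TYPE and @ID, or None.

    accepted_types (hypothetical parameter): optional set of @TYPE values
    to restrict the search to.
    """
    # IDs of logical divs that are linked to physical divs via structLink
    linked_ids = {
        link.get(XLINK_FROM)
        for link in root.iterfind(".//mets:structLink/mets:smLink", NS)
    }
    logical = root.find(".//mets:structMap[@TYPE='LOGICAL']", NS)
    if logical is None:
        return None
    for div in logical.iter(METS_DIV):
        div_id, div_type = div.get("ID"), div.get("TYPE")
        if not div_id or not div_type:
            continue
        if accepted_types is not None and div_type not in accepted_types:
            continue
        # cross-check only when the document has structLink entries at all
        if linked_ids and div_id not in linked_ids:
            continue
        return div
    return None


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    div = find_content_div(root)
    print(div.get("ID"), div.get("TYPE"))
```

Skipping the cross-check when no `mets:smLink` exists is a deliberate choice in this sketch, since ENMAP files may not use `mets:structLink` for this purpose. Whether that fallback is appropriate would need checking against real files.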