Open bertsky opened 2 years ago
Don't worry, this originates from my very first and superficial attempt at using ALTO to express additional content information.
It has nothing to do with Transkribus, which drops this element anyway due to its limited transformation capabilities.
With version 2.1 (2014), according to the ALTO schema, they introduced annotations like `LayoutTag`, `StructureTag`, `RoleTag`, `NamedEntityTag` and `OtherTag`. Nowadays I guess they were intended to express neat relations, even from a single `String` element's `TAGREFS` via the NER tag.
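To illustrate how such references resolve, here is a minimal sketch in Python. The sample document, its IDs and labels, and the helper name `resolve_tagrefs` are made up for illustration; the sample is also reduced (positional attributes that the schema requires are left out), so it is not a valid ALTO file. Matching on local names keeps the sketch independent of the ALTO namespace version.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced ALTO 2.x snippet (not schema-valid:
# required attributes such as HPOS/VPOS/WIDTH/HEIGHT are omitted).
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Tags>
    <NamedEntityTag ID="NE1" TYPE="person" LABEL="Johann Wolfgang von Goethe"/>
    <LayoutTag ID="LT1" LABEL="advertisement"/>
  </Tags>
  <Layout>
    <Page ID="p1">
      <PrintSpace>
        <TextBlock ID="b1" TAGREFS="LT1">
          <TextLine>
            <String CONTENT="Goethe" TAGREFS="NE1"/>
            <String CONTENT="schrieb"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(element):
    """Return the tag name without its '{namespace}' prefix."""
    return element.tag.rsplit("}", 1)[-1]


def resolve_tagrefs(alto_xml):
    """Map every element carrying TAGREFS to the tags it points at.

    Returns a list of (element name, element ID or CONTENT, tag kind, tag LABEL).
    """
    root = ET.fromstring(alto_xml)

    # Index all children of <Tags> by their ID attribute.
    tags = {}
    for element in root.iter():
        if local_name(element) == "Tags":
            for tag in element:
                tags[tag.get("ID")] = (local_name(tag), tag.get("LABEL"))

    # TAGREFS is a whitespace-separated list of IDs (xsd:IDREFS).
    resolved = []
    for element in root.iter():
        refs = element.get("TAGREFS")
        if not refs:
            continue
        for ref in refs.split():
            if ref in tags:
                kind, label = tags[ref]
                ident = element.get("ID") or element.get("CONTENT")
                resolved.append((local_name(element), ident, kind, label))
    return resolved
```

Calling `resolve_tagrefs(ALTO_SAMPLE)` yields one entry for the block (pointing at the `LayoutTag`) and one for the single `String` that carries a reference to the `NamedEntityTag`.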
If I were to do it again (... which is not planned), I'd go for the `ComposedBlockType/@TYPE` attribute, which is a string expressing what sort of content the included sub-regions are made of: table, advertisement, ... (example values from the ALTO schema definition).
The type stuff for Blocks (and Illustrations!) seems to have been part of the spec since the very beginning. The prelude dates it back to 2004, even before version 1.3 of ALTO was tagged.
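As a rough sketch of what reading that attribute could look like, the following Python snippet counts the `TYPE` values of all `ComposedBlock` elements. The sample XML, its IDs and the function name `composed_block_types` are assumptions for illustration only, and the sample again leaves out attributes that the schema requires.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, reduced ALTO snippet; only the TYPE attribute matters here.
COMPOSED_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="p1">
      <PrintSpace>
        <ComposedBlock ID="c1" TYPE="advertisement">
          <TextBlock ID="b1"/>
          <Illustration ID="i1"/>
        </ComposedBlock>
        <ComposedBlock ID="c2" TYPE="table">
          <TextBlock ID="b2"/>
        </ComposedBlock>
        <ComposedBlock ID="c3">
          <TextBlock ID="b3"/>
        </ComposedBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def composed_block_types(alto_xml):
    """Count ComposedBlock elements per TYPE value, ignoring the namespace."""
    root = ET.fromstring(alto_xml)
    counts = Counter()
    for element in root.iter():
        if element.tag.rsplit("}", 1)[-1] == "ComposedBlock":
            # TYPE is optional, so untyped blocks get their own bucket.
            counts[element.get("TYPE", "(untyped)")] += 1
    return counts
```

Since `TYPE` is a free-form string, any evaluation built on it would still need an agreed vocabulary of values.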
https://github.com/ulb-sachsen-anhalt/digital-eval/blob/98a9c243447444c04cb7634eb588b7711bfa9b44/digital_eval/model.py#L234-L237
I wonder what ALTO version `OtherTag/@VALUE` conforms to. Is that a Transkribus or ULB extension, @M3ssman?
Generally, IMO we do need to support this kind of information in the annotation files themselves (PAGE/ALTO), but we should also consider the case where it enters as metadata (METS/MODS). For the latter, we have the https://github.com/ocr-d/gt-labelling schema, but that does not contain any definitions of subject/genre/content class yet. There is a classification schema for content items in ENMAP (§10 Annex 2), and a set of newspaper article types in DTABf, for example. Somewhat related, one could also consider the non-structural (i.e. metadata) types of the DFG Strukturdatenset, or the general set of text sorts in DTA and DWDS...
Anyway, back to the annotation schema in ALTO: why `OtherTag` in the first place? Shouldn't this kind of information be placed in `LayoutTag` by convention? On the PAGE side, it's always `MetadataItem`, I suppose.
Here I made a proposal to mirror the gt-labelling info from MODS into the `MetadataItem` in PAGE, BTW.
@kba RFC