ulb-sachsen-anhalt / digital-eval

Evaluate data from mass digitalization workflows
MIT License

ALTO reader: GT type (subject) from @VALUE ? #3

Open bertsky opened 2 years ago

bertsky commented 2 years ago

https://github.com/ulb-sachsen-anhalt/digital-eval/blob/98a9c243447444c04cb7634eb588b7711bfa9b44/digital_eval/model.py#L234-L237

I wonder what ALTO version OtherTag/@VALUE conforms to. Is that a Transkribus or ULB extension @M3ssman ?

Generally, IMO we do need to support this kind of information in the annotation files themselves (PAGE/ALTO), but should also consider the case where it enters as metadata (METS/MODS). For the latter, we have the https://github.com/ocr-d/gt-labelling schema, but that does not contain any definitions of subject/genre/content class yet. There are, for example, a classification schema for content items in ENMAP (§10 Annex 2) and a set of newspaper article types in DTABf. Somewhat related, one could also consider the non-structural (i.e. metadata) types of the DFG Strukturdatenset, or the general set of text sorts in DTA and DWDS...

Anyway, back to the annotation schema in ALTO: Why OtherTag in the first place – shouldn't this kind of information be placed in LayoutTag by convention? On the PAGE side, it's always MetadataItem I suppose.

Here I made a proposal to mirror the gt-labelling info from MODS into the MetadataItem in PAGE BTW.

@kba RFC

M3ssman commented 2 years ago

Don't worry, this originates from my very first and superficial interpretation of how ALTO could express additional content information.

It has nothing to do with Transkribus, which drops this element anyway due to its limited transformation capabilities.

With version 2.1 (2014), according to the ALTO schema, they introduced annotations like LayoutTag, StructureTag, RoleTag, NamedEntityTag and OtherTag. Nowadays I guess they were intended to express neat relations even from a single String element's TAGREFS via the NER tag.
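To make the tag mechanism concrete, here is a minimal sketch of how TAGREFS on a block or String could be resolved against the entries in the Tags section. The sample document, the helper names `local_name` and `resolve_tagrefs`, and the choice of `LABEL` as the attribute to read are assumptions for illustration only; this is not the code in digital-eval, and the sample is not meant to be schema-valid.

```python
import xml.etree.ElementTree as ET

# Hypothetical, stripped-down ALTO document (not schema-valid, illustration only).
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Tags>
    <OtherTag ID="TAG1" LABEL="newspaper"/>
    <NamedEntityTag ID="TAG2" TYPE="person" LABEL="Goethe"/>
  </Tags>
  <Layout>
    <Page ID="p1">
      <PrintSpace>
        <TextBlock ID="b1" TAGREFS="TAG1">
          <TextLine>
            <String CONTENT="Goethe" TAGREFS="TAG2"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def resolve_tagrefs(xml_text):
    """Map each element carrying TAGREFS to a list of (tag kind, LABEL) pairs."""
    root = ET.fromstring(xml_text)

    # Collect all tag definitions from the Tags section, keyed by their ID.
    tags = {}
    for el in root.iter():
        if local_name(el.tag) == "Tags":
            for tag in el:
                tags[tag.get("ID")] = (local_name(tag.tag), tag.get("LABEL"))

    # Resolve the whitespace-separated references on every element that has them.
    resolved = {}
    for el in root.iter():
        refs = el.get("TAGREFS")
        if refs:
            key = el.get("ID") or el.get("CONTENT")
            resolved[key] = [tags[r] for r in refs.split() if r in tags]
    return resolved


result = resolve_tagrefs(SAMPLE)
```

Matching on local names rather than a fixed namespace is a deliberate shortcut here, since the namespace URI differs between ALTO major versions.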

If I were to do it again (... which is not planned), I'd go for the ComposedBlockType@TYPE attribute, which is meant to be a string expressing what sort of content the included sub-regions are made of: table, advertisement, ... (example values from the ALTO schema definition).
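As a rough sketch of that alternative, the snippet below collects the TYPE attribute of every ComposedBlock in a document. The sample XML and the function name `composed_block_types` are made up for illustration, and the sample is deliberately minimal rather than schema-valid.

```python
import xml.etree.ElementTree as ET

# Hypothetical, stripped-down ALTO document (not schema-valid, illustration only).
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="p1">
      <PrintSpace>
        <ComposedBlock ID="cb1" TYPE="advertisement">
          <TextBlock ID="b1"/>
        </ComposedBlock>
        <ComposedBlock ID="cb2" TYPE="table"/>
        <ComposedBlock ID="cb3"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def composed_block_types(xml_text):
    """Return {ComposedBlock ID: TYPE or None}, ignoring the namespace URI."""
    root = ET.fromstring(xml_text)
    return {
        el.get("ID"): el.get("TYPE")
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "ComposedBlock"
    }


types = composed_block_types(SAMPLE)
```

Blocks without a TYPE come back as `None`, so a caller would still need a fallback for untyped ground truth.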

The type stuff for Blocks (and Illustrations!) seems to have been part of the spec since the very beginning. In the prelude it is dated back to 2004, even before Version 1.3 of ALTO was tagged.