tboenig / page2page

This repository stores the stylesheet and workaround for transforming the proprietary PAGE XML files from Transkribus (https://transkribus.eu/Transkribus) into valid PAGE XML (https://www.primaresearch.org/schema/PAGE/gts/pagecontent/, newest version from 2019-07-16).

translate TranskribusMetadata into v>=2018 MetadataItem #1

Open bertsky opened 2 years ago

bertsky commented 2 years ago

The valuable information does not have to be removed. Transforming not just the attributes of TranskribusMetadata, but also its recursive Property elements, into MetadataItem Labels would be worthwhile IMO.

I believe the schema under https://github.com/Transkribus/TranskribusPageformat/blob/master/pagecontent_extension.xsd is not up to date anymore. The current version seems to be under https://gitlab.com/readcoop/transkribus/TranskribusCore/-/blob/master/src/main/resources/xsd/pagecontent_extension.xsd

tboenig commented 2 years ago

Hi, @bertsky;

It may be true that the attributes contain valuable information, but because their keys and values are not assigned from a controlled vocabulary, it is difficult to convert them into real PAGE values. For this reason, I would ignore them for now.

Key/value documentation: key of the property - this could be pre-specified keys like "lang", "layout", "year_from", "year_to", "style", "weight", or metadata from the automatic process such as "editor", "editordate", or user-defined properties such as "numbering"

comment: There are suggestions for keys in the documentation. The author or the system does not have to follow these.

bertsky commented 2 years ago

Sure, without an actual schema it's risky to rely on these values. But at least for humans this is always useful. And there are no restrictions for names or text under PAGE's MetadataItem/Labels/Label element.

And once you accept the risk of being incomplete or becoming outdated when parsing them into standard metadata formats, it might still be worth trying, if ambitious. (Merely converting them to a syntactically allowed representation under MetadataItem would only be a prerequisite to that additional goal.)
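
For illustration, the purely syntactic conversion discussed in this thread could look roughly like the following Python sketch. It is not the repository's stylesheet, only a minimal example under stated assumptions: the input uses the 2013-07-15 PAGE namespace, the `TranskribusMetadata` element sits directly under `Metadata`, and each `Property` carries `key` and `value` attributes. The function name, the sample attribute values, and the choice of `type="other"`, `name="Transkribus"` and `value="TranskribusMetadata"` for the new `MetadataItem` are hypothetical. Nested `Property` elements are simply flattened into one `Labels` group.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input; the attribute and property values are made up.
SAMPLE = """<Metadata xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Creator>TRP</Creator>
  <TranskribusMetadata docId="1" pageId="2" status="NEW">
    <Property key="lang" value="de"/>
  </TranskribusMetadata>
</Metadata>"""


def convert_transkribus_metadata(metadata):
    """Replace Metadata/TranskribusMetadata with a MetadataItem (sketch).

    Returns the new MetadataItem element, or None if there was nothing
    to convert. The namespace is taken from the Metadata element itself.
    """
    tag = metadata.tag
    ns = tag[1:].split("}", 1)[0] if tag.startswith("{") else ""

    def q(name):
        # Qualify a local name with the namespace of the input document.
        return "{%s}%s" % (ns, name) if ns else name

    tm = metadata.find(q("TranskribusMetadata"))
    if tm is None:
        return None

    # Assumed naming for the new item; MetadataItem requires a value.
    item = ET.Element(q("MetadataItem"), {
        "type": "other",
        "name": "Transkribus",
        "value": "TranskribusMetadata",
    })
    labels = ET.SubElement(item, q("Labels"))

    # Each attribute becomes one Label: type = attribute name.
    for name, value in tm.attrib.items():
        ET.SubElement(labels, q("Label"), {"type": name, "value": value})

    # Each Property (at any nesting depth) becomes one Label: type = key.
    for prop in tm.iter(q("Property")):
        ET.SubElement(labels, q("Label"), {
            "type": prop.get("key", ""),
            "value": prop.get("value", ""),
        })

    # MetadataItem comes last in the PAGE Metadata sequence, so append it.
    metadata.remove(tm)
    metadata.append(item)
    return item


if __name__ == "__main__":
    meta = ET.fromstring(SAMPLE)
    convert_transkribus_metadata(meta)
    print(ET.tostring(meta, encoding="unicode"))
```

This only produces a syntactically allowed representation. Mapping individual keys such as "lang" onto standard metadata fields would be a separate step on top of it, with the risks mentioned above.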