kba / page-to-alto

Convert PAGE (v. 2019) to ALTO (v. 2.0 - 4.2)
Apache License 2.0

output Created/LastChange timestamp as processingDateTime, fix #36 #37

Closed kba closed 8 months ago

kba commented 8 months ago

With this PR, the alto:processingDateTime element of an alto:processingStep is set from either the pc:Created timestamp (--timestamp-src Created) or the pc:LastChange timestamp (--timestamp-src LastChange), or omitted as before (--timestamp-src none).
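
To illustrate the selection logic, here is a minimal Python sketch. It is not the code from this PR: the helper name `pick_processing_datetime` and the sample document are hypothetical, and it only assumes that the timestamps live under pc:Metadata in the PAGE 2019 namespace.

```python
import xml.etree.ElementTree as ET

# Namespace of the PAGE 2019-07-15 schema
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

# Hypothetical minimal PAGE document, for illustration only
SAMPLE_PAGE = f"""
<PcGts xmlns="{PAGE_NS}">
  <Metadata>
    <Creator>example</Creator>
    <Created>2020-01-01T10:00:00</Created>
    <LastChange>2021-06-15T12:30:00</LastChange>
  </Metadata>
</PcGts>
"""


def pick_processing_datetime(page_xml, timestamp_src):
    """Return the timestamp to use as processingDateTime, or None.

    timestamp_src mirrors the three values of --timestamp-src:
    "Created", "LastChange" or "none".
    """
    if timestamp_src == "none":
        # Behaviour before this PR: emit no processingDateTime at all
        return None
    if timestamp_src not in ("Created", "LastChange"):
        raise ValueError(f"unsupported timestamp source: {timestamp_src}")
    root = ET.fromstring(page_xml)
    # Both timestamps are document-wide children of pc:Metadata
    el = root.find(f"{{{PAGE_NS}}}Metadata/{{{PAGE_NS}}}{timestamp_src}")
    if el is None or not el.text:
        return None
    return el.text.strip()
```

Calling `pick_processing_datetime(SAMPLE_PAGE, "LastChange")` on the sample returns the pc:LastChange text, while `"none"` returns `None`, so the caller can skip the element altogether.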

This is not 100% correct, since Created and LastChange are document-wide and not step-specific. But AFAICS we have no other source for them, and it is important for our (@StaatsbibliothekBerlin) workflows to have at least an approximate date in the alto:processingSteps for versioning purposes.

bertsky commented 8 months ago

IMHO the correct representation would have been:

For ALTO v2 with its preProcessingStep|ocrProcessingStep|postProcessingStep distinction, one would probably have to map to:

But obviously, this is not ideal. However, since PAGE's Created/LastChange do not have clear semantics, I would argue this is the best pragmatic fit.

BTW, we are also still missing Metadata/Creator! IMO this should go into the contentGeneration (or preProcessingStep) entry.