Open mikegerber opened 4 years ago
This would be very useful!
Unfortunately, it will only work for ALTO, since PAGE-XML has no such provenance; one has to fall back on the METS container instead.
Also note that the <OCRProcessing> structure has been changed to <Processing> and heavily modified as of ALTO version 4.0.
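A rough sketch of how a report could cope with both element names, in case it helps. It matches on local tag names so the ALTO namespace version does not matter. The function name, the sample documents and the `example-ocr` software name are made up for illustration, and the child elements (`processingSoftware`, `softwareName`, `softwareVersion`) are what I recall from the ALTO schema, so please check them against real files:

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_processing_software(xml_text):
    """Return (container, softwareName, softwareVersion) tuples.

    Looks inside both <OCRProcessing> (before ALTO 4.0) and
    <Processing> (ALTO 4.0 and later). Hypothetical helper, not part
    of any existing tool.
    """
    root = ET.fromstring(xml_text)
    found = []
    for elem in root.iter():
        if local(elem.tag) not in ("OCRProcessing", "Processing"):
            continue
        for sw in elem.iter():
            if local(sw.tag) != "processingSoftware":
                continue
            # Collect the direct children of <processingSoftware> by local name.
            fields = {local(c.tag): (c.text or "").strip() for c in sw}
            found.append(
                (local(elem.tag), fields.get("softwareName"), fields.get("softwareVersion"))
            )
    return found


# Invented minimal documents, one per structure.
ALTO_V3 = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Description>
    <OCRProcessing ID="OCR_0">
      <ocrProcessingStep>
        <processingSoftware>
          <softwareName>example-ocr</softwareName>
          <softwareVersion>1.0</softwareVersion>
        </processingSoftware>
      </ocrProcessingStep>
    </OCRProcessing>
  </Description>
</alto>"""

ALTO_V4 = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Description>
    <Processing ID="OCR_0">
      <processingSoftware>
        <softwareName>example-ocr</softwareName>
        <softwareVersion>2.0</softwareVersion>
      </processingSoftware>
    </Processing>
  </Description>
</alto>"""

for doc in (ALTO_V3, ALTO_V4):
    print(alto_processing_software(doc))
```

Files without either element simply yield an empty list, so nothing would be displayed for them.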
For PAGE files:
<pc:Metadata>
  <pc:Creator>OCR-D/core 2.17.0</pc:Creator>
  <pc:Created>2020-10-02T09:13:28</pc:Created>
  <pc:LastChange>2020-10-02T09:13:28</pc:LastChange>
  <pc:MetadataItem type="processingStep" name="preprocessing/optimization/binarization" value="ocrd-olena-binarize">
    <pc:Labels>
      <pc:Label value="sauvola-ms-split" type="impl"/>
      <pc:Label value="0.34" type="k"/>
      <pc:Label value="0" type="win-size"/>
      <pc:Label value="0" type="dpi"/>
    </pc:Labels>
  </pc:MetadataItem>
  <pc:MetadataItem type="processingStep" name="layout/segmentation/region" value="ocrd-sbb-textline-detector">
    <pc:Labels externalModel="ocrd-tool" externalId="parameters">
      <pc:Label value="/var/lib/textline_detection" type="model"/>
    </pc:Labels>
  </pc:MetadataItem>
  <pc:MetadataItem type="processingStep" name="recognition/text-recognition" value="ocrd-calamari-recognize">
    <pc:Labels externalModel="ocrd-tool" externalId="parameters">
      <pc:Label value="/var/lib/calamari-models/GT4HistOCR/2019-07-22T15_49+0200/*.ckpt.json" type="checkpoint"/>
      <pc:Label value="glyph" type="textequiv_level"/>
      <pc:Label value="confidence_voter_default_ctc" type="voter"/>
      <pc:Label value="0.001" type="glyph_conf_cutoff"/>
    </pc:Labels>
  </pc:MetadataItem>
</pc:Metadata>
But note that only PAGE files produced by OCR-D include this information - I am not aware of any other tool producing PAGE output currently populating this section in this way.
Yeah, if it's not there it will not be displayed.
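For what it's worth, here is a minimal sketch of pulling the processing steps out of a PAGE file, returning an empty list when the section is missing. The function name is hypothetical, the sample is a shortened copy of the snippet above wrapped in a `PcGts` root, and the namespace URI in the sample is my assumption for the PAGE version OCR-D writes (the code itself only looks at local tag names):

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def page_processing_steps(xml_text):
    """Return one dict per <MetadataItem type="processingStep">.

    Hypothetical helper; yields [] when no such items are present.
    """
    root = ET.fromstring(xml_text)
    steps = []
    for elem in root.iter():
        if local(elem.tag) != "MetadataItem" or elem.get("type") != "processingStep":
            continue
        # Each <Label type="..." value="..."/> becomes one parameter entry.
        params = {
            label.get("type"): label.get("value")
            for label in elem.iter()
            if local(label.tag) == "Label"
        }
        steps.append({"name": elem.get("name"), "tool": elem.get("value"), "params": params})
    return steps


# Shortened sample; the namespace URI is an assumption.
SAMPLE = """<pc:PcGts xmlns:pc="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <pc:Metadata>
    <pc:Creator>OCR-D/core 2.17.0</pc:Creator>
    <pc:MetadataItem type="processingStep" name="preprocessing/optimization/binarization" value="ocrd-olena-binarize">
      <pc:Labels>
        <pc:Label value="sauvola-ms-split" type="impl"/>
        <pc:Label value="0.34" type="k"/>
      </pc:Labels>
    </pc:MetadataItem>
  </pc:Metadata>
</pc:PcGts>"""

for step in page_processing_steps(SAMPLE):
    print(step["name"], step["tool"], step["params"])
```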
ALTO files contain meta information like this:
The report should display it.