qurator-spk / dinglehopper

An OCR evaluation tool
Apache License 2.0

Display document page metadata #16

Open mikegerber opened 4 years ago

mikegerber commented 4 years ago

ALTO files contain metadata like this:

<OCRProcessing ID="IdOcr">
  <ocrProcessingStep>
    <processingDateTime>2014-05-21</processingDateTime>
    <processingSoftware>
      <softwareCreator>ABBYY</softwareCreator>
      <softwareName>ABBYY FineReader Engine</softwareName>
      <softwareVersion>11</softwareVersion>
    </processingSoftware>
  </ocrProcessingStep>
</OCRProcessing>

The report should display it.

cneud commented 3 years ago

This would be very useful!

Unfortunately, it will only work for ALTO, since PAGE-XML carries no such provenance; one has to fall back on the METS container instead.

Also note that the <OCRProcessing> structure has been changed to <Processing> and heavily modified as of ALTO version 4.0.
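
To illustrate how the report could pick this up, here is a minimal sketch in Python using only the standard library. It is not dinglehopper's actual code: the function names `local_name` and `alto_processing_steps` are hypothetical. The sketch assumes that the software details sit in a `processingSoftware` element in both the older `<OCRProcessing>` layout and the newer `<Processing>` layout, so it matches on local element names and ignores the namespace and the name of the wrapping element.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def alto_processing_steps(xml_text):
    """Return one dict per element that has a processingSoftware child.

    Hypothetical helper: it does not check the ALTO version, it only
    looks for the element names shown in this issue.
    """
    root = ET.fromstring(xml_text)
    steps = []
    for step in root.iter():
        children = {local_name(child.tag): child for child in step}
        software = children.get("processingSoftware")
        if software is None:
            continue
        # Collect softwareCreator, softwareName, softwareVersion, ...
        info = {local_name(c.tag): (c.text or "").strip() for c in software}
        date = children.get("processingDateTime")
        if date is not None:
            info["processingDateTime"] = (date.text or "").strip()
        steps.append(info)
    return steps
```

An empty list means the file carries no such information, in which case the report would simply have nothing to show.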

mikegerber commented 3 years ago

For PAGE files:

    <pc:Metadata>
        <pc:Creator>OCR-D/core 2.17.0</pc:Creator>
        <pc:Created>2020-10-02T09:13:28</pc:Created>
        <pc:LastChange>2020-10-02T09:13:28</pc:LastChange>
        <pc:MetadataItem type="processingStep" name="preprocessing/optimization/binarization" value="ocrd-olena-binarize">
            <pc:Labels>
                <pc:Label value="sauvola-ms-split" type="impl"/>
                <pc:Label value="0.34" type="k"/>
                <pc:Label value="0" type="win-size"/>
                <pc:Label value="0" type="dpi"/>
            </pc:Labels>
        </pc:MetadataItem>
        <pc:MetadataItem type="processingStep" name="layout/segmentation/region" value="ocrd-sbb-textline-detector">
            <pc:Labels externalModel="ocrd-tool" externalId="parameters">
                <pc:Label value="/var/lib/textline_detection" type="model"/>
            </pc:Labels>
        </pc:MetadataItem>
        <pc:MetadataItem type="processingStep" name="recognition/text-recognition" value="ocrd-calamari-recognize">
            <pc:Labels externalModel="ocrd-tool" externalId="parameters">
                <pc:Label value="/var/lib/calamari-models/GT4HistOCR/2019-07-22T15_49+0200/*.ckpt.json" type="checkpoint"/>
                <pc:Label value="glyph" type="textequiv_level"/>
                <pc:Label value="confidence_voter_default_ctc" type="voter"/>
                <pc:Label value="0.001" type="glyph_conf_cutoff"/>
            </pc:Labels>
        </pc:MetadataItem>
    </pc:Metadata>

cneud commented 3 years ago

But note that only PAGE files produced by OCR-D include this information - I am not aware of any other tool producing PAGE output currently populating this section in this way.

mikegerber commented 3 years ago

Yeah, if it's not there it will not be displayed.
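
A rough sketch of that behaviour for PAGE files, in Python with only the standard library. The function names `local_name` and `page_metadata` are hypothetical and this is not dinglehopper's implementation. It assumes the `Metadata` layout from the OCR-D example earlier in this thread and returns `None` when the section is absent, so a caller can skip the display.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def page_metadata(xml_text):
    """Return creator, timestamps and processing steps, or None.

    Hypothetical helper: None means the file has no Metadata element.
    """
    root = ET.fromstring(xml_text)
    metadata = next(
        (e for e in root.iter() if local_name(e.tag) == "Metadata"), None
    )
    if metadata is None:
        return None
    result = {"steps": []}
    for child in metadata:
        name = local_name(child.tag)
        if name in ("Creator", "Created", "LastChange"):
            result[name] = (child.text or "").strip()
        elif name == "MetadataItem" and child.get("type") == "processingStep":
            # Each Label holds one parameter as type/value attributes.
            labels = {
                label.get("type"): label.get("value")
                for label in child.iter()
                if local_name(label.tag) == "Label"
            }
            result["steps"].append({
                "name": child.get("name"),
                "value": child.get("value"),
                "labels": labels,
            })
    return result
```

Files from tools that do not write `MetadataItem` entries would yield an empty `steps` list, so only the creator and timestamps would be available to show.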