kitodo / kitodo-presentation

Kitodo.Presentation is a feature-rich framework for building a METS- or IIIF-based digital library. It is part of the Kitodo Digital Library Suite.
https://kitodo.org
GNU General Public License v3.0

Support for multiple namespaces #488

Open sebastian-meyer opened 4 years ago

sebastian-meyer commented 4 years ago

Many formats used in the context of Kitodo reflect their versioning in the namespace (e.g. MODS, ALTO, IIIF, TEI). We therefore need to add default support for more than one version (although multiple versions could use the same parser).

This issue was raised by https://github.com/tesseract-ocr/tesseract/pull/2815

bertsky commented 3 years ago

> although multiple versions could use the same parser

To that: I completely agree, and it seems at least the ALTO parser currently does already tolerate multiple namespace versions. It would be great if the planned changes upheld that as a principle.

(And strategically, I think it is also the best choice for data providers and workflows. We wouldn't want to discourage anyone from ingesting the best possible version of their fulltext – even if the current Presentation cannot make full use of it yet. Features like polygonal regions, angle/orientation, layout tags that distinguish text region types and image region types, baseline curves, and glyphs could all prove very valuable to future versions of Presentation. Moreover, they can be immediately useful for downstream applications of researchers who simply download the fulltext.)

bertsky commented 1 year ago

@sebastian-meyer notified me (via other communication) that one necessary ingredient is v4 support in the Solr indexer.

So as long as this is not integrated/configured/tested, we still have to ingest v2, otherwise there will be full text highlighting, but no search.

sebastian-meyer commented 1 year ago

To be precise: the problem here is not the indexer itself, because that just takes whatever text it gets. But we have parsers for every supported fulltext format (for ALTO it's https://github.com/kitodo/kitodo-presentation/blob/master/Classes/Format/Alto.php) and those are currently using hard-coded namespace URIs instead of the actual URIs configured in the format table.
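As a rough illustration of what using configured URIs instead of hard-coded ones could look like, here is a minimal sketch. It assumes the configured namespace URIs are available as a plain array; the function name `registerConfiguredNamespace` and the `$configuredUris` parameter are hypothetical and not part of Kitodo.Presentation, and how the values would actually be read from the format table is left open.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of Kitodo.Presentation): bind the XPath
 * prefix to whichever configured namespace URI the document root uses.
 *
 * @param string[] $configuredUris stand-in for the URIs from the format table
 * @return string|null the registered URI, or null if none matched
 */
function registerConfiguredNamespace(SimpleXMLElement $xml, string $prefix, array $configuredUris): ?string
{
    // Namespace URI of the root element itself, e.g. ".../alto/ns-v4#".
    $rootUri = dom_import_simplexml($xml)->namespaceURI;

    if ($rootUri !== null && in_array($rootUri, $configuredUris, true)) {
        $xml->registerXPathNamespace($prefix, $rootUri);
        return $rootUri;
    }
    return null;
}

// Assumed configuration; in practice this would come from the format table.
$configuredUris = [
    'http://www.loc.gov/standards/alto/ns-v2#',
    'http://www.loc.gov/standards/alto/ns-v3#',
    'http://www.loc.gov/standards/alto/ns-v4#',
];

$xml = simplexml_load_string(
    '<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#"><Layout/></alto>'
);
$registered = registerConfiguredNamespace($xml, 'alto', $configuredUris);
$layouts = $xml->xpath('//alto:Layout');
```

Returning `null` for an unconfigured namespace lets the caller decide whether to skip the file or report it, rather than running XPath queries with an unregistered prefix.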

Also, we have to take into account the solr_ocrhighlighting plugin we are using, because that interprets ALTO as well in order to get the word coordinates. I am not sure which ALTO versions are supported by this plugin.

bertsky commented 1 year ago

> Also, we have to take into account the solr_ocrhighlighting plugin we are using, because that interprets ALTO as well in order to get the word coordinates. I am not sure which ALTO versions are supported by this plugin.

Looks like it is version-agnostic as well.

stweil commented 10 months ago

UB Braunschweig creates ALTO 4.2 in their Kitodo-OCR-D workflow, so that seems to work fine.

michaelkubina commented 7 months ago

A quick solution in the ALTO case would be to change the way the namespace is registered in getRawText() and getTextAsMiniOcr() in Classes/Format/Alto.php. We could simply check which namespace the ALTO file uses and register it accordingly, like this:

        // instead of this...
        //$xml->registerXPathNamespace('alto', 'http://www.loc.gov/standards/alto/ns-v2#');

        //...we could use this
        $namespace = $xml->getDocNamespaces();

        if (in_array('http://www.loc.gov/standards/alto/ns-v2#', $namespace, true)) {
            $xml->registerXPathNamespace('alto', 'http://www.loc.gov/standards/alto/ns-v2#');
        }

        if (in_array('http://www.loc.gov/standards/alto/ns-v3#', $namespace, true)) {
            $xml->registerXPathNamespace('alto', 'http://www.loc.gov/standards/alto/ns-v3#');
        }

        if (in_array('http://www.loc.gov/standards/alto/ns-v4#', $namespace, true)) {
            $xml->registerXPathNamespace('alto', 'http://www.loc.gov/standards/alto/ns-v4#');
        }

This way, we could at least quickly allow for the use of all ALTO file versions, instead of waiting until the multiple namespace issue is solved in general, which from what I see is a far more complicated task.

After studying the changes in the ALTO schema versions, the new schema versions after v2.1 do not change anything for getRawText(), since we only extract the text from the @CONTENT attribute.

For getTextAsMiniOcr() a possible issue could arise from the change of the type of HPOS, VPOS, WIDTH and HEIGHT after schema version 3.0, which is then an xsd:float instead of an xsd:int. Other than that, we are still only interested in TextLine and String and simply do not use other features for now. So far I have not been able to find ALTO files where actual fractions are used for position attributes. Tesseract exports ALTO v3 with whole integers, and Kraken exports ALTO v4 with whole integers as well (https://kraken.re/4.0/ketos.html).
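If fractional values ever do show up, one way to stay safe would be to read the position attributes as floats and round them before writing them out. The sketch below is illustrative only: `readBox` is a hypothetical helper, not code from Alto.php, and whether rounding is needed at all depends on what the consumer of the MiniOCR output accepts, which is not verified here.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: read HPOS, VPOS, WIDTH and HEIGHT of an ALTO element
 * as floats and round them to whole numbers.
 *
 * @return int[] [hpos, vpos, width, height]
 */
function readBox(SimpleXMLElement $element): array
{
    $box = [];
    foreach (['HPOS', 'VPOS', 'WIDTH', 'HEIGHT'] as $name) {
        // The float cast accepts both "120" (integer style) and "120.6" (float style).
        $box[] = (int) round((float) (string) $element[$name]);
    }
    return $box;
}

// Hand-written sample element with fractional coordinates (an assumption;
// no such file was found in practice).
$string = simplexml_load_string(
    '<String CONTENT="Kitodo" HPOS="120.6" VPOS="45" WIDTH="310.25" HEIGHT="38.5"/>'
);
$box = readBox($string);
```

Reading through `(float)` first means files with whole integers behave exactly as before, while fractional values are rounded instead of being truncated or passed through unchanged.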

Testing

We are currently using Solr 9.3.0 with version 0.8.3 of the dbmdz.solr-ocrhighlighting module. Indexing the OCR from both ALTO v2 and v3 works fine with the proposed solution. I was not able to test it for ALTO v4, since we don't produce any such files at the moment. Highlighting (snippets) in the list view works as well, no issues here.
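For the untested v4 case, the namespace handling itself can at least be checked against a synthetic file. The following is a minimal sketch, not the code from Alto.php: `extractText` is a hypothetical stand-in that applies the registration logic proposed above and joins the @CONTENT attributes, and the sample documents are hand-written minimal fragments rather than real, schema-valid OCR output. It says nothing about Solr indexing or highlighting.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in for the parser: register the ALTO namespace the
 * file declares, then join the CONTENT attributes of all String elements.
 */
function extractText(string $altoXml): string
{
    $xml = simplexml_load_string($altoXml);
    $declared = $xml->getDocNamespaces();

    // Same idea as the three if-blocks in the proposal, written as a loop.
    foreach (['v2', 'v3', 'v4'] as $version) {
        $uri = 'http://www.loc.gov/standards/alto/ns-' . $version . '#';
        if (in_array($uri, $declared, true)) {
            $xml->registerXPathNamespace('alto', $uri);
        }
    }

    $words = [];
    foreach ($xml->xpath('//alto:String') as $string) {
        $words[] = (string) $string['CONTENT'];
    }
    return implode(' ', $words);
}

// Hand-written minimal fragment; only the namespace differs per version.
$template = '<alto xmlns="%s"><Layout><Page><PrintSpace><TextBlock><TextLine>'
    . '<String CONTENT="Hello"/><String CONTENT="ALTO"/>'
    . '</TextLine></TextBlock></PrintSpace></Page></Layout></alto>';

$results = [];
foreach (['v2', 'v3', 'v4'] as $version) {
    $uri = 'http://www.loc.gov/standards/alto/ns-' . $version . '#';
    $results[$version] = extractText(sprintf($template, $uri));
}
```

Real v4 files from an OCR workflow would still be needed to confirm the behaviour end to end.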

Maybe others would like to test as well?

stweil commented 1 month ago

> it seems at least the ALTO parser currently does already tolerate (https://github.com/OCR-D/core/issues/544#issuecomment-868233760) multiple namespace versions

> UB Braunschweig creates ALTO 4.2 in their Kitodo-OCR-D workflow, so that seems to work fine.

We just noticed that only ALTO v2 is fully supported!

While Kitodo.Presentation is able to show the fulltext for v2 and newer ALTO versions (JavaScript code), it fails to add the fulltext of the newer versions to its search index (PHP code), as we noticed this week.

michaelkubina commented 1 month ago

I made a PR a while ago that allows indexing of ALTO 2, 3 and 4, see #1117. It was tested with Solr 9.3.0 and version 0.8.3 of the OCR module.

We use it this way in our current production environment. It's just a small change to the ALTO parser and the MiniOCR conversion that registers the individual namespace of the ALTO file.