kitodo / kitodo-presentation

Kitodo.Presentation is a feature-rich framework for building a METS- or IIIF-based digital library. It is part of the Kitodo Digital Library Suite.
https://kitodo.github.io/kitodo-presentation/
GNU General Public License v3.0

[FUND] Update and improve Solr compatibility #594

Closed albig closed 3 weeks ago

albig commented 3 years ago

Description

On high-traffic installations with many fulltext documents (>200,000), the performance of the Solr index degrades. The cause is the permanent indexing of new documents in parallel with heavy search usage. This affects not only the search plugin but also the collection and OAI plugins.

Some research has already been done and tasks are identified in #454.

The goal of this proposal is to update all Solr-related code and configuration in order to use the newest version of Apache Solr and make installation and configuration as easy and well-documented as possible.

Expected benefits of this development

Estimated Costs and Complexity

This issue has high complexity and medium cost.

Related Issues

sebastian-meyer commented 1 year ago

Reintroducing this for the development fund 2023. The issue has become more urgent recently, not only because new features regularly require reindexing all documents, but also because current versions of Solr have gone beyond deprecating index-time boosting and no longer support it at all. This effectively prevents us from using an up-to-date Solr version.

sebastian-meyer commented 1 year ago

Votes: 12

michaelkubina commented 1 year ago

We at the SUB Hamburg have already made several adjustments to our Solr instances (configurations, schemata and some tweaks in the Kitodo.Presentation sources). We are currently using Solr 8.11.1 in our live systems and are experimenting with Solr 9.1.1 in our dev systems (which works fine btw, after some smaller adjustments). Some of those changes and insights resulted in PRs improving search; others are still in the works.

If this topic gets further traction, I would happily offer my help and would like to join the discussion. For me it's an important topic for improving overall performance (indexing, retrieval and maintenance). And of course some things we are working on could be negatively affected if development on the Solr side took unexpected turns.

michaelkubina commented 1 year ago

Hello Uli, as promised I am sharing the analyzer chain that we applied to the standard field type and to the text_ocr field type. All filters (except for the OCR highlighting filter) are part of Solr itself (as you already know, they are documented here: https://solr.apache.org/guide/solr/latest/indexing-guide/filters.html ). In the Solr admin UI you can inspect how the filters get applied by using the "Analysis" tab that you find within your core.
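
For checking the chain outside the admin UI, a minimal sketch: Solr ships an implicit field analysis handler at /analysis/field, and the snippet below only builds the request URL for it. The base URL and the core name "dlf" are placeholders assumed for illustration, not values from this thread.

```python
from urllib.parse import urlencode


def analysis_url(base_url, core, field_type, index_text, query_text):
    """Build a URL for Solr's field analysis handler (/analysis/field).

    base_url and core are deployment-specific; the values used below are
    placeholders, not taken from any real installation.
    """
    params = {
        "analysis.fieldtype": field_type,   # field type to analyze with, e.g. "standard"
        "analysis.fieldvalue": index_text,  # text run through the index-time analyzer
        "analysis.query": query_text,       # text run through the query-time analyzer
        "wt": "json",
    }
    return f"{base_url}/{core}/analysis/field?{urlencode(params)}"


url = analysis_url("http://localhost:8983/solr", "dlf", "standard",
                   "Dampf-Schiff", "dampfschiff")
print(url)
```

Opening the printed URL in a browser (or fetching it with any HTTP client) should return the tokens produced after each stage of the index and query analyzers, which is handy for comparing two schema versions.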

Foremost: we have added a _version_ field to the index. It lets Solr keep track of document versions, which does no harm, and it is a prerequisite for partial document updates such as atomic updates (https://solr.apache.org/guide/solr/latest/indexing-guide/partial-document-updates.html#atomic-updates). These are useful when only small corrections should be applied to indexed documents (like changing a URL prefix in a specific field), instead of re-indexing a whole document with all its fulltexts and logical structure. I believe it's useful to have this option, even if one does not make use of it. Without the _version_ field, partial document updates won't be possible.
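
To illustrate what such an atomic update looks like, here is a minimal sketch that builds the JSON body using Solr's "set" modifier. The unique key name "id", the field name "thumbnail" and the example values are assumptions for illustration only; the snippet does not send anything to Solr.

```python
import json


def build_atomic_update(doc_id, field, new_value):
    """Return a JSON body for a Solr atomic update that replaces one field.

    Assumes the schema's unique key is called "id" (hypothetical here).
    """
    # The "set" modifier replaces the value of `field`; fields that are not
    # mentioned in the update keep their current values.
    return json.dumps([{"id": doc_id, field: {"set": new_value}}])


body = build_atomic_update("doc-123", "thumbnail",
                           "https://example.org/new/thumb.jpg")
print(body)
```

A body like this would typically be POSTed to the core's /update handler with Content-Type application/json, followed by a commit. Note that atomic updates also require the affected fields to be stored or to have docValues, so it is worth checking the schema before relying on them.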

The sorting-related change of the field type for *_sorting is already in place due to a past commit to this branch, so no changes are needed here. Just remember that a new Solr core must be indexed for it to work; otherwise one will run into exceptions.

The standard fieldtype has seen several changes, most notably:

<fieldType name="standard" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <!-- michaelkubina: tokenize at whitespace -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- michaelkubina: transliterate according to ICU Transformation to latin script -->
        <filter class="solr.ICUTransformFilterFactory" id="Any-Latin"/>
        <!-- michaelkubina: apply ICU Folding on latin script (basically like ascii folding) -->
        <filter name="icuFolding"/>
        <!-- michaelkubina: lowercase tokens as soon as possible -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- michaelkubina: catenate hyphenated words or combinations of alphanumericals; camelcase won't occur because of the preceding lowercase filter; removes all non-alphanumericals as well -->
        <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" preserveOriginal="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
        <!-- michaelkubina: flatten word graph -->
        <filter class="solr.FlattenGraphFilterFactory"/>
        <!-- only needed in the index-analyzer -->
        <!-- michaelkubina: keep keywords as duplicate tokens and prevent them from getting stemmed -->
        <filter class="solr.KeywordRepeatFilterFactory"/>
        <!-- michaelkubina: do the stemming -->
        <filter class="solr.SnowballPorterFilterFactory" language="German" protected="protwords.txt"/>
        <!-- michaelkubina: remove duplicate tokens for the same position increment -->
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <!-- michaelkubina: tokenize at whitespace -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- michaelkubina: allow synonym-aggregation at query-time -->
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <!-- michaelkubina: transliterate according to ICU Transformation to latin script -->
        <filter class="solr.ICUTransformFilterFactory" id="Any-Latin"/>
        <!-- michaelkubina: apply ICU Folding on latin script (basically like ascii folding) -->
        <filter name="icuFolding"/>
        <!-- michaelkubina: lowercase tokens as soon as possible -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- michaelkubina: catenate hyphenated words or combinations of alphanumericals; camelcase won't occur because of the preceding lowercase filter; removes all non-alphanumericals as well -->
        <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" preserveOriginal="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
        <!-- michaelkubina: keep keywords as duplicate tokens and prevent them from getting stemmed -->
        <filter class="solr.KeywordRepeatFilterFactory"/>
        <!-- michaelkubina: do the stemming -->
        <filter class="solr.SnowballPorterFilterFactory" language="German" protected="protwords.txt"/>
        <!-- michaelkubina: remove duplicate tokens for the same position increment -->
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
</fieldType>

The text_ocr fieldtype is very similar, but has some additional filters:

<fieldType name="text_ocr" class="solr.TextField" storeOffsetsWithPositions="true" termVectors="true">
    <analyzer type="index">
        <!-- michaelkubina: account for some ocr-engines escaping html characters -->
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <!-- michaelkubina: tokenize at whitespace -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- michaelkubina: transliterate according to ICU Transformation to latin script -->
        <filter class="solr.ICUTransformFilterFactory" id="Any-Latin"/>
        <!-- michaelkubina: apply ICU Folding on latin script (basically like ascii folding) -->
        <filter name="icuFolding"/>
        <!-- michaelkubina: lowercase tokens as soon as possible -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- michaelkubina: compound tokens if hyphen at the end of one token suggests it being part of a compound word with the then following token -->
        <filter class="solr.HyphenatedWordsFilterFactory"/>
        <!-- michaelkubina: catenate hyphenated words or combinations of alphanumericals; camelcase won't occur because of the preceding lowercase filter; removes all non-alphanumericals as well -->
        <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" preserveOriginal="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
        <!-- michaelkubina: flatten word graph -->
        <filter class="solr.FlattenGraphFilterFactory"/>
        <!-- michaelkubina: keep keywords as duplicate tokens and prevent them from getting stemmed -->
        <filter class="solr.KeywordRepeatFilterFactory"/>
        <!-- michaelkubina: remove any trailing or leading whitespaces from tokens, if it happened for any reason -->
        <filter class="solr.TrimFilterFactory"/>
        <!-- michaelkubina: do the stemming -->
        <filter class="solr.SnowballPorterFilterFactory" language="German" protected="protwords.txt"/>
        <!-- michaelkubina: reverse all tokens, so that they can be found faster in a reverse wildcard search (only needed at index-time) -->
        <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true" maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
        <!-- michaelkubina: remove duplicate tokens for the same position increment -->
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <!-- michaelkubina: tokenize at whitespace -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- michaelkubina: transliterate according to ICU Transformation to latin script -->
        <filter class="solr.ICUTransformFilterFactory" id="Any-Latin"/>
        <!-- michaelkubina: apply ICU Folding on latin script (basically like ascii folding) -->
        <filter name="icuFolding"/>
        <!-- michaelkubina: lowercase tokens as soon as possible -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- michaelkubina: allow synonym-aggregation at query-time -->
        <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <!-- michaelkubina: catenate hyphenated words or combinations of alphanumericals; camelcase won't occur because of the preceding lowercase filter; removes all non-alphanumericals as well -->
        <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" preserveOriginal="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
        <!-- michaelkubina: keep keywords as duplicate tokens and prevent them from getting stemmed -->
        <filter class="solr.KeywordRepeatFilterFactory"/>
        <!-- michaelkubina: remove any trailing or leading whitespaces from tokens, if it happened for any reason -->
        <filter class="solr.TrimFilterFactory"/>
        <!-- michaelkubina: do the stemming -->
        <filter class="solr.SnowballPorterFilterFactory" language="German" protected="protwords.txt"/>
        <!-- michaelkubina: remove duplicate tokens for the same position increment -->
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
</fieldType>

michaelkubina commented 1 year ago

Part 2:

The solrconfig.xml has some changes as well, though not as complex as those in the schema, so there is likely more room for optimization here. As you have already realized, the plugins are now in the modules folder, not contrib. The Velocity browser has been removed, so there is no need to keep its configuration in place.

        <updateLog>
            <str name="dir">${solr.ulog.dir:}</str>
            <int name="numVersionBuckets">${solr.ulog.numVersionBuckets:65536}</int>
        </updateLog>

If you have further questions, don't hesitate to ask.