Open sebastian-meyer opened 4 years ago
I believe https://github.com/kitodo/kitodo-presentation/pull/892 addressed point 2 ("Update Solr configuration") and point 6 ("Newest version of ocr-ocrhighlighting").
You are right about point 6! I forgot to check the mark above.
But point 2 is still valid, I believe. solrconfig.xml states version 7.4.0 while the most current version of Solr is 9.1.1. A diff between the default solrconfig.xml from Solr 9.1.1 and our solrconfig.xml shows some differences which should be examined in order to maintain compatibility with the newest version.
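A minimal sketch of how such a comparison could be scripted, assuming both files are available as text. The two XML fragments below are illustrative placeholders, not the real configuration files, and the function name config_diff is hypothetical:

```python
import difflib


def config_diff(default_xml: str, local_xml: str) -> list[str]:
    """Return a unified diff from the Solr default config to our config."""
    return list(
        difflib.unified_diff(
            default_xml.splitlines(),
            local_xml.splitlines(),
            fromfile="solr-default/solrconfig.xml",
            tofile="ours/solrconfig.xml",
            lineterm="",
        )
    )


# Placeholder fragments; in practice these would be read from the two files.
default_xml = """<config>
  <luceneMatchVersion>9.x</luceneMatchVersion>
</config>"""

local_xml = """<config>
  <luceneMatchVersion>7.4.0</luceneMatchVersion>
</config>"""

diff_lines = config_diff(default_xml, local_xml)
print("\n".join(diff_lines))
```

Lines prefixed with "-" come from the default configuration, lines prefixed with "+" from ours.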
Point 6 also becomes valid again once we actually upgrade to Solr 9.x. Currently we are delivering the OCR highlighting plugin for Solr 7/8, so we also need to change that...
Do you mean by "Support for bulk importing, soft commits, etc. (for generally better performance)" something like the atomic update mechanism? (https://solr.apache.org/guide/solr/latest/indexing-guide/partial-document-updates.html#atomic-updates)
If not, it would be something worth considering for your list. Updating only those (for the most part dynamic) metadata fields that have actually been changed in the backend, instead of reindexing whole documents including the fulltexts, would give a huge performance boost.
The change of field type would not interfere, as it's already covered in the schema (usi, uui, tsi, etc.), and it would basically result in deleting one field in exchange for a new (already applicable) dynamic field. I have experimented in that direction with good results: updating one or more metadata fields for ~1,000,000 documents took only about 25-30 seconds.
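To illustrate what such an atomic update could look like, here is a rough sketch that only builds the JSON payload for Solr's update handler. The document ID and the field names (title_tsi, title_usi) are made up for illustration, and build_atomic_update is a hypothetical helper, not part of Kitodo.Presentation:

```python
import json


def build_atomic_update(doc_id: str, changed: dict, removed: list[str]) -> dict:
    """Build one atomic-update document for Solr's JSON update format.

    Fields in `changed` are replaced via the "set" modifier; fields in
    `removed` are deleted by setting them to null.
    """
    doc = {"id": doc_id}
    for field, value in changed.items():
        doc[field] = {"set": value}
    for field in removed:
        doc[field] = {"set": None}
    return doc


# Hypothetical example: swap one dynamic field for another of a different type.
update = build_atomic_update(
    "doc-123",
    changed={"title_usi": "New title"},
    removed=["title_tsi"],
)

# The body that would be POSTed to the core's /update endpoint.
payload = json.dumps([update])
print(payload)
```

Note that Solr's atomic updates rely on the untouched fields being stored or having docValues, so this would need to be checked against the schema before adopting the approach.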
Do you mean by "Support for bulk importing, soft commits, etc. (for generally better performance)" something like the atomic update mechanism?
This article describes the difference between hard and soft commits. Currently, Kitodo.Presentation does a hard commit (autoCommit is true) after every indexing run, which is a very expensive operation and not really necessary. The proposal is to switch to soft commits as the new standard procedure to speed up indexing and to introduce an "optimize" scheduler task to make sure that segmentation doesn't become an issue.
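As a rough sketch of the two kinds of requests involved, the snippet below builds update URLs using Solr's softCommit and optimize request parameters. The host, the core name "dlf", and the helper update_url are assumptions for illustration only; how Kitodo.Presentation would actually issue these requests is not shown here:

```python
from urllib.parse import urlencode


def update_url(base_url: str, core: str, **params: str) -> str:
    """Build a URL for a core's update handler with the given parameters."""
    query = urlencode(params)
    return f"{base_url}/{core}/update" + (f"?{query}" if query else "")


BASE = "http://localhost:8983/solr"  # assumed local Solr instance
CORE = "dlf"  # hypothetical core name

# Used after an indexing run: makes changes searchable without a full flush.
soft_commit_url = update_url(BASE, CORE, softCommit="true")

# Used by a periodic scheduler task: merges index segments.
optimize_url = update_url(BASE, CORE, optimize="true")

print(soft_commit_url)
print(optimize_url)
```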
Do we have to keep Solr 8 compatibility (point 2/6)?
Three additional questions:
In addition to my following answers please have a second look at the "Leistungsbeschreibung" for the Development Fund 2023, especially regarding the bulk imports.
Solr Improvements
solrconfig.xml to be compatible with newest version (#1122)
solrconfig.xml for better performance and less deviation from default configuration (#1122)
select request handler for autocompletion (#1289)