kitodo / kitodo-presentation

Kitodo.Presentation is a feature-rich framework for building a METS- or IIIF-based digital library. It is part of the Kitodo Digital Library Suite.
https://kitodo.org
GNU General Public License v3.0

[FUND] Solr Improvements #454

Open sebastian-meyer opened 4 years ago

sebastian-meyer commented 4 years ago

Solr Improvements

csidirop commented 1 year ago

I believe https://github.com/kitodo/kitodo-presentation/pull/892 addressed point 2 ("Update Solr configuration") and point 6 ("Newest version of ocr-ocrhighlighting").

sebastian-meyer commented 1 year ago

You are right about point 6! I forgot to tick the checkbox above.

But point 2 is still valid, I believe. solrconfig.xml states version 7.4.0 while the most current version of Solr is 9.1.1. A diff between the default solrconfig.xml from Solr 9.1.1 and our solrconfig.xml shows some differences which should be examined in order to maintain compatibility with the newest version.

sebastian-meyer commented 1 year ago

Even point 6 becomes valid again once we actually upgrade to Solr 9.x. Currently we ship the OCR highlighting plugin for Solr 7/8, so we would need to change that as well...

michaelkubina commented 1 year ago

Do you mean by "Support for bulk importing, soft commits, etc. (for generally better performance)" something like the atomic update mechanism? (https://solr.apache.org/guide/solr/latest/indexing-guide/partial-document-updates.html#atomic-updates)

If not, it would be something worth considering for your list. Updating only those (for the most part dynamic) metadata fields that have actually been changed in the backend, instead of reindexing whole documents including the fulltexts, would give a huge performance boost.

A change of field type would not interfere, as it's already covered in the schema (usi, uui, tsi, etc.) and would basically result in deleting one field in exchange for a new (already applicable) dynamic field. I have experimented in that direction with good results. Updating one or more metadata fields for ~1,000,000 documents took only about 25-30 seconds.
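
The partial update described above could be sketched roughly like this in Python. It only builds the JSON body in Solr's atomic-update syntax (`set` with a value replaces a field, `set` with `null` removes it). The helper name, the field names `title_usi` and `title_tsi`, and the uniqueKey name `id` are assumptions for illustration and are not taken from the Kitodo.Presentation schema.

```python
import json


def build_atomic_update(doc_id, changed, removed=()):
    """Build one Solr atomic-update document.

    changed: mapping of field name -> new value (sent as {"set": value})
    removed: field names to drop from the document (sent as {"set": None})
    """
    doc = {"id": doc_id}  # assumption: the schema's uniqueKey field is "id"
    for field, value in changed.items():
        doc[field] = {"set": value}
    for field in removed:
        doc[field] = {"set": None}  # serialized as null, which removes the field
    return doc


# Hypothetical example: a title moves from one dynamic field to another.
payload = [build_atomic_update("doc-1", {"title_usi": "New title"}, removed=["title_tsi"])]
body = json.dumps(payload)  # POST this to <solr>/<core>/update as application/json
```

One caveat worth checking before relying on this: Solr can only apply atomic updates if the document's other fields are stored or have docValues, so the fulltext fields would need a look.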

sebastian-meyer commented 1 year ago

> Do you mean by "Support for bulk importing, soft commits, etc. (for generally better performance)" something like the atomic update mechanism?

This article describes the difference between hard and soft commits. Currently, Kitodo.Presentation does a hard commit (autoCommit is true) after every indexing run, which is a very expensive operation and not really necessary. The proposal is to switch to soft commits as the new standard procedure to speed up indexing, and to introduce an "optimize" scheduler task to make sure that segmentation doesn't become an issue.
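
To make the three operations concrete, here is a minimal sketch of how they differ at the HTTP level. `commit`, `softCommit` and `optimize` are request parameters of Solr's update handler; the helper name, the base URL and the core name `dlf` are placeholders and not necessarily what an installation uses.

```python
from urllib.parse import urlencode

# Request parameters understood by Solr's update handler.
COMMIT_PARAMS = {
    "hard": {"commit": "true"},        # write the index to disk and open a new searcher
    "soft": {"softCommit": "true"},    # make changes searchable without syncing to disk
    "optimize": {"optimize": "true"},  # merge index segments; expensive, run rarely
}


def update_url(solr_base, core, mode):
    """Return the update URL that triggers the given commit mode."""
    query = urlencode(COMMIT_PARAMS[mode])
    return f"{solr_base.rstrip('/')}/{core}/update?{query}"


soft_url = update_url("http://localhost:8983/solr", "dlf", "soft")
optimize_url = update_url("http://localhost:8983/solr", "dlf", "optimize")
```

A scheduler task for "optimize" would then only need to request the last URL at whatever interval turns out to be sensible.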

frank-ulrich-weber commented 10 months ago

Do we have to keep Solr 8 compatibility (point 2/6)?

frank-ulrich-weber commented 10 months ago

Three additional questions:

  1. The CLI commands can be directly configured and executed using the scheduler. Is it still required to create separate scheduler tasks for that?
  2. If we use the REST API to update the schema, we should still deploy a managed-schema.xml; otherwise the schema changes live only in the code and not in a readable schema.xml file. All the other files around it, like stopwords.txt, solrconfig.xml, ..., still have to be deployed manually, since the REST API cannot deploy complete files. Individual changes might also be overwritten by an update... I think as long as there is no need to switch to a managed schema, we should stay with what we have now?
  3. What exactly is meant by bulk import?
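
For reference on question 2, a schema change through Solr's Schema API is a small JSON command posted to the core's `schema` endpoint. The sketch below only builds such an `add-field` command; the helper name, the field name `example_field` and the type `string` are made up for illustration and would have to match types that actually exist in the schema.

```python
import json


def add_field_command(name, field_type, stored=True, indexed=True):
    """Build the JSON body for a Schema API "add-field" command."""
    return {
        "add-field": {
            "name": name,
            "type": field_type,
            "stored": stored,
            "indexed": indexed,
        }
    }


# Hypothetical field; POST the body to <solr>/<core>/schema as application/json.
schema_body = json.dumps(add_field_command("example_field", "string"))
```

This only works when the core uses a managed schema; with a classic, hand-edited schema file the Schema API is read-only.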

sebastian-meyer commented 10 months ago

In addition to my following answers please have a second look at the "Leistungsbeschreibung" for the Development Fund 2023, especially regarding the bulk imports.

  1. The idea was to not "optimize" the index after each index-run (because that is quite time consuming), but instead have a scheduler task do it regularly. But currently there is no CLI command to optimize the index, so that needs to be implemented. Also, while creating your own scheduler tasks using the existing CLI commands is possible, it's much more convenient to have them prepared and provided by the extension. Have a look at https://docs.typo3.org/c/typo3/cms-scheduler/main/en-us/DevelopersGuide/CreatingTasks/Index.html on how to create tasks in an extension. (But of course the easiest way to achieve this is to just create tasks as wrappers for the CLI commands!)
  2. Sure, there are some pros and cons of having a managed schema vs. using a static one. Also, I don't think switching to a managed schema is strictly necessary, but it would allow us to provide automatic update scripts for Solr if a new version of Kitodo.Presentation comes with schema changes. Since this feature is part of the development fund, I can't just drop it as a release manager. This needs to be discussed with and decided by the board members.
  3. The main issue with indexing large numbers of documents is that we are currently using hard commits exclusively. This is a safe way of indexing but also much slower than soft commits. Going forward, while hard commits should remain the standard for indexing single documents, we should switch to soft commits when re-indexing collections. Also, there should be a CLI option to use soft commits for single documents as well, in case somebody wants to import documents in bulk by script.
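
As a rough illustration of point 3, a bulk import could split the documents into batches, soft-commit each batch and finish with a single hard commit. The function name and the batch size are arbitrary, and ending with a hard commit is an assumption of this sketch rather than something decided above.

```python
def plan_bulk_import(doc_ids, batch_size=500):
    """Yield (batch, params) pairs: soft commit per batch, hard commit on the last one."""
    batches = [doc_ids[i:i + batch_size] for i in range(0, len(doc_ids), batch_size)]
    for index, batch in enumerate(batches):
        is_last = index == len(batches) - 1
        # Only the final batch triggers the expensive hard commit.
        params = {"commit": "true"} if is_last else {"softCommit": "true"}
        yield batch, params


# Five hypothetical document IDs in batches of two.
plan = list(plan_bulk_import(["a", "b", "c", "d", "e"], batch_size=2))
```

Each `params` dictionary would be sent as request parameters along with the update request for its batch.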