kitodo / kitodo-production

Kitodo.Production is a workflow management tool for mass digitization and is part of the Kitodo Digital Library Suite.
http://www.kitodo.org/software/kitodoproduction/
GNU General Public License v3.0

Indexing is failing on many structure and meta data element #4724

Open henning-gerhardt opened 2 years ago

henning-gerhardt commented 2 years ago

Indexing a process with many structure elements (> 450) and metadata elements (> 2960) fails with:

[ERROR] 2021-10-07 11:58:09,510 [I/O dispatcher 7] org.kitodo.data.elasticsearch.index.ResponseListener - failure in bulk execution:
[1714]: index [kitodo_process], type [_doc], id [367280], message [ElasticsearchException[Elasticsearch exception [type=mapper_parsing_exception, reason=The number of nested documents has exceeded the allowed limit of [10000]. This limit can be set by changing the [index.mapping.nested_objects.limit] index level setting.]]]

The process in question is already published at https://digital.slub-dresden.de/id1685679609; the issue occurred while re-indexing the process data.

markusweigelt commented 2 years ago

@henning-gerhardt @Kathrin-Huber

The parameter "index.mapping.nested_objects.limit" seems to have been added in Elasticsearch 7.0. https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking-changes-7.0.html#limit-number-nested-json-objects

The question is whether the default is too low for our purposes. Depending on the available server resources, it could be increased. Even a value of, for example, 30000 still protects against memory errors in a powerful environment. It is not without reason that this is a parameter! ;)

Nevertheless, we should look at where we can optimize the source code in order to avoid memory errors.

As a quick solution, I would recommend adjusting the parameter if there are enough resources. If there are known optimizations, we should create an issue to improve indexing.

matthias-ronge commented 2 years ago

From the ElasticSearch documentation on the nested field type:

The nested type is a specialised version of the object data type that allows arrays of objects to be indexed in a way that they can be queried independently of each other.

So if we never query the objects independently of each other, we don’t need a nested type here. If I understand the manual correctly, this type internally creates one index object per entry of the nested structure. This makes sense where you are interested not only in whether a token is found within a record, but also where. For example, when indexing the full OCR text of a book, you want the search result to tell you on which page a token was found. In that case, the nested type must be set up so that there is one nested object per page.

As far as I know, we do not use such information in our context, so this isn’t necessary at the moment. However, removing the nested type from the index would take this possibility away from us in the future, and it might have been intended. I cannot comment on that, since I don’t know of any documentation of the index profile.
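To make the distinction concrete, here is a minimal sketch of the two mapping variants as request bodies. The field name "meta" is a placeholder chosen for illustration and is not taken from Kitodo's actual index mapping. According to the Elasticsearch documentation, each entry of a nested array is indexed as a separate hidden document, and those are what index.mapping.nested_objects.limit counts.

```shell
#!/bin/sh
# Hypothetical field name "meta"; Kitodo's real mapping may look different.

# Variant 1: every array entry under "meta" becomes its own hidden document,
# so entries can be queried independently of each other.
nested_mapping() {
  cat <<'EOF'
{ "properties": { "meta": { "type": "nested" } } }
EOF
}

# Variant 2: entries are flattened into the parent document. No hidden
# documents are created, but per-entry matching is no longer possible.
object_mapping() {
  cat <<'EOF'
{ "properties": { "meta": { "type": "object" } } }
EOF
}
```

Note that the type of an existing field cannot be changed in place, so switching variants would mean creating a new index and re-indexing.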

henning-gerhardt commented 2 years ago

@markusweigelt As far as I understand, this parameter influences the behavior on the server side, not on the client side. I would suggest making it configurable through the kitodo_config.properties file, with a good explanation of the parameter and of when it should and should not be changed. The remaining question for me is: how big is the influence of this parameter if we must change it from 10,000 to 30,000 or more? How many more resources (RAM, disk space, ...) are needed?

matthias-ronge commented 2 years ago

As I understand it, the parameter is in the Elasticsearch configuration file. If so, Production could just call a sudo script that edits the configuration file and restarts Elasticsearch. But do we need that?

matthias-ronge commented 2 years ago

How many more resources (RAM, disk space, ...) are needed?

It's hard to say in general, but as you can see, a separate index entry is created for each structure element × each metadata entry. For the example document, over 10,000 index records are created, which is why the error occurs. I think it is possible to increase the parameter a bit for now, but the error indicates improper use of the search engine in our implementation.

markusweigelt commented 2 years ago

[...] The remaining question for me is: how big is the influence of this parameter if we must change it from 10,000 to 30,000 or more? How many more resources (RAM, disk space, ...) are needed?

I think there are many variables here (RAM, disk space, incoming data volume) that influence the behavior. If we want to know exactly, we would have to use Elasticsearch in conjunction with Kibana, Grafana, or similar, if that is possible in the free version of Elasticsearch. Then we could change the parameter and monitor the effect.

I think the parameter's default is based on the minimum requirements of Elasticsearch. In theory, if we put these in relation to our available resources, we could raise the value accordingly. I cannot currently find out why the default is 10000 or how this value was determined. It may also be too low in general.
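Before changing anything, it may help to check which value is actually in effect for the index. The sketch below only builds the URL for the index settings API. The host is a placeholder, and the include_defaults query parameter asks Elasticsearch to also report settings that were never set explicitly, which should cover the 10000 default.

```shell
#!/bin/sh
# Build the URL for reading a single index setting, including defaults.
# Both arguments are placeholders supplied by the caller.
nested_limit_url() {
  host="$1"
  index="$2"
  printf '%s/%s/_settings/index.mapping.nested_objects.limit?include_defaults=true\n' \
    "$host" "$index"
}

# Usage (needs a reachable Elasticsearch node):
#   curl -s "$(nested_limit_url http://localhost:9200 kitodo_process)"
```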

henning-gerhardt commented 2 years ago

Setting the parameter "index.mapping.nested_objects.limit" to 30000 with a request like

curl -XPUT "<es-host>:9200/kitodo_process/_settings" -H 'Content-Type: application/json' -d '{ "index.mapping.nested_objects.limit" : 30000 }'

temporarily solved the issue, until the Elasticsearch index is destroyed.

This value must be set after the mapping has been created in Elasticsearch but before indexing of processes starts; otherwise everything has to be redone. The parameter should be set from within the application instead of by running a curl command on the Elasticsearch server.
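Until the application handles this itself, the curl command above could be wrapped in a small script that runs between index creation and indexing. This is only a sketch: the function names are made up for illustration, and the host and index name are placeholders. Since this is an index-level setting, it can also be passed in the "settings" section of the create-index request. Whether that fits the way Production creates its index is an open assumption.

```shell
#!/bin/sh
# JSON body for updating the limit on an existing index.
nested_limit_body() {
  printf '{ "index.mapping.nested_objects.limit": %d }\n' "$1"
}

# JSON body for creating an index with the limit already in place.
# The mapping is deliberately omitted here and would have to be added.
create_index_body() {
  printf '{ "settings": { "index.mapping.nested_objects.limit": %d } }\n' "$1"
}

# Apply the limit to an existing index (same request as the curl call above).
set_nested_limit() {
  host="$1"
  index="$2"
  limit="$3"
  curl -s -X PUT "$host/$index/_settings" \
    -H 'Content-Type: application/json' \
    -d "$(nested_limit_body "$limit")"
}

# Usage (needs a reachable Elasticsearch node):
#   set_nested_limit http://localhost:9200 kitodo_process 30000
```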