henning-gerhardt opened this issue 3 years ago
@henning-gerhardt @Kathrin-Huber
The parameter `index.mapping.nested_objects.limit` seems to have been added as of ElasticSearch version 7.0. https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking-changes-7.0.html#limit-number-nested-json-objects
The question is whether the default is too low for our purposes. Depending on the available server resources, the limit could also be increased; even with a value of, for example, 30000, it still protects against memory errors in a powerful environment. It is not without reason that this is a configurable parameter! ;)
Nevertheless, we should look at where we can optimize the source code in order to avoid memory errors.
As a quick solution, I would recommend adjusting the parameter if there are enough resources. If there are known optimizations, we should create an issue to improve indexing.
From the ElasticSearch documentation on the nested field type:
> The `nested` type is a specialised version of the `object` data type that allows arrays of objects to be indexed in a way that they can be queried independently of each other.
So, if we don’t query the objects independently of each other, we don’t need a `nested` type here. If I understand the manual correctly, this type internally creates one index object per entry of the nested structure. That makes sense in cases where you are not only interested in whether a token is found within a record, but also where it is found. For example: when indexing the full OCR text of a book, you want to know from the search result on which page of the book a token was found. In that case, the nested type must be formed so that there is one nested object per page.
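To make this concrete, here is a minimal sketch of such a page-level mapping; the index and field names (`book`, `pages`, `number`, `text`) are assumptions for illustration only, not part of the Kitodo index profile:

```sh
# Sketch: one nested object per page of the OCR full text, so that
# hits can be located on a specific page of the book.
curl -XPUT "<es-host>:9200/book" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "pages": {
        "type": "nested",
        "properties": {
          "number": { "type": "integer" },
          "text":   { "type": "text" }
        }
      }
    }
  }
}'
```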
As, to my knowledge, we do not use such information in our context, this isn’t necessary at the moment. However, removing the `nested` field from the index would take this possibility away from us in the future, which might have been intended. I cannot say anything about that, since I don’t know of any documentation of the index profile.
@markusweigelt As far as I understand it, this parameter influences the behavior on the server side, not on the client side. I would suggest making it configurable through the `kitodo_config.properties` file, including a good explanation of what the parameter does and when it should (and should not) be changed. The remaining question for me is: how big is the impact if we have to raise it from 10,000 to 30,000 or more? How many more resources (RAM, disk space, ...) are needed?
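A minimal sketch of what such an entry could look like; the key name is an assumption, not an existing Kitodo setting:

```properties
# Maximum number of nested JSON objects ElasticSearch accepts per indexed
# document (server-side setting index.mapping.nested_objects.limit).
# Only increase this if the ElasticSearch server has enough resources;
# the default of 10000 protects against out-of-memory errors.
elasticsearch.index.mapping.nestedObjectsLimit=10000
```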
As I understand it, the parameter lives in the ElasticSearch configuration. If so, Production could just call a `sudo` script that edits the configuration and restarts ElasticSearch. But do we need that?
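One thing to keep in mind: index-level settings like this one generally cannot be set in elasticsearch.yml on recent ElasticSearch versions, so such a script would probably have to go through the settings API instead. A rough sketch, with host and index name as assumptions:

```sh
#!/bin/sh
# Sketch: raise the nested objects limit on the kitodo_process index.
# Usage: ./raise-nested-limit.sh [host] [limit]
ES_HOST="${1:-localhost}"
LIMIT="${2:-30000}"
curl -XPUT "http://${ES_HOST}:9200/kitodo_process/_settings" \
  -H 'Content-Type: application/json' \
  -d "{ \"index.mapping.nested_objects.limit\": ${LIMIT} }"
```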
> How many more resources (RAM, disk space, ...) are needed?
It's hard to say in general, but as you can see, a separate index entry is created for each structure element × each metadata entry. For the example document, over 10,000 index records are created, which is why the error occurs. I think it is possible to increase the parameter a bit for now, but it points to an improper implementation of the search engine usage.
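As a rough illustration of how quickly the default is reached (the per-element count is an assumption, only loosely derived from the example document):

```sh
# ~450 structure elements, each with roughly 23 nested metadata entries,
# already exceed the default limit of 10000 nested objects:
echo $((450 * 23))   # 10350 > 10000
```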
> [...] The remaining question for me is: how big is the impact if we have to raise it from 10,000 to 30,000 or more? How many more resources (RAM, disk space, ...) are needed?
I think there are many adjusting screws here (RAM, disk space, volume of incoming data) that influence the behavior. If we want to know exactly, we would have to run ElasticSearch in conjunction with Kibana or Grafana etc., if that is possible in the free version of ElasticSearch. Then we could change the parameter and monitor its influence.
I think the default is based on the minimum system requirements of ElasticSearch. If we theoretically put those in relation to our available resources, we could raise the limit accordingly. I currently cannot find out why this parameter has a default value of 10000 or how that value was determined. It may also simply be too low in general.
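To find out which value is currently in effect for an index, including defaults, one could query the settings API (host and index name as in the examples in this thread):

```sh
# Show explicit and default mapping limits for the kitodo_process index:
curl -XGET "<es-host>:9200/kitodo_process/_settings?include_defaults=true&filter_path=*.settings.index.mapping,*.defaults.index.mapping"
```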
Setting the parameter `index.mapping.nested_objects.limit` to 30000 via something like
```sh
curl -XPUT "<es-host>:9200/kitodo_process/_settings" -H 'Content-Type: application/json' \
  -d '{ "index.mapping.nested_objects.limit": 30000 }'
```
solved the issue temporarily, until the ElasticSearch index gets destroyed.
This value must be set after creating the mapping inside ElasticSearch but before starting to index processes; otherwise you have to redo everything again. Setting this parameter should be done inside the application instead of running a curl command against the ElasticSearch server.
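One way to avoid losing the setting when the index is recreated would be for the application to pass it directly at index-creation time; a sketch, again assuming the index name `kitodo_process`:

```sh
# Create the index with the raised limit already in place, before any
# mapping or documents are added:
curl -XPUT "<es-host>:9200/kitodo_process" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.mapping.nested_objects.limit": 30000
  }
}'
```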
Indexing a process with a lot of structure elements (> 450) and metadata elements (> 2960) fails because the nested objects limit of 10,000 is exceeded.
The mentioned process is already available under https://digital.slub-dresden.de/id1685679609, as this issue happened while re-indexing the process data.