openaire / iis

Information Inference Service of the OpenAIRE system
Apache License 2.0

Change the textUnionBlockSize default value from 32m to 64m #1441

Closed marekhorst closed 8 months ago

marekhorst commented 8 months ago

This property was introduced as an outcome of #991 and was considered the best way to slice text data into more chunks before sending it to text mining, in order to quadruple the number of tasks (thus reducing the execution time of a single task by a factor of 4).

This worked pretty well until the volume of text data grew to the point where the number of files with publication texts in a DocumentText avro datastore exceeded the 32k threshold, which is also the limit on the number of open files on the data nodes.
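To illustrate the arithmetic, here is a rough sketch of how the block size relates to the number of output files. It assumes that the file count is roughly the total text volume divided by the block size, that "32k" means 32768, and that the datastore holds 1.5 TiB of text; the volume is a made-up figure and the helper names are hypothetical, not part of IIS.

```python
# Rough estimate only: assumes one output file per block-sized chunk of text.

UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}


def parse_size(text: str) -> int:
    """Convert a size such as '64m' into bytes."""
    suffix = text[-1].lower()
    if suffix in UNITS:
        return int(text[:-1]) * UNITS[suffix]
    return int(text)


def estimated_chunks(total_bytes: int, block_size: str) -> int:
    """Ceiling division of the total volume by the block size."""
    size = parse_size(block_size)
    return -(-total_bytes // size)


OPEN_FILES_LIMIT = 32 * 1024      # assumed meaning of the "32k" limit
TOTAL_TEXT_BYTES = 1536 * 1024**3  # hypothetical volume: 1.5 TiB

for block_size in ("32m", "64m"):
    chunks = estimated_chunks(TOTAL_TEXT_BYTES, block_size)
    status = "over" if chunks > OPEN_FILES_LIMIT else "within"
    print(f"{block_size}: ~{chunks} files, {status} the limit")
```

Under these assumptions, doubling the block size halves the estimated file count, which is enough to bring it back below the open files limit at the cost of longer individual tasks.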

The problem was originally discovered and described in this Redmine ticket note.

The new textUnionBlockSize value (64m) has already been overridden, as a hot-fix, in the default-config.xml file for the most recent IIS deployments, but we should also update it in the IIS workflow definition.