This property was introduced as an outcome of #991 and was considered the best way to slice text data into more chunks before sending it to text mining, in order to quadruple the number of tasks (and thus cut the execution time of a single task by a factor of 4).
This worked well until the volume of text data grew to the point where the number of files with publication texts in the DocumentText avro datastore exceeded the 32k threshold, which is also the limit on the number of open files on the data nodes.
The problem was originally discovered and described in this Redmine ticket note.
The new textUnionBlockSize parameter value (64m) has already been overridden, as a hot-fix, in the default-config.xml file for the most recent IIS deployments, but we should also update it in the IIS workflow definition.
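
For illustration, the hot-fix override could look roughly like the fragment below. This is only a sketch: it assumes default-config.xml follows the usual Hadoop-style `<configuration>`/`<property>` layout, which is not confirmed here. Only the property name and the value come from the description above.

```xml
<configuration>
    <!-- Assumed layout: Hadoop-style property entry.
         Only the name (textUnionBlockSize) and value (64m) are taken from this issue. -->
    <property>
        <name>textUnionBlockSize</name>
        <value>64m</value>
    </property>
</configuration>
```

The same value would then need to become the default in the IIS workflow definition, so that deployments no longer depend on the override.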