Baseform: memory optimization

ThaDafinser commented 7 years ago

When I use the baseform plugin on a large number of documents (> 1,000,000), I get this error:

[2017-04-06T07:28:07,712][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [ultimate-1] fatal error in thread [elasticsearch[ultimate-1][clusterService#updateTask][T#1]], exiting
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3236) ~[?:1.8.0_121]
    at org.xbib.elasticsearch.common.fsa.FSABuilder.expandBuffers(FSABuilder.java:468) ~[?:?]
    at org.xbib.elasticsearch.common.fsa.FSABuilder.serialize(FSABuilder.java:418) ~[?:?]
    at org.xbib.elasticsearch.common.fsa.FSABuilder.freezeState(FSABuilder.java:352) ~[?:?]
    at org.xbib.elasticsearch.common.fsa.FSABuilder.add(FSABuilder.java:204) ~[?:?]
    at org.xbib.elasticsearch.common.fsa.Dictionary.loadLines(Dictionary.java:43) ~[?:?]
    at org.xbib.elasticsearch.index.analysis.baseform.BaseformTokenFilterFactory.createDictionary(BaseformTokenFilterFactory.java:39) ~[?:?]
    at org.xbib.elasticsearch.index.analysis.baseform.BaseformTokenFilterFactory.<init>(BaseformTokenFilterFactory.java:27) ~[?:?]
    at org.xbib.elasticsearch.plugin.bundle.BundlePlugin$$Lambda$379/386311625.get(Unknown Source) ~[?:?]
    at org.elasticsearch.index.analysis.AnalysisRegistry.buildMapping(AnalysisRegistry.java:361) ~[elasticsearch-5.3.0.jar:5.3.0]
    at org.elasticsearch.index.analysis.AnalysisRegistry.buildTokenFilterFactories(AnalysisRegistry.java:171) ~[elasticsearch-5.3.0.jar:5.3.0]
    at org.elasticsearch.index.analysis.AnalysisRegistry.build(AnalysisRegistry.java:155) ~[elasticsearch-5.3.0.jar:5.3.0]
    at org.elasticsearch.index.IndexService.<init>(IndexService.java:145) ~[elasticsearch-5.3.0.jar:5.3.0]
    at org.elasticsearch.index.IndexModule.newIndexService(IndexModule.java:363) ~[elasticsearch-5.3.0.jar:5.3.0]
    at org.elasticsearch.indices.IndicesService.createIndexService(IndicesService.java:427) ~[elasticsearch-5.3.0.jar:5.3.0]
    at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:392) ~[elasticsearch-5.3.0.jar:5.3.0]
    at org.elasticsearch.cluster.metadata.MetaDataCreateIndexService$1.execute(MetaDataCreateIndexService.java:364) ~[elasticsearch-5.3.0.jar:5.3.0]
    at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:45) ~[elasticsearch-5.3.0.jar:5.3.0]
    at org.elasticsearch.cluster.service.ClusterService.executeTasks(ClusterService.java:679) ~[elasticsearch-5.3.0.jar:5.3.0]
    at org.elasticsearch.cluster.service.ClusterService.calculateTaskOutputs(ClusterService.java:658) ~[elasticsearch-5.3.0.jar:5.3.0]
    at org.elasticsearch.cluster.service.ClusterService.runTasks(ClusterService.java:617) ~[elasticsearch-5.3.0.jar:5.3.0]
    at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:1117) ~[elasticsearch-5.3.0.jar:5.3.0]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:544) ~[elasticsearch-5.3.0.jar:5.3.0]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) ~[elasticsearch-5.3.0.jar:5.3.0]
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) ~[elasticsearch-5.3.0.jar:5.3.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_121]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_121]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]

jprante commented 7 years ago

Thanks for the report.

It's not a leak, but the FSA, the way it is implemented, is quite memory hungry when an index is created. I will investigate how to reduce its memory footprint.

ThaDafinser commented 7 years ago

I have just tried building a lot of indices with the settings below.

Even without using the filter explicitly, the memory still seems to be required (I got the same exception).

Update: I'm now going to recreate all indices without the baseform filter defined and watch whether the plugin still crashes ES.

ThaDafinser commented 7 years ago

Even without using the baseform filter (I also removed the filter definition), there still seems to be a memory problem.

@jprante I only created 24 MB of indices/documents in total, but the JVM heap is full and Kibana is running into timeouts again.

ThaDafinser commented 7 years ago

After disabling the whole plugin, the JVM memory usage is stable.

Quick idea: load the FSA only once, at ES startup?

ThaDafinser commented 7 years ago

@jprante Sadly I'm no Java geek, but I found this approach for Hunspell in the ES repo. They use a service, so the dictionary is only loaded once.

Maybe this would be a good idea?

https://github.com/elastic/elasticsearch/blob/ee802ad63c0f21d697a5095dd05dc6f94626ee4d/core/src/main/java/org/elasticsearch/index/analysis/HunspellTokenFilterFactory.java#L44
https://github.com/elastic/elasticsearch/blob/ee802ad63c0f21d697a5095dd05dc6f94626ee4d/core/src/main/java/org/elasticsearch/indices/analysis/HunspellService.java
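
To illustrate the "load only once" idea, here is a minimal sketch of a node-level cache that each per-index filter factory could ask for its dictionary, instead of building a new FSA in its constructor. This is not the plugin's code and not the HunspellService API: the class name `SharedDictionaryCache`, the `loader` function, and the `loadCount()` helper are all hypothetical and only show the pattern.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/**
 * Hypothetical node-level cache: one dictionary instance per language,
 * shared by every index instead of being rebuilt per index.
 * D stands in for whatever dictionary/FSA type the plugin uses.
 */
final class SharedDictionaryCache<D> {

    private final Map<String, D> cache = new ConcurrentHashMap<>();
    private final Function<String, D> loader;            // assumed: builds the dictionary for a language
    private final AtomicInteger loads = new AtomicInteger();

    SharedDictionaryCache(Function<String, D> loader) {
        this.loader = loader;
    }

    /** Returns the cached dictionary, loading it on first request only. */
    D get(String language) {
        // computeIfAbsent on ConcurrentHashMap runs the loader at most once per key,
        // even if several indices are created concurrently.
        return cache.computeIfAbsent(language, key -> {
            loads.incrementAndGet();
            return loader.apply(key);
        });
    }

    /** Number of real loads so far (only here to make the effect observable). */
    int loadCount() {
        return loads.get();
    }
}
```

With something like this, creating many indices that use the same language would request the same key and share a single in-memory dictionary, so heap usage should no longer grow with the number of indices. Whether this fits the plugin's actual factory wiring is an open question for the maintainer.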

jprante / elasticsearch-plugin-bundle

Baseform: memory optimization #29