pelagios / peripleo2

The Pelagios Exploration Engine

"Heavily conflating" data crashes index #199

Closed by rsimon 6 years ago

rsimon commented 6 years ago

Data that causes large amounts of conflation (in other words, gazetteers) currently crashes the index if the index is already large (>500k items). I have yet to track down what the actual problem is (memory?). We will need to find a way around this.

In principle, the Peripleo code runs the ingest sequentially (one dump file at a time) and waits for each individual index operation to finish. Also, it's clearly the index that crashes, not the application, so upgrading to a newer ElasticSearch version might resolve the issue. Either way, we'll need to find out.

rsimon commented 6 years ago

This looks more like an issue with indexing records that have large/complex geometries than the conflation issue described above. I added the fix recommended at https://github.com/elastic/elasticsearch/issues/22087#issuecomment-334759601, which seems to improve things considerably.

rsimon commented 6 years ago

While the out-of-memory errors due to large geometries seem to be resolved, heavy conflation appears to be an issue as well. I just encountered it with the ToposText gazetteer dataset.

rsimon commented 6 years ago

It looks as if this is indeed not a problem on the application side. At some point, ElasticSearch simply seems to get clogged. As a workaround, I introduced a 5s wait that kicks in as soon as indexing a single item takes longer than 1s.
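
For illustration only, the workaround described above could be sketched roughly as follows. This is not the actual Peripleo code: the function `index_with_backoff`, the `index_one` callback (assumed to block until the index has acknowledged the item) and the injectable `clock`/`sleep` parameters are all hypothetical names. Only the 1s threshold and the 5s pause come from the comment.

```python
import time
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")

SLOW_THRESHOLD_S = 1.0  # from the comment: an item taking longer than 1s counts as slow
BACKOFF_S = 5.0         # from the comment: wait 5s after a slow item


def index_with_backoff(
    items: Iterable[T],
    index_one: Callable[[T], None],  # hypothetical: blocks until the item is indexed
    slow_threshold_s: float = SLOW_THRESHOLD_S,
    backoff_s: float = BACKOFF_S,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Index items one at a time and return the number of backoff pauses taken."""
    pauses = 0
    for item in items:
        started = clock()
        index_one(item)  # sequential: the next item waits for this one to finish
        elapsed = clock() - started
        if elapsed > slow_threshold_s:
            # The index looks clogged, so give it time to recover before continuing.
            sleep(backoff_s)
            pauses += 1
    return pauses
```

The clock and sleep functions are passed in only so that the behaviour can be checked without real waiting; in normal use the defaults apply and a call such as `index_with_backoff(records, my_index_call)` is enough, where `my_index_call` stands for whatever blocking index operation is available.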

Other than that, the only thing we can likely do is upgrade to a newer version of ES. (So far this was not possible due to Play framework version conflicts, but it will be one of the first major milestones in the next dev phase, fingers crossed.)