logstash-plugins / logstash-input-couchdb_changes

This plugin captures the _changes stream from a CouchDB instance
Apache License 2.0

Import very slow with a large database #24

Open FlorinAndrei opened 8 years ago

FlorinAndrei commented 8 years ago

Logstash 2.2.2, CouchDB 1.6.1, Ubuntu 14.04. I'm pushing the data into Elasticsearch, either 1.x or 2.x; it doesn't matter, the behavior is the same.

This may appear to be related to issue #11, but I believe it is actually a different problem.

Basically, with a CouchDB database of 30 million records, I can never get it to sync up. It parses the first 1 million documents quickly (thousands per second), then it slows down. By the time it's around 15 million, it's only processing about 5 documents per second. With a much larger DB it would never finish.

Restarting Logstash does not help at all (this is different from issue #11 where restart makes it fast again).

Logstash and Elasticsearch have enough memory and are not swapping. Once the import has slowed down, ES uses almost no CPU at all.

I believe this is related to the degree of parallelism. The couchdb_changes plugin (Couch to Logstash to ES), River (Couch to ES), and CouchDB replication (Couch to Couch) all use the _changes API in CouchDB, and the behavior is the same for all three: they all start out very fast, but slow down after a few million records have been imported.

The difference is that CouchDB replication by default runs with a high degree of parallelism, about 50 connections or so. I've had good results bumping it up to the following parameters, over a connection across the continent (70 ms latency):

"http_connections": 200, "worker_processes": 40, "worker_batch_size": 1000

I've done some tests with River. Setting "max_concurrent_bulk": 100 makes a huge difference: it's very fast, grabbing the first 1 million documents in 5 minutes, and only then slowing down somewhat. Bumping it up to 1000 doesn't seem to improve things much; there's probably a limit somewhere. I didn't have time today to run the big DB to the end; tomorrow I'll test it with a 30-million-document DB.

FlorinAndrei commented 8 years ago

I guess what I'm saying is: there appears to be no way to tell couchdb_changes to run many queries against CouchDB in parallel, i.e. a config item that sets the number of parallel queries. Having something like this (sketched below) would be very helpful in speeding up the initial sync-up phase.
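
To illustrate, here is roughly what I mean; db/host/port are the plugin's normal options, while parallel_queries is purely hypothetical and does not exist today:

```
input {
  couchdb_changes {
    db   => "mydb"        # example database name
    host => "localhost"
    port => 5984
    # Hypothetical knob, not an existing option:
    # parallel_queries => 8
  }
}
```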

untergeek commented 8 years ago

How do you cleanly parallelize a serial stream output? Any extra threads would likely re-read the same data, which is redundant. The _changes endpoint is something you listen to, rather than something you query. Yes, you can specify which id to start at, but that's a different thing entirely.
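
To make the point concrete, here is a minimal sketch (in Python, against a placeholder database URL) of how a _changes consumer works: it reads batches serially and checkpoints last_seq, so a second reader starting from the same sequence would simply see the same changes again:

```python
import requests

DB_URL = "http://localhost:5984/mydb"   # placeholder database URL
since = 0                               # or a previously checkpointed sequence

while True:
    resp = requests.get(f"{DB_URL}/_changes",
                        params={"since": since, "limit": 1000})
    resp.raise_for_status()
    body = resp.json()
    for change in body["results"]:
        pass  # hand each change off to the indexing pipeline here
    # The only checkpoint is last_seq; starting another reader from the same
    # `since` would just re-read the same changes.
    if body["last_seq"] == since:
        break
    since = body["last_seq"]
```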

I'm curious how you'd see this being handled in parallel.

salsa-dev commented 6 years ago

@FlorinAndrei did you ever manage to resolve this issue?