rethinkdb / logstash-input-rethinkdb


How usable is this with the release of 2.2? #1

Closed: OriginalEXE closed this issue 8 years ago

OriginalEXE commented 8 years ago

Hello,

with RethinkDB 2.2 having been released, how close is this plugin to being usable in production?

Out of the two "major" issues mentioned in the readme, the first one seems to be covered in 2.2, while the second one seems to require waiting for 2.3?

Thanks

danielmewes commented 8 years ago

Hi @OriginalEXE, we'll update the plugin in the next week or so to take advantage of includeInitial in RethinkDB 2.2. This will allow the plugin to initialize the data in ElasticSearch (or another destination) on startup, and then keep it updated and consistent.
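For illustration, a rough sketch of what includeInitial gives you, using the RethinkDB Python driver (the table name, database, and connection details here are placeholders, not the plugin's actual configuration):

```python
import rethinkdb as r  # RethinkDB Python driver (module-style API used around 2.2)

conn = r.connect(host='localhost', port=28015, db='test')

# With include_initial=True (new in RethinkDB 2.2), the changefeed first emits
# every document currently in the table and then switches to streaming live
# changes, so a single cursor can both backfill a destination and keep it updated.
feed = r.table('events').changes(include_initial=True).run(conn)
for change in feed:
    # Initial documents arrive as {'new_val': {...}}; later updates and deletes
    # also carry 'old_val'.
    print(change)
```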

The effect of the second limitation is that if the connection to the RethinkDB server is lost (because of networking issues, a scheduled server restart, hardware or software failure, etc.), the process needs to be started over from scratch in order to keep the data in ElasticSearch consistent with RethinkDB. Basically the old copy of the data in ElasticSearch has to be deleted, and then the backfill process has to be started over to re-initialize it. This will typically be fast for small to moderate data sets, but can obviously take a while for large ones.

Whether the remaining second limitation is okay for you depends on how big your data set is and whether you can tolerate a few seconds to minutes of downtime after a server restart, during which your ElasticSearch data is re-initialized.

barkerja commented 8 years ago

Basically the old copy of the data in ElasticSearch has to be deleted, and then the backfill process has to be started over to re-initialize it.

Not necessarily true. If you provide an ID in your documents when indexing the data in ES, the operation is idempotent.

From https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html: "The index API adds or updates a typed JSON document in a specific index, making it searchable."
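For example, with the Python Elasticsearch client (the index name, type, and the assumption that the RethinkDB primary key is reused as the Elasticsearch _id are all illustrative, not necessarily what the plugin does):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()  # defaults to localhost:9200

def index_document(doc):
    # Reusing the RethinkDB primary key as the Elasticsearch _id makes
    # re-indexing idempotent: replaying the same document during a backfill
    # overwrites the existing copy instead of creating a duplicate.
    es.index(index='rethinkdb-events', doc_type='event', id=doc['id'], body=doc)
```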

danielmewes commented 8 years ago

@barkerja The problem is documents that are deleted in RethinkDB. Those wouldn't appear in the backfill, and hence they would never get deleted in ElasticSearch. I think it makes sense to make the "delete everything" step of backfilling optional. In some scenarios deletions might be rare, or it might be tolerable to have a few old documents stick around.
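To make the distinction concrete, here is a rough sketch (reusing the placeholder index and key names from above) of why deletions are the sticking point: a live changefeed reports a deletion explicitly, but a backfill only re-reads rows that still exist, so a document deleted while the feed was down lingers in ElasticSearch unless the index was cleared first.

```python
def apply_change(es, change):
    if change.get('new_val') is None:
        # A live changefeed reports a deletion as {'old_val': doc, 'new_val': None},
        # so it can be mirrored by deleting the corresponding ES document.
        es.delete(index='rethinkdb-events', doc_type='event', id=change['old_val']['id'])
    else:
        # Creates, updates, and backfilled initial values are plain upserts by _id.
        doc = change['new_val']
        es.index(index='rethinkdb-events', doc_type='event', id=doc['id'], body=doc)
```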

barkerja commented 8 years ago

The problem is documents that are deleted in RethinkDB. Those wouldn't appear in the backfill, and hence they would never get deleted in ElasticSearch.

@danielmewes Ah yes, very good point. With that said, will 2.3's resumable feeds replay deleted documents?

OriginalEXE commented 8 years ago

Awesome @danielmewes, the second limitation is not a big issue for me, as I'm still in the development process, so the only dataset I have is a testing one. Thanks for the reply and your work.

danielmewes commented 8 years ago

@barkerja Resuming over server restarts will actually need a stronger variant of resumable changefeeds than the one we're planning for 2.3; that one will only help to survive connection drops. We also have plans for the stronger variant, but don't have it scheduled for a specific release yet.

deontologician commented 8 years ago

PR up at #2. The Logstash event API isn't really capable of doing a full deletion of documents in the output plugin, so all we can do is backfill on connection.

OriginalEXE commented 8 years ago

Awesome work, guys, looking forward to playing with this. I guess this issue can be closed now. Thanks!