richardwilly98 / elasticsearch-river-mongodb

MongoDB River Plugin for ElasticSearch

Is There a Setting to Configure how often ES indexes from MongoDB? #122

Open 1manStartup opened 11 years ago

1manStartup commented 11 years ago

First off, thanks for creating and maintaining this plugin. I'm currently using MongoDB to store data, and this plugin with Elasticsearch to search and filter it. I noticed that when I post an item to MongoDB, it is not immediately indexed and searchable in Elasticsearch. Is this by design for performance reasons, or is it possible to index from MongoDB each time an item is posted, or every second or so? This would be useful when building real-time apps.

richardwilly98 commented 11 years ago

What kind of latency do you get?

The plugin uses a tailable cursor on the oplog.rs capped collection, so to my knowledge indexing should be close to real time.
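As a rough illustration of what following the oplog involves, here is a minimal Python sketch. It is not the plugin's code (the plugin is written in Java). It assumes the usual oplog entry fields `ts`, `op`, `ns`, `o`, and `o2`; the helper name `oplog_entry_to_action`, the action tuples, and the sample namespace `shop.items` are hypothetical. In a real setup the entries would be read from a tailable cursor on `local.oplog.rs` rather than from a list.

```python
def oplog_entry_to_action(entry, namespace):
    """Translate one oplog entry into a (verb, doc_id, body) tuple, or None.

    Hypothetical helper for illustration only; not taken from the plugin.
    """
    if entry.get("ns") != namespace:
        return None  # entry belongs to another collection
    op = entry.get("op")
    if op == "i":  # insert: 'o' holds the full document
        doc = entry["o"]
        return ("index", doc["_id"], doc)
    if op == "u":  # update: 'o2' identifies the document, 'o' holds the change
        return ("update", entry["o2"]["_id"], entry["o"])
    if op == "d":  # delete: 'o' holds the _id of the removed document
        return ("delete", entry["o"]["_id"], None)
    return None  # other entry types (no-ops, commands) are ignored here


# Sample entries; integer 'ts' values stand in for BSON timestamps (assumption).
entries = [
    {"ts": 1, "op": "i", "ns": "shop.items", "o": {"_id": 1, "name": "pen"}},
    {"ts": 2, "op": "u", "ns": "shop.items", "o2": {"_id": 1},
     "o": {"$set": {"name": "pencil"}}},
    {"ts": 3, "op": "d", "ns": "shop.orders", "o": {"_id": 9}},
    {"ts": 4, "op": "d", "ns": "shop.items", "o": {"_id": 1}},
]
actions = [
    action
    for action in (oplog_entry_to_action(e, "shop.items") for e in entries)
    if action is not None
]
```

Because each write shows up in the oplog as soon as it is applied, a consumer that keeps such a cursor open sees changes almost immediately; any remaining delay comes from the indexing side.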

richardwilly98 commented 11 years ago

By default, ES refreshes the index every second (see the refresh_interval setting [1]), so you could see a latency of at most 1 second. If you decide to change this setting, beware of the possible performance impact at indexing time [2].

[1] - http://www.elasticsearch.org/guide/reference/api/admin-indices-update-settings/
[2] - http://blog.sematext.com/2013/07/08/elasticsearch-refresh-interval-vs-indexing-performance/
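For reference, a minimal Python sketch of changing refresh_interval through the update-settings API. The host `http://localhost:9200`, the index name `my_index`, and the helper name `build_refresh_interval_request` are assumptions for illustration. The request is only built here, not sent.

```python
import json
import urllib.request


def build_refresh_interval_request(base_url, index, interval):
    """Build (but do not send) a PUT request that updates refresh_interval.

    base_url and index are placeholders; adjust them to your cluster.
    """
    body = json.dumps({"index": {"refresh_interval": interval}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = build_refresh_interval_request("http://localhost:9200", "my_index", "5s")
# To apply it against a running node: urllib.request.urlopen(req)
```

A longer interval generally favors indexing throughput, while a shorter one makes new documents searchable sooner, which is the trade-off discussed in [2].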

vantroy commented 11 years ago

Hi Kevin,

Are you by any chance describing the case where you have a large index (a couple of gigabytes, at least) and, when you add a river for a new type, the new objects take a few minutes to index? I noticed that situation: when the new river is created, it scans the entire MongoDB oplog for entries in the referenced collection. Depending on your oplog size, that can take a while. In our development setup, where we drop and re-import a few million items daily, it can take up to 10 minutes for a new river to go live. We alleviate it by recreating the replica set and doing a mongo restore to clean the crud out of the oplog.

Please disregard if it's not your case, your post just caught my eye.

Cheers,

Rodrigo


richardwilly98 commented 11 years ago

In release 1.7.0 you can specify the datetime when documents will be indexed [1].

[1] - https://github.com/richardwilly98/elasticsearch-river-mongodb/commit/57fc1c7867a24211d2acd92ca7b50d5859ad2c27

1manStartup commented 11 years ago

Thanks for your reply, @richardwilly98. My search latency is about 10 ms, and my dataset is small. My problem was with setting up the MongoDB oplog correctly. I had several collections, and I was possibly experiencing what @vantroy mentioned above, in that it takes some time to go live. I will do some further testing to see what I was doing wrong, since I just started with ES this past week.

richardwilly98 commented 11 years ago

@1manStartup Sure, no problem. Please keep us updated.

richardwilly98 commented 11 years ago

@vantroy The initial import has changed since 1.7.1: it now uses the collection data (instead of oplog.rs). Once the initial data has been imported, the river then uses oplog.rs.
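The two-phase idea can be sketched in a few lines of Python. This is an in-memory simulation under stated assumptions, not the plugin's implementation: the function name `sync` is hypothetical, integer `ts` values stand in for BSON timestamps, `start_ts` is assumed to be the oplog position recorded when the collection copy was taken, and only inserts and deletes are replayed.

```python
def sync(collection_docs, oplog_entries, start_ts):
    """Copy the collection first, then replay oplog entries newer than start_ts."""
    # Phase 1: initial import straight from the collection data.
    index = {doc["_id"]: dict(doc) for doc in collection_docs}

    # Phase 2: follow the oplog for changes made after the copy.
    for entry in oplog_entries:
        if entry["ts"] <= start_ts:
            continue  # already reflected in the collection copy
        if entry["op"] == "i":
            index[entry["o"]["_id"]] = dict(entry["o"])
        elif entry["op"] == "d":
            index.pop(entry["o"]["_id"], None)
    return index


collection = [{"_id": 1, "name": "a"}, {"_id": 2, "name": "b"}]
oplog = [
    {"ts": 1, "op": "i", "o": {"_id": 1, "name": "a"}},
    {"ts": 2, "op": "i", "o": {"_id": 2, "name": "b"}},
    {"ts": 3, "op": "i", "o": {"_id": 3, "name": "c"}},
    {"ts": 4, "op": "d", "o": {"_id": 1}},
]
result = sync(collection, oplog, start_ts=2)
```

The point of the split is that the cost of the first phase depends on the size of the collection, not on how much history has accumulated in the oplog.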