rwynn / monstache

a go daemon that syncs MongoDB to Elasticsearch in realtime. you know, for search.
https://rwynn.github.io/monstache-site/
MIT License
1.28k stars 181 forks

Difference in document count between mongodb and elasticsearch #204

Open sameerkattel opened 5 years ago

sameerkattel commented 5 years ago

We are using monstache to sync MongoDB to Elasticsearch. We noticed a difference in document counts between MongoDB and Elasticsearch, so we ran a full sync from the beginning again. The sync has now completed, but the difference remains. What can cause a difference in document counts between MongoDB and Elasticsearch? The MongoDB count is 222632036 and the Elasticsearch count is 222629754 (a difference of 2282 documents).

rwynn commented 5 years ago

Assuming you are not doing any filtering, the likely cause is failures in indexing. These should be logged by monstache.

sameerkattel commented 5 years ago

Yes, there is no filtering. I am running monstache in a Docker container and I don't see any errors in the Docker logs.

rwynn commented 5 years ago

Are you using direct reads to sync all the collections with the indexes? This usually works for me to copy all the data.

sameerkattel commented 5 years ago

Yes, I am using direct reads to sync all the data in one collection. It used to work. For some time the counts were in sync; only lately did they start to differ. Because of that I tried syncing the data from the beginning, but that did not help.

rwynn commented 5 years ago

That's a mystery to me. It should be printing all bulk line items with errors using this callback: https://github.com/rwynn/monstache/blob/master/monstache.go#L379

Another possibility is that some data in MongoDB cannot be serialized to JSON for sending. However, that error should also be returned and eventually printed: https://github.com/rwynn/monstache/blob/master/monstache.go#L2717

Is it also listening for change events on this collection? You will need that if you are changing the collection while you are reading it in a direct-read.

rwynn commented 5 years ago

The replay option will not work as a full sync unless your oplog is very large. Usually, this would require you to increase its size via configuration. Since the oplog is a capped collection, the old data eventually gets dropped.

That is why a direct-read is better for full sync. Usually, you have monstache also listening for new changes while the direct reads are being performed.

sameerkattel commented 5 years ago

My wild guess is that it's a serialization issue, but the logs do not support that. And yes, I am doing a direct read for the full sync while also listening for change events.

thenative commented 5 years ago

MongoDB version: 4.3; Elasticsearch version: 6.8.1

I see a similar mismatch between countDocuments() across a list of MongoDB collections and the number of documents that actually get indexed in Elasticsearch.

dachuylinux commented 4 years ago

I have the same issue, and there is nothing in the error logs. I used direct read and a change-stream listener for only one collection. I'm using MongoDB 4.2 and the collection total is 22129296. After syncing the data to Elasticsearch 7.6.2, the index count is 22129286. I have tried 5 more times, but the Elasticsearch index is always missing 10 documents.

dachuylinux commented 4 years ago

I have found the problem. In MongoDB, I have 10 documents whose _id field is of type ObjectId, unlike the other ids (type String), so I think this tool's direct read from MongoDB cannot find ids of type ObjectId.

rwynn commented 4 years ago

Hi @dachuylinux, is it possible that these 10 documents have string ids that look like the hex form of an ObjectId? When monstache sends to Elasticsearch it needs to send a string, so an ObjectId is converted to a string using its hex representation. Is it possible that these 10 document ids share the same value as another id (type ObjectId) in the collection once that id is converted to hex?
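To illustrate the collision described above, here is a minimal Python sketch. It is not monstache code: the `ObjectId` class is a hypothetical stand-in for a BSON ObjectId, and `es_id` is an assumed mapping (ObjectId to hex string, everything else to its string form) based only on the description in this comment. The sample ids are made up.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectId:
    """Hypothetical stand-in for a BSON ObjectId (12 raw bytes)."""
    raw: bytes

    def hex(self) -> str:
        return self.raw.hex()


def es_id(mongo_id) -> str:
    """Assumed mapping: ObjectId -> hex string, anything else -> str()."""
    if isinstance(mongo_id, ObjectId):
        return mongo_id.hex()
    return str(mongo_id)


def colliding_ids(mongo_ids):
    """Return the string ids that more than one MongoDB _id maps to."""
    counts = Counter(es_id(i) for i in mongo_ids)
    return sorted(k for k, n in counts.items() if n > 1)


# Made-up sample: an ObjectId and a string holding the same hex value.
oid = ObjectId(bytes.fromhex("5e8f8f8f8f8f8f8f8f8f8f8f"))
mongo_ids = [oid, "5e8f8f8f8f8f8f8f8f8f8f8f", "user-42"]

distinct_in_mongo = len(set(mongo_ids))             # distinct _id values
distinct_in_es = len({es_id(i) for i in mongo_ids})  # distinct string ids
print(distinct_in_mongo, distinct_in_es, colliding_ids(mongo_ids))
```

In this sample, three distinct MongoDB ids become two distinct string ids, so one document would overwrite another in the index without any indexing error being logged. Running a check like `colliding_ids` over the real `_id` values of the collection would show whether the 10 missing documents fall into this case.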