rwynn / monstache

a go daemon that syncs MongoDB to Elasticsearch in realtime. you know, for search.
https://rwynn.github.io/monstache-site/
MIT License

Data incomplete synchronization #679

Open axpwx opened 1 year ago

axpwx commented 1 year ago

MongoDB:

db.t_posts.count()
19836016

ES:

11180032

config.toml:

mongo-url = "mongodb://10.10.0.4:27017,10.10.0.5:27017,10.10.0.6:27017/db_clogs"
elasticsearch-urls = ["http://10.10.0.14:9200"]

cluster-name = 'clogsTransport'
workers = ["worker0", "worker1", "worker2"]

direct-read-no-timeout = true
direct-read-split-max = 9
replay = false
resume = true
direct-read-concur = 0
direct-read-namespaces = ["db_clogs.t_posts"]
[[mapping]]
  namespace = "db_clogs.t_posts"
  index = "db_clogs.t_posts"

Monstache exits normally, with no exceptions in the log:

INFO 2023/04/02 15:51:43 Started monstache version 6.7.11
INFO 2023/04/02 15:51:43 Go version go1.18.9
INFO 2023/04/02 15:51:43 MongoDB go driver v1.11.3
INFO 2023/04/02 15:51:43 Elasticsearch go driver 7.0.31
INFO 2023/04/02 15:51:43 Successfully connected to MongoDB version 5.0.12
INFO 2023/04/02 15:51:43 Successfully connected to Elasticsearch version 7.10.1
INFO 2023/04/02 15:51:43 Sending systemd READY=1
INFO 2023/04/02 15:51:43 Joined cluster clogsTransport
INFO 2023/04/02 15:51:43 Starting work for cluster clogsTransport
INFO 2023/04/02 15:51:43 Listening for events
INFO 2023/04/02 15:51:43 Resuming from timestamp {T:0 I:0}
INFO 2023/04/02 15:51:43 Resuming from timestamp {T:0 I:0}

INFO 2023/04/02 16:07:30 Stopping all workers
INFO 2023/04/02 16:07:30 Shutting down
INFO 2023/04/02 16:07:32 Direct reads completed

I have re-run this many times and the result is always the same. The t_posts._id field in MongoDB holds custom hexadecimal values, for example ObjectId("00000033e5910ccd053f679c") and ObjectId("ffffff90af74b6e8e3d023ff").

axpwx commented 1 year ago

If I remove workers = ["worker0", "worker1", "worker2"] from the config, the synchronization result is correct.

rwynn commented 1 year ago

Hi @axpwx, when you run with workers enabled you need to run one monstache process per worker. With three workers configured, for example, three processes each read from MongoDB, but each process indexes only 1/3 of the documents it reads.

https://rwynn.github.io/monstache-site/advanced/#workers
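
To illustrate why a single process leaves documents unindexed when workers are configured, here is a minimal Go sketch of splitting documents among named workers. It is not monstache's actual code: `ownerOf` and `indexedBy` are hypothetical helpers, and the FNV hash modulo the worker count is an assumed stand-in for whatever assignment scheme monstache uses internally. The generated IDs are synthetic.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ownerOf returns the worker responsible for a document ID.
// Assumption: FNV-1a hash modulo the worker count, chosen only for illustration.
func ownerOf(id string, workers []string) string {
	h := fnv.New32a()
	h.Write([]byte(id))
	return workers[h.Sum32()%uint32(len(workers))]
}

// indexedBy counts how many of the given IDs a process running as
// worker `self` would index; all other IDs are read but skipped.
func indexedBy(self string, ids []string, workers []string) int {
	n := 0
	for _, id := range ids {
		if ownerOf(id, workers) == self {
			n++
		}
	}
	return n
}

func main() {
	workers := []string{"worker0", "worker1", "worker2"}

	// Synthetic 24-character hex IDs, shaped like ObjectId strings.
	ids := make([]string, 0, 9000)
	for i := 0; i < 9000; i++ {
		ids = append(ids, fmt.Sprintf("%024x", i))
	}

	total := 0
	for _, w := range workers {
		n := indexedBy(w, ids, workers)
		total += n
		fmt.Printf("%s indexes %d of %d documents\n", w, n, len(ids))
	}
	// Only when every worker has a running process is each document covered.
	fmt.Printf("all workers together: %d of %d\n", total, len(ids))
}
```

In this sketch each worker picks up only its own share, and the shares add up to the full set only when all three workers run. That matches the advice above: start one monstache process per entry in `workers`, each configured with its own worker name as described in the linked documentation, or remove `workers` when running a single process.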