rwynn / monstache

a go daemon that syncs MongoDB to Elasticsearch in realtime. you know, for search.
https://rwynn.github.io/monstache-site/

Not copying complete mongo view to ES index using direct-read-namespaces #452

Open sachinnagesh opened 3 years ago

sachinnagesh commented 3 years ago

First of all, thanks @rwynn for this amazing library for syncing data from MongoDB. I am using:

monstache 6.7.0
MongoDB 4.2.9 (image mongo:4.2.9)
Elasticsearch 7.7.0

I am facing an issue where my whole view is not getting copied completely to the ES index. The view has around 3M records. On first deployment I want to copy the view completely from MongoDB to ES and then sync any real-time operations on the Mongo view to the ES index. When I deploy monstache, it copies the view only partially yet marks it as complete in the directreads collection under the monstache database. My config looks something like this:

mongo-url =""
elasticsearch-urls =[]
direct-read-namespaces=["user-db.users_view"] #copy view to es index completely
change-stream-namespaces=["user-db.users","user-db.user_personal_details","user-db.user_address_details","user-db.user_family_details"]

gzip = true
stats = true
index-stats = true
elasticsearch-max-conns = 2
elasticsearch-max-docs = 1000
dropped-collections = false
dropped-databases = false
replay = false
resume = true
resume-write-unsafe = false
resume-name = "default"
resume-strategy = 1
file-highlighting = true
verbose = true
cluster-name = "MONSTACHE_CLUSTER"
exit-after-direct-reads = false
direct-read-split-max = -1
direct-read-stateful = true
elasticsearch-retry = true
prune-invalid-json = true

[gtm-settings]
buffer-duration = "100ms"

[[mapping]]
namespace = "user-db.users_view"
index = "user-db.users_view-index"

[[relate]]
namespace = "user-db.users"
with-namespace = "user-db.users_view"
keep-src = false

[[relate]]
namespace = "user-db.user_personal_details"
with-namespace = "user-db.users"
src-field = "_id"
match-field = "_id"
keep-src = false

[[relate]]
namespace = "user-db.user_address_details"
with-namespace = "user-db.users"
src-field = "_id"
match-field = "_id"
keep-src = false

[[relate]]
namespace = "user-db.user_family_details"
with-namespace = "user-db.users"
src-field = "_id"
match-field = "_id"
keep-src = false

I am passing the Mongo and ES config as environment variables. A similar configuration works in other environments where the data is around 200k records.

The only error in the log is the one below, and I get the same error in the other environments where everything works fine.

ERROR 2020/11/11 11:34:55 (Unauthorized) not authorized on admin to execute command { serverStatus: 1, lsid: { id: UUID("5399l9sd-bfc2-47bv-9ed5-9c33ee454af5") }, $clusterTime: { clusterTime: Timestamp(1605094489, 3), signature: { hash: BinData(0, D2D146CC709E121C6782C8211CECF4CA048D83C2), keyId: 8951002300781455875 } }, $db: "admin" }
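
Side note: the error above is the MongoDB user from mongo-url being denied the serverStatus command, which requires the built-in clusterMonitor role. A minimal grant, assuming the monstache user is named "monstacheUser" and lives in the admin database (the name is a placeholder), would be:

mongo admin --eval 'db.grantRolesToUser("monstacheUser", [{ role: "clusterMonitor", db: "admin" }])'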
rwynn commented 3 years ago

Hi @sachinnagesh, the configuration you have looks good to me given what you are trying to accomplish.

Is there any pattern to the docs that are missing? About how many docs are missing?

Any information in the monstache.stats.yyyy-mm-dd collection about errors?

You may want to try monstache v6.7.1, which includes MongoDB driver upgrades, just in case that is relevant.
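
For example, you can pull a few of the daily stats documents straight out of Elasticsearch and look at the Stats.Failed counter. A sketch, assuming ES is reachable on localhost:9200 (adjust the host and the date to your setup):

curl -s 'http://localhost:9200/monstache.stats.2020-11-11/_search?size=5&pretty'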

sachinnagesh commented 3 years ago

@rwynn Thank you for your response

Is there any pattern to the docs that are missing? About how many docs are missing? => I tried several times, clearing the ES index data and the monstache database entries each time; it copied somewhere between 800k and 1.2M records and then stopped copying.

Any information in the monstache.stats.yyyy-mm-dd collection about errors? => Yes, there is a failed-document count:

{"index":{"_index":"monstache.stats.2020-11-11"}}
{"Host":"c21ff46c2f1a","Pid":253,"Stats":{"Flushed":3130,"Committed":3573,"Indexed":400546,"Created":0,"Updated":0,"Deleted":58,"Succeeded":398066,"Failed":2538,"Workers":[{"Queued":0,"LastDuration":261000000},{"Queued":0,"LastDuration":72000000}]},"Timestamp":"2020-11-11T09:06:06"}

{"took":54,"errors":false,"items":[{"index":{"_index":"monstache.stats.2020-11-11","_type":"_doc","_id":"TW-OtnUBf9Z0FIwLKpe6","_version":1,"result":"created","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":241,"_primary_term":1,"status":201}}]}

{"index":{"_index":"monstache.stats.2020-11-11"}}
{"Host":"c21ff46c2f1a","Pid":253,"Stats":{"Flushed":3139,"Committed":3585,"Indexed":400605,"Created":0,"Updated":0,"Deleted":58,"Succeeded":398117,"Failed":2546,"Workers":[{"Queued":0,"LastDuration":368000000},{"Queued":0,"LastDuration":535000000}]},"Timestamp":"2020-11-11T09:06:38"}

{"took":39,"errors":false,"items":[{"index":{"_index":"monstache.stats.2020-11-11","_type":"_doc","_id":"aCuOtnUB7gYr86GnnyDq","_version":1,"result":"created","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":195,"_primary_term":1,"status":201}}]}
sachinnagesh commented 3 years ago

@rwynn One thing I forgot to mention: I am giving it very few resources, 1 CPU core and 1 GB memory, since I am fine with the sync taking longer and I don't want to put load on the MongoDB instance. I am also running one more monstache instance for a different purpose, with a different cluster-name, which syncs data to the same ES cluster from the same MongoDB cluster.

sachinnagesh commented 3 years ago

@rwynn I gave it 4 CPU cores and 4 GB RAM, and now it copies more records: it copied 1.7M records and then stopped copying again. I am deploying monstache in HA mode by running two containers with the same cluster-name, so only one node is active at a time. I have a doubt: if, while the view is being copied from Mongo to ES, the current active node goes into a paused state and the other node becomes active, will the new node start copying the view from the very first record again, or will it resume from the last record copied by the old active node?

rwynn commented 3 years ago

@sachinnagesh you might try without HA mode, since the code path will then be simpler and there is less chance of a deadlock. In HA mode the 2nd process will repeat the full copy of the collection unless the read is stateful and the 1st process has already marked it as complete. There is no concept of resuming a direct read in monstache, only resuming the change stream.
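
You can check whether a namespace has been marked complete by looking at the directreads collection you mentioned. A sketch, assuming the default monstache metadata database:

mongo monstache --eval 'db.directreads.find().forEach(printjson)'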

sachinnagesh commented 3 years ago

@rwynn Deploying a single node without HA worked for me. Is there some way we could achieve resume for direct reads in monstache? It would be a very nice feature to have when the data size is very large. Currently, to get HA, we either need to deploy two separate instances (one for the direct read without HA, the other for the change stream with HA), or, for the very first deployment, run monstache without HA and redeploy it with HA enabled once the full direct read is done, roughly as sketched below.
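
A minimal sketch of that split, reusing the placeholder names from my config above; the first instance does only the one-off copy and exits, while the second handles the change stream and is deployed as two replicas for HA:

# instance 1: one-off direct read, no cluster-name, run once and retire
direct-read-namespaces = ["user-db.users_view"]
exit-after-direct-reads = true

# instance 2: change stream only, two replicas sharing a cluster-name for HA
change-stream-namespaces = ["user-db.users", "user-db.user_personal_details", "user-db.user_address_details", "user-db.user_family_details"]
cluster-name = "MONSTACHE_CLUSTER"
resume = true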