rwynn / monstache

a go daemon that syncs MongoDB to Elasticsearch in realtime. you know, for search.
https://rwynn.github.io/monstache-site/
MIT License
1.28k stars 181 forks

Re-deployment Issue on Kubernetes cluster #498

Open pcinnusamy opened 3 years ago

pcinnusamy commented 3 years ago

Hi @rwynn ,

We are running the Monstache service on a Kubernetes cluster with the configuration below. Whenever we redeploy the package for a specific change, the entire package is executed as a fresh deployment, and a full sync happens every time since it's stateful. So we thought of using the "resume-name" config to create a file under the Kubernetes cluster so that Monstache resumes from the last read position. Unfortunately, we were unable to generate this file under the Kubernetes cluster, although Monstache has successfully created resume state at the collection level.

Should we use the cluster-name config in order to create the resume-name = "default" file under this cluster?

Could you please advise how to resume the service when deploying the package under a K8s cluster?

direct-read-namespaces = ["ll1.sample"]
change-stream-namespaces = ["ll1.sample"]
gzip = true
index-stats = true
elasticsearch-healthcheck-timeout-startup = 60
elasticsearch-healthcheck-timeout = 60
elasticsearch-max-conns = 10
elasticsearch-validate-pem-file = false
elasticsearch-retry = true
elasticsearch-max-docs = 1000
elasticsearch-max-seconds = 10
dropped-collections = true
dropped-databases = true
prune-invalid-json = true
resume = true
resume-strategy = 1
resume-write-unsafe = false
resume-name = "default"
exit-after-direct-reads = false
direct-read-concur = 1
verbose = true
stats = true
enable-http-server = true

[logs]
warn = "warn.log"
error = "error.log"

[[mapping]]
namespace = "ll1.sample"
index = "staging-ll1-sample-monstache"


rwynn commented 3 years ago

If I understand the issue correctly you don't want to do a full sync after each redeployment. If that is the case have you tried turning on stateful direct reads?

direct-read-stateful = true

pcinnusamy commented 3 years ago

@rwynn, thanks for the quick reply and the solution you provided. Your understanding is correct with respect to direct reads. The Monstache website explains it as: "On subsequent restarts monstache will check this collection and only start direct reads for the namespaces not in the completed list". Here we are deploying each collection into a separate K8s pod, so what we need is a check of the last read position rather than a check of whether the full sync for the collection has completed. Could you please confirm whether this will help when Monstache is running in "change streams" mode? That is, when we redeploy the package, Monstache should check the last read position and then resume the service after deployment. Thanks

rwynn commented 3 years ago

When you use resume-strategy=1 then instead of a single timestamp stored in monstache.monstache you will get separate resume tokens per stream in monstache.tokens. A stream is whatever you put in change-stream-namespaces. It might be a collection name, or a db, or empty string in the event of a single stream on the entire deployment. So in your case you would have a separate resume token per collection in monstache.tokens. The documents in this collection will look like...

 { 
   "resumeName": "default",
   "streamID":   "db.collection",
   "token":      "<token>"
 }

Monstache will read this token when it starts and restart the change stream from the last recorded position.
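
For the configuration posted at the top of this thread, token-based resume appears to come down to a handful of settings rather than anything copied from the JSON above, since monstache itself writes and reads the documents in monstache.tokens. A minimal sketch, assuming the same ll1.sample namespace and using only option names that already appear in this thread:

```toml
# Sketch only: values are taken from the config posted earlier in this thread.

# One change stream per entry; each stream gets its own document in monstache.tokens.
change-stream-namespaces = ["ll1.sample"]

# Persist the last position and read it back on startup.
resume = true
# 1 = token-based resume (one resume token per stream).
resume-strategy = 1
# Appears as "resumeName" in each token document.
resume-name = "default"

# Suggested earlier in the thread: only start direct reads for
# namespaces not already recorded as completed.
direct-read-namespaces = ["ll1.sample"]
direct-read-stateful = true
```

Because the tokens live in a MongoDB collection rather than on the pod's filesystem, this suggests that no file has to be created inside the Kubernetes cluster. Keeping resume-name unchanged across redeployments is presumably what lets a new pod find the token written by the previous one.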

pcinnusamy commented 3 years ago

Thanks for your prompt support, @rwynn. When we use resume-strategy=0, Monstache will automatically pick up the last recorded position from the timestamp and then resume the service, right? Please correct me if I am wrong. "Strategy 0 -default- Timestamp based resume of change streams. Compatible with MongoDB API 4.0+."

Also, where should I use the above code snippet if I want to go with token-based resume? Thanks
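
For comparison, the timestamp-based variant would differ only in the strategy value. A minimal sketch, assuming the description earlier in the thread that strategy 0 keeps a single timestamp in monstache.monstache instead of per-stream tokens:

```toml
# Sketch only: option names and values are taken from this thread.
change-stream-namespaces = ["ll1.sample"]

resume = true
# 0 = timestamp-based resume (the default strategy quoted above).
resume-strategy = 0
resume-name = "default"
```

In both variants the saved position is stored in MongoDB, so the settings go in the monstache config file and nothing is pasted from the token document itself.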