rwynn / monstache

a go daemon that syncs MongoDB to Elasticsearch in realtime. you know, for search.
https://rwynn.github.io/monstache-site/

Monstache fails to keep data in sync, edit: with mongo secondary #128

Closed sbnajardhane closed 5 years ago

sbnajardhane commented 5 years ago

Hi @rwynn The plugin is syncing data, but not all the time. Yesterday I ran a bulk operation on my user collection (50k user records got updated), out of which 4000 user records were not synced to ES. There is nothing in the logs. This has happened before as well. I am unable to figure out the cause; I'd appreciate it if you could help a bit. Thanks. Here is my config:

mongo-url = "mongodb://mongo_secondary_node:27017"
elasticsearch-urls = ["es-cluster-url"]
namespace-regex = 'test_db.user'
gzip = true
index-as-update = true
fail-fast = false
prune-invalid-json = true
index-oplog-time = true
resume = true
# replay = true
cluster-name = "HA"
resume-name = "HA"

[logs]
info = "/path/monstache/logs/info.log"
warn = "/path/monstache/logs/warn.log"
error = "/path/monstache/logs/error.log"
trace = "/path/monstache/logs/trace.log"

# Oplog read operation timeout setting.
[mongo-dial-settings]
timeout=15
read-timeout=0
write-timeout=0

[mongo-session-settings]
socket-timeout=0
sync-timeout=0

[[mapping]]
namespace = "test_db.user
index = "user"
type = "_doc"

[[script]]
namespace = "test_db.user"
script = """
 // some logic here
"""

Also, is there any way to check if the data is in full sync with mongo or not?

rwynn commented 5 years ago

Hi @sbnajardhane there is not a way to check if data is in full sync. You might want to do a run with

direct-read-namespaces = ['ss_db.ssfbusermapping']

This will read the entire collection and sync it.

I may need to add a debug mode, but until then you can try verbose=true and stats=true, which will give you the full request/response output in the trace log and more info in a stats log.

Finally, make sure your script is always returning a valid document. Falsy values coming out of a script indicate to monstache that you want the document dropped from the index.
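
For illustration, a minimal sketch of that contract (the status field and its value are hypothetical, not from this thread):

module.exports = function(doc) {
    if (doc.status === "archived") {
        // hypothetical example: a falsy return value tells monstache
        // to drop this document from the index
        return false;
    }
    // returning the (possibly modified) document indexes it as usual
    return doc;
}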

rwynn commented 5 years ago

@sbnajardhane you can also try the latest release. I've added some code to ensure that errors reported by Elasticsearch are logged. That may tell us what the problem is.

sbnajardhane commented 5 years ago

@rwynn Thanks for the quick response.

I will enable stats and monitor the logs. I will post here if I find any error log.

The script is just adding a few new fields and returning a doc, so the script will always return a truthy value:

module.exports = function(doc) {
    if (typeof doc.user_points != "undefined") {
        doc.user_points.user_total_points = doc.user_points.points_1 +
                                        doc.user_points.points_2 +
                                        doc.user_points.points_3;
    }
    return doc;
}

I am on version 4.11.3

sbnajardhane commented 5 years ago

Hi @rwynn

I have monitored the trace log and stats. Monstache is working fine: I got no error messages, and the plugin is listening for all the events happening to the collections. (y)

Still facing the sync issue... Here is my observation; let me try to give a few details about it. In my application I have delete-user functionality:

user = get_user()
# user's initial data
# user.field_1 = "test field_1 value"
# user.field_2 = "test field_2 value"
# user.field_3 = "active"

user.update(set__field_1 = "")
user.update(set__field_2 = 0)
user.update(set__field_3 = "deleted")

Oplog entries for this document, in this order:

"o" : {
        "$set": {"field_1": ""}
}

"o" : {
        "$set": {"field_2": 0}
}

"o" : {
        "$set": {"field_3": "deleted"}
}

In the trace log, 1st entry:

"doc": {
        "field_1": "test field_1 value",
        "field_2": "test field_2 value",
        "field_3": "deleted"
}
"res": {
       "version": 1,
       "error": false
}

2nd entry:

"doc": {
        "field_1": "",
        "field_2": "test field_2 value",
        "field_3": "active"
}
"res": {
       "version": 2,
       "error": false
}

3rd entry:

"doc": {
        "field_1": "",
        "field_2": 0,
        "field_3": "active"
}
"res": {
       "version": 3,
       "error": false
}

This is not occurring all the time, only for a few records (about 5000 documents affected per 50000, on average). I have enabled index-as-update = true, so only the updated fields should get synced, right?

So I have a couple of questions:

  1. Why are all the fields passed in the request body every time, and not only the updated ones?
  2. Why are the 2nd and 3rd update sync calls not fetching the updated values of the fields?

My assumption from the above series of events:

  1. The plugin reads the oplog entry from the db.
  2. It fetches the document from the db. (So is it possible that it is reading an old version of the document before the update has been applied?)
  3. It makes an HTTP request to push the data into the ES index.

Meanwhile, I will look into the code flow.

It would be great if we could discuss this issue over a call; it might be difficult for me to explain this here in comments. Here is my email id: sbnajardhane@gmail.com. Please share your contact details and a suitable time. (I am from India, timezone: IST)

rwynn commented 5 years ago

Monstache always sends the full document as a matter of keeping it simple. Index-as-update means that any non-overlapping fields in an existing Elasticsearch document will not be overwritten.
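
As a rough sketch of what that means on the Elasticsearch side (the id is a placeholder, and this presumably maps to the update API with a partial-doc merge; it is not a captured monstache request): fields that exist only in the stored Elasticsearch document are left alone, while the supplied fields are merged in:

POST /user/_doc/123/_update   # sketch only; the id is a placeholder
{
  "doc": {
    "field_1": "",
    "field_2": 0,
    "field_3": "deleted"
  },
  "doc_as_upsert": true
}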

rwynn commented 5 years ago

Is there any reason you are doing 3 separate updates in your application instead of putting all 3 changes of the user into a single bson update?
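
For example, a mongo shell sketch (userId is a placeholder) that folds the three $set operations from the example above into a single update, producing one oplog entry instead of three:

// sketch: one update -> one oplog entry -> one fetch-and-index in monstache
db.user.update(
    { _id: userId },
    { $set: { field_1: "", field_2: 0, field_3: "deleted" } }
)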

rwynn commented 5 years ago

To ensure that changes are applied serially you cannot use index-as-update = true. Only the default of index-as-update = false will result in version numbers being sent to Elasticsearch to ensure operations are applied serially.
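
As a rough sketch of the mechanism (the id is a placeholder; the version number echoes the trace log above, this is not a captured request): each bulk index action can carry an external version, so a stale document arriving with a lower version is rejected by Elasticsearch:

{ "index": { "_index": "user", "_type": "_doc", "_id": "123", "version": 3, "version_type": "external" } }
{ "field_1": "", "field_2": 0, "field_3": "deleted" }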

rwynn commented 5 years ago

Can you try with the following settings? I am currently investigating some strange behavior when the timeout is turned off (set to 0).

[mongo-dial-settings]
timeout=15
read-timeout=7
write-timeout=7

[mongo-session-settings]
socket-timeout=0
sync-timeout=7

sbnajardhane commented 5 years ago

Monstache always sends the full document as a matter of keeping it simple. Index-as-update means that any non-overlapping fields in an existing Elasticsearch document will not be overwritten.

So it should fetch the latest document, replace the values of the updated fields, and then write to ES. In this way the sequence of operations should not matter: even when the 3rd mongo update is executed first and the 1st mongo update last, the full document should still contain the latest values.

To ensure that changes are applied serially you cannot use index-as-update = true.

If I disable it then on update my previous data will be overwritten, right?

Is there any reason you are doing 3 separate updates in your application instead of putting all 3 changes of the user into a single bson update?

I cannot combine them, as this is just a simple example of what is happening; in the application it performs some complex operations and then applies those updates.

Can you try with the following settings?

Ok, I will try these settings.

rwynn commented 5 years ago

For the most part the sequence of operations does not matter, but if deletes are synced out of order then it causes problems.

You can check the code in rwynn/gtm for how it works. I think the func that fetches the full document on an update is called FetchDocuments. This is not used if you use change streams, though, since that's built in. Change streams require mongo 3.6+.

sbnajardhane commented 5 years ago

For the most part the sequence of operations does not matter, but if deletes are synced out of order then it causes problems.

Just a bit of confusion: I am not using a mongo delete, I am just updating a field in the document to mark it as deleted, i.e. doing a soft delete (status = "deleted"). I just took this as an example as I was debugging this functionality.

rwynn commented 5 years ago

Yes, you should be good with soft deletes. I was just adding a comment for understanding; I should have said hard delete.

sbnajardhane commented 5 years ago

@rwynn

This is not used if you use change streams, though, since that's built in. Change streams require mongo 3.6+.

I didn't get this part?

rwynn commented 5 years ago

If you are on MongoDB 3.6+ then you can use the new change streams API by setting the change-stream-namespaces option. If not, then Monstache will tail the oplog.
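
For example, a sketch using the collection from this thread:

# requires MongoDB 3.6+; monstache then uses change streams instead of tailing the oplog
change-stream-namespaces = ["test_db.user"]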

sbnajardhane commented 5 years ago

I am on mongo 3.0

sbnajardhane commented 5 years ago

I will try out the timeout settings and update here. Thanks :+1:

rwynn commented 5 years ago

I'm wondering if this could be related to using the MongoDB secondary in your connection string? I don't know, that's just a guess. I cannot seem to replicate this issue by just issuing some simple changes...

use test2;
for (var i=0;i<10000;++i) db.test.insert({foo:i});
db.test.update({}, {$set: {foo: 0}}, {multi:true});
db.test.update({}, {$set: {foo: 1}}, {multi:true});

while running monstache without any special configuration besides the connection strings.

rwynn commented 5 years ago

It's possible that the ordering of the operations could be the issue. Consider if monstache fetches the data after the 2nd update (stale data), and then quickly fetches it again after the 3rd update (current data), but the actual indexing of the stale data happens after the current data. Then you would see it out of sync.

You can fix that by turning off index-as-update (stale data would have a lower version number than current data, so it would get rejected coming in later). Or, if you are actually doing your own modifications to the data in Elasticsearch (out-of-band updates) and you need index-as-update, then you could try to mitigate the out-of-order possibility by using these settings...

elasticsearch-max-conns = 1

[gtm-settings]
buffer-size = 5000
buffer-duration = "4s"

The max-conns setting will ensure that only 1 goroutine is pushing data to Elasticsearch instead of the default 4. This will help bulk operations be serialized.

The gtm-settings say to only fetch the corresponding documents for updates after 4 seconds have passed or 5000 updates are queued up. This will help mitigate the possibility of stale data when multiple updates are performed together in quick succession. The only downside is that updates have a little more latency before showing up in Elasticsearch.

rwynn commented 5 years ago

In addition to the settings recommended, I just pushed a new release that you can try. Let me know how that goes.

sbnajardhane commented 5 years ago

Yes, monstache is connected to the mongo secondary node. We were also thinking the same: it might be possible that the data does not get written to the secondary before monstache pushes the data to ES. I have added the gtm-settings with the new build and executed the bulk operations, and it worked :+1:

We are connecting to the mongo secondary node to avoid load on the primary node. Is there any downside if I connect to the mongo primary (like high-connection alerts, or the db getting slow)? What would you recommend, connecting to primary or secondary? Or what are most people using?

mapshen commented 5 years ago

@sbnajardhane

In your case, I would read from primary.

https://docs.mongodb.com/manual/core/read-preference/ is a good reference. Whether to use the secondary read preference depends largely on your tolerance for staleness; I would use it when I have a long-running job doing heavy computations and a delay of minutes or even hours wouldn't concern me.
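
For example, a sketch (hosts and replica set name are placeholders) that points monstache at the replica set with an explicit read preference, instead of pinning a single secondary node:

# sketch: let the driver route reads according to readPreference
mongo-url = "mongodb://host1:27017,host2:27017/?replicaSet=rs0&readPreference=primary"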

sbnajardhane commented 5 years ago

@mapshen

I am using the secondary just to avoid unnecessary load on my primary node. So if the plugin won't affect the performance of my db, then I am good to go with the primary. @rwynn, do you have any views on the primary/secondary preference?

I have tested the syncing by adding the delay config while connected to the mongo secondary AND by replacing the mongo url with the primary node -> both worked fine.

rwynn commented 5 years ago

I would say maybe read from the primary based on the info here

https://docs.mongodb.com/manual/core/read-preference/#counter-indications

sbnajardhane commented 5 years ago

Thanks @rwynn Resolved.