rwynn / monstache

a go daemon that syncs MongoDB to Elasticsearch in realtime. you know, for search.
https://rwynn.github.io/monstache-site/
MIT License

Monstache missing some documents to sync #447

Open chandra2037 opened 4 years ago

chandra2037 commented 4 years ago

We have two Monstache instances deployed in an EKS cluster. Each instance is deployed independently.

Instance 1 - Monstache configuration:

elasticsearch-max-conns = 12
elasticsearch-max-bytes = 8000000
gzip = true
direct-read-split-max = 9
resume = true
resume-strategy = 0
resume-name = "default" 
namespace-exclude-regex = '^.*DB\.(classification|collection|publication).*$'
verbose = false
index-oplog-time = true
oplog-date-field-format = "2006-01-02T15:04:05.999Z"
[gtm-settings]
      buffer-size = 128
      channel-size = 512
      buffer-duration = "75ms"

Instance 2 - Monstache configuration: same as the Instance 1 config, except that instead of namespace-exclude-regex it uses the following:

namespace-regex = '^.*DB\.(classification|collection|publication).*$'

The idea is that Instance 2 will index documents from the configured collections and Instance 1 will index the rest of the collections.
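
To illustrate the intended split, a small sketch (collection names are hypothetical; monstache itself evaluates these patterns with Go's regexp engine, so semantics may differ slightly from JavaScript):

// Namespaces matching the pattern go to Instance 2 (namespace-regex);
// everything else stays with Instance 1 (namespace-exclude-regex).
const pattern = /^.*DB\.(classification|collection|publication).*$/;

["testDB.classification", "testDB.publication", "testDB.users"].forEach((ns) => {
    console.log(ns, pattern.test(ns) ? "Instance 2" : "Instance 1");
});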

Everything works fine for a while, but after that, we are seeing discrepancies between the MongoDB collection document count vs Elastic Index count.

Notes:

  1. The Mongo oplog has the records for the missing documents (earlier we had an issue with missing oplog records).
  2. Checked the monstache.monstache collection for the resume timestamp; it is greater than the oplog record's ts (see the query sketch after this list).
  3. No errors found in Monstache or Elasticsearch.
  4. We cannot confirm it, but it may be happening when documents are bulk inserted into Mongo, something like the attached image.
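
A minimal way to inspect the stored resume timestamp from the mongo shell (an illustrative query; the collection name comes from the resume settings in our config, and field names may differ by monstache version):

db.getSiblingDB("monstache").monstache.find().pretty()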

Can you please advise?

Thank you for your help

rwynn commented 4 years ago

Hi, what version of monstache are you using?

I think, if you are not doing so already, you should use a different resume-name for each monstache process, as they should not share resume state.
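
For example, a minimal sketch (the names are illustrative):

# Instance 1
resume = true
resume-name = "instance-1"

# Instance 2
resume = true
resume-name = "instance-2"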

Are you able to use direct reads to do a full sync to get a matching count?

chandra2037 commented 4 years ago

Thank you for your response @rwynn

We were using 6.5.4, but we recently upgraded to the latest, 6.7.0, and are still having the same issue.

Sorry, I forgot to mention that we are indeed using a different resume-name for each instance; I can see two records in the monstache.monstache collection.

We are able to do a full document sync by running Monstache with the direct-read-namespaces config. But syncing from the oplog is crucial to our systems; any help will be appreciated.
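
For reference, a minimal sketch of the one-off direct-read resync config (namespaces are illustrative; exit-after-direct-reads is optional and worth checking against the docs):

direct-read-namespaces = ["testDB.classification", "testDB.publication"]
# stop the process once the direct reads complete
exit-after-direct-reads = true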

rwynn commented 4 years ago

I think it would be difficult to diagnose this one without a script to reproduce it. Have you experimented with some batch inserts to MongoDB to see if that is causing a problem?

chandra2037 commented 4 years ago

I have yet to create a script to reproduce the error, but in the meantime, for some reason, I am seeing the following messages in the logs:

ERROR 2020/11/19 06:21:40 elastic: bulk processor "monstache" failed: elastic: Error 400 (Bad Request): Validation Failed: 1: id is missing; [type=action_request_validation_exception]

TRACE 2020/11/19 06:21:39 HTTP/1.1 400 Bad Request
Content-Length: 227
Content-Type: application/json; charset=UTF-8

{"error":{"root_cause":[{"type":"action_request_validation_exception","reason":"Validation Failed: 1: id is missing;"}],"type":"action_request_validation_exception","reason":"Validation Failed: 1: id is missing;"},"status":400}

When I restarted the Monstache instance's Kubernetes Pod, these errors disappeared. While trying to find the root cause, I reviewed the documents; there is nothing unusual about the data, as the same documents get published fine after the restart.

Can you please advise on why and when these errors occur?

chandra2037 commented 4 years ago

Update on the issue:

  1. Found the root cause of the "id is missing" issue: there are some records in Mongo whose _id value is blank, and whenever Monstache tries to sync such a record (trace log below) to Elasticsearch, it fails (see the query sketch after this list).
{"index":{"_index":"testdb.testcollection","version":6898720532628242498,"version_type":"external"}}
{"deleted":false,"id":"","oplog_date":"2020-11-24T15:59:02Z","oplog_ts":{"T":1606233542,"I":66}}
  2. Interestingly, any other documents that Monstache sent to Elasticsearch in the same batch as that record are not getting indexed either.
     a. Monstache keeps resending the batch request continuously.
     b. New documents keep getting added to this batch, which means any document added to the batch will never be synced.
  3. As we are using the resume feature, Monstache maintains the resume timestamp in Mongo. The resume timestamp keeps moving forward even though some records (the documents stuck in the above batch) have not been synced yet, which means that if we restart Monstache, it will not be able to pick up the unsynced documents.
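
A quick way to locate such records from the mongo shell (an untested sketch; the namespace is taken from the trace log above):

db.getSiblingDB("testdb").testcollection.find({ _id: "" })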

Questions:

  1. Is there a way to make Monstache ignore documents that do not have _ids?
  2. Is it possible to instruct Monstache to stop indexing the failing data after a certain number of retries?
  3. As mentioned in point (3) above, if we restart, what would be the best way to resync the missed documents?
     a. One way I can think of is to save the failed batch data back to Mongo (into a new collection) so that we can resync it after reviewing it (removing any bad documents).
     b. We could resync whole collections using MONSTACHE_DIRECT_READ_NS, but this would be cumbersome, as the documents in the batch may belong to multiple collections.

rwynn commented 4 years ago

hi @chandra2037

  1. You can filter or transform documents via JavaScript or Golang plugins in monstache (see the filter sketch after this list).
  2. Elasticsearch error code 400 should not be a retryable error for monstache and should get dropped after Elasticsearch responds.
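
A minimal filter sketch in the TOML config that drops documents with a blank _id (based on the JavaScript filter support described in the monstache docs; verify the exact function signature and whether _id is exposed on doc against the docs):

[[filter]]
namespace = "testdb.testcollection"
script = """
module.exports = function(doc) {
    // keep only documents whose _id is present and non-empty
    return !!doc._id;
}
"""
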
chandra2037 commented 4 years ago

Thank you @rwynn

  1. I will look into the filter or transforms

  2. But for some reason, Monstache is retrying even for 400 errors; a screenshot of the errors is below:

[screenshot: repeated "elastic: bulk processor" 400 validation errors in the Monstache logs]

This will go away only if I restart the Monstache.

rwynn commented 4 years ago

Yeah, it looks like the bulk request overall is failing, not the individual items. Do you know why the Elasticsearch ID would be empty in the line item? It should be coming from the string form of the _id field in MongoDB, which I didn't think could be empty. That is, the assumption is that every MongoDB document has an _id, either user generated or auto generated.

rwynn commented 4 years ago

Like you mentioned, it does look like the golang client Monstache is using looks at the response code for the entire bulk request, maps it to an error in this case, and never clears the bulk items after all the retries have been exhausted.

rwynn commented 4 years ago

Maybe MongoDB allows an empty string to be the _id value of a document as long as it is unique for the collection? That one document might be causing this?
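
If so, a quick mongo shell check should confirm it (an illustrative namespace; the second insert should fail on the _id uniqueness constraint):

db.getSiblingDB("testdb").testcollection.insertOne({ _id: "" })  // accepted
db.getSiblingDB("testdb").testcollection.insertOne({ _id: "" })  // duplicate key error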

rwynn commented 4 years ago

@chandra2037 FYI I just pushed a commit to check for empty _id and report an error instead of attempting to index/delete the document.

chandra2037 commented 4 years ago

Maybe MongoDB allows an empty string to be the _id value of a document as long as it is unique for the collection? That one document might be causing this?

Good point, this might be the case.

chandra2037 commented 4 years ago

@chandra2037 FYI I just pushed a commit to check for empty _id and report an error instead of attempting to index/delete the document.

Thank you @rwynn, we are using the Docker version of Monstache. Can you please advise on how to get this change?

rwynn commented 3 years ago

Hi @chandra2037 can you try with version 6.7.2? The change is included.

chandra2037 commented 3 years ago

Thank you @rwynn. Really appreciate your quick responses on this issue. I will try the new version.