rwynn / monstache

a go daemon that syncs MongoDB to Elasticsearch in realtime. you know, for search.
https://rwynn.github.io/monstache-site/
MIT License

Records are missing in sync #694

Open sachinnagesh opened 1 year ago

sachinnagesh commented 1 year ago

Hi @rwynn ,

We are facing a very strange issue with monstache. Some records are not synced to the Elasticsearch index at all. It happens for roughly 5-10 records out of every 100, seemingly at random, for both creates and updates. We also don't see any entries for these records in the monstache logs. To give you an idea of our setup: we have a MongoDB deployment with a replica set and multiple databases, one per company (multi-tenant). From each company database we want to sync a MongoDB view created on the products collection.

db.createView("products-view",
"products",
[
  {
    $lookup: {
      from: "product-features",
      localField: "productid",
      foreignField: "productid",
      as: "features"
    }
  },
  {
    $lookup: {
      from: "product-technical-details",
      localField: "productid",
      foreignField: "productid",
      as: "technicals"
    }
  },
  {
    $lookup: {
      from: "product-inventory",
      localField: "productid",
      foreignField: "productid",
      as: "inventory"
    }
  }
])

e.g.
dbName: company1
collections: products, product-features, product-technical-details, product-inventory
dbName: company12
collections: products, product-features, product-technical-details, product-inventory

Here is what our monstache.toml file looks like:

mongo-url = "{{ .MongoURL }}"

elasticsearch-urls =[ "{{ .Elasticsearch.URL }}" ]
{{if .Elasticsearch.Auth.Enabled }}
elasticsearch-user = "{{ .Elasticsearch.Auth.UserName }}"
elasticsearch-password = "{{ .Elasticsearch.Auth.Password }}"
{{ end }}
{{if .Elasticsearch.SSL.Enabled }}
elasticsearch-pem-file = "{{ .Elasticsearch.SSL.Path }}"
{{ end }}

direct-read-namespaces=["company1.products-view","company2.products-view" ]

change-stream-namespaces=[ '' ]
namespace-regex='^(company1|company2)\.(products|product-features|product-technical-details|product-inventory)$'
gzip = true
stats = true
index-stats = true
dropped-collections = false
dropped-databases = false
replay = false
resume = true
resume-write-unsafe = false
resume-name = "default"
resume-strategy = 0
verbose = true
exit-after-direct-reads = false
direct-read-stateful = true
elasticsearch-retry = true
prune-invalid-json = true
relate-buffer = 500000
delete-index-pattern = "*_product-detail-index"

[gtm-settings]
buffer-duration = "100ms"

## Relate Mapping for company1
[[mapping]]
namespace = "company1.products-view"
index = "company1_product-detail-index"

[[relate]]
namespace = "company1.products"
with-namespace = "company1.products-view"
keep-src = false

[[relate]]
namespace = "company1.product-features|"
with-namespace = "company1.products"
src-field = "productid"
match-field = "productid"
keep-src = false

[[relate]]
namespace = "company1.product-technical-details"
with-namespace =  "company1.products"
src-field = "productid"
match-field = "productid"
keep-src = false

[[relate]]
namespace = "company1.product-inventory"
with-namespace = "company1.products"
src-field = "productid"
match-field = "productid"
keep-src = false

## Relate Mapping for company2
[[mapping]]
namespace = "company2.products-view"
index = "company2_product-detail-index"

[[relate]]
namespace = "company2.products"
with-namespace = "company2.products-view"
keep-src = false

[[relate]]
namespace = "company2.product-features"
with-namespace = "company2.products"
src-field = "productid"
match-field = "productid"
keep-src = false

[[relate]]
namespace = "company2.product-technical-details"
with-namespace =  "company2.products"
src-field = "productid"
match-field = "productid"
keep-src = false

[[relate]]
namespace = "company2.product-inventory"
with-namespace = "company2.products"
src-field = "productid"
match-field = "productid"
keep-src = false
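As a quick sanity check on the config above, the namespace-regex pattern can be tested against the namespaces it is meant to admit. This is only a sketch: monstache evaluates the pattern with Go's regexp package, and the check below uses JavaScript's engine on the assumption that this simple pattern behaves the same in both. The function name matchesNamespace and the candidate list are illustrative, not part of monstache.

```javascript
// The pattern is copied from the namespace-regex setting in the config above.
const namespaceRegex =
  /^(company1|company2)\.(products|product-features|product-technical-details|product-inventory)$/;

// Hypothetical helper: reports whether a "db.collection" namespace passes the filter.
function matchesNamespace(ns) {
  return namespaceRegex.test(ns);
}

// Illustrative candidates, including the relate namespace that is written
// with a trailing "|" in the company1 section of the config.
const candidates = [
  'company1.products',
  'company2.product-inventory',
  'company1.products-view',
  'company1.product-features|',
  'company3.products',
];

for (const ns of candidates) {
  console.log(ns, matchesNamespace(ns));
}
```

Note that "company1.product-features|" differs from the actual collection name "company1.product-features". Whether the trailing "|" is only a paste artifact in this post or is present in the deployed config cannot be told from the thread, but it is worth ruling out.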

We also tried setting the parameters below and removing namespace-regex, but the issue still persists:

direct-read-namespaces=["company2.products","company2.product-features","company2.product-technical-details","company2.product-inventory"]
resume-strategy = 1

We think monstache is somehow missing those create/update events. We are using monstache:6.7.10.

yunusemrecatalcam commented 1 year ago

Did it start happening recently? We are having the same problem, but we haven't changed the monstache config in 2 months, which is weird.

sachinnagesh commented 1 year ago

@yunusemrecatalcam Yes, we started facing this issue in the last 2-3 months.

sachinnagesh commented 1 year ago

@yunusemrecatalcam we found where the issue comes from. When data is fetched from the Mongo view while processing a relate, the lookup does not return the record at all during insertion. We have a MongoDB replica set deployment, and I suspect that some of the services writing to the collections are not configured with write concern majority. For now we have added a retry mechanism (5 attempts) with a delay between attempts. There will still be an issue with updates, though: the lookup may not return the latest version of the record.
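The retry idea described above can be sketched roughly as follows. This is not the code that was actually added (the thread does not show it); it is a minimal illustration of retrying a lookup a fixed number of times with a delay in between. The names fetchWithRetry, fetchFn, attempts, delayMs and sleep are all hypothetical, and the 200 ms default delay is an arbitrary assumption.

```javascript
// Hypothetical helper: calls fetchFn until it returns a document or the
// attempts run out, waiting delayMs between attempts.
async function fetchWithRetry(fetchFn, options = {}) {
  const attempts = options.attempts ?? 5; // the thread mentions 5 tries
  const delayMs = options.delayMs ?? 200; // assumed value, not from the thread
  const sleep =
    options.sleep ?? ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));

  for (let attempt = 1; attempt <= attempts; attempt++) {
    const doc = await fetchFn();
    if (doc !== null && doc !== undefined) {
      return doc; // record became visible
    }
    if (attempt < attempts) {
      await sleep(delayMs); // give the write time to become visible
    }
  }
  return null; // still not visible after all attempts
}
```

As noted above, a retry like this only helps when the record is missing entirely. For updates the lookup does return a document, just possibly a stale one, so the loop has nothing to retry on.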

sachinnagesh commented 1 year ago

@yunusemrecatalcam I think another way to solve this is to set readPreference=primary so that reads go to the primary.
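Both ideas from this thread (reading from the primary, and writing with majority write concern) map to standard MongoDB connection string options, readPreference and w. The sketch below shows one way to add them to an existing URI. The helper name withMongoOptions and the example hosts are made up, and it assumes the URI already has the "/" that separates the host list from the options.

```javascript
// Hypothetical helper: sets or overrides query options on a MongoDB URI.
// Plain string handling is used because multi-host MongoDB URIs are not
// valid WHATWG URLs and cannot be parsed with `new URL(...)`.
function withMongoOptions(uri, options) {
  const queryStart = uri.indexOf('?');
  const base = queryStart === -1 ? uri : uri.slice(0, queryStart);
  const params = new URLSearchParams(
    queryStart === -1 ? '' : uri.slice(queryStart + 1)
  );
  for (const [name, value] of Object.entries(options)) {
    params.set(name, String(value)); // replaces an existing value if present
  }
  const query = params.toString();
  return query ? `${base}?${query}` : base;
}

// For the reader side (e.g. the mongo-url used by monstache): read from the primary.
console.log(
  withMongoOptions('mongodb://host1:27017,host2:27017/?replicaSet=rs0', {
    readPreference: 'primary',
  })
);

// For the services that write the data: acknowledge writes on a majority of nodes.
console.log(
  withMongoOptions('mongodb://host1:27017,host2:27017/?replicaSet=rs0', {
    w: 'majority',
  })
);
```

Note that w=majority only affects the client that uses that URI, so it would need to be set on the services doing the writes rather than on monstache.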

arcimen54 commented 6 months ago

Hi, I have a very similar problem. I'm using these versions:

This is my toml file:

mongo-url="mongo-url?readPreference=primary"
config-database-name="database-monstache"
elasticsearch-urls =["url"]
elasticsearch-validate-pem-file=false
elasticsearch-user="user"
elasticsearch-password="password"
elasticsearch-max-conns = 50
change-stream-namespaces = [ "collection1","collection2","collection3"]
replay = false
resume = true
resume-name = "default"
index-as-update = true
direct-read-no-timeout = true
elasticsearch-retry = true
fail-fast = false
stats = false
verbose = true
disable-change-events = false
enable-patches = true
[[mapping]]
namespace = "collection1"
index = "index1"
[[mapping]]
namespace = "collection2"
index = "index2"
[[mapping]]
namespace = "collection3"
index = "index3"
[[script]]
namespace = "collection1"
script = """
module.exports = function(doc) {
  if (doc.id) { 
    doc.owner = findId(doc.owner_id, {
      collection: "collection1"
    });
  }

  function removeKey(obj) {
    Object.keys(obj).forEach(function(key) {
      if (key === "_class") delete(obj[key]);
      if (typeof obj[key] === 'object' && obj[key] !== null) {
        removeKey(obj[key])
      }
    })
  }
  removeKey(doc);

  function isNumber(value) {
    if (value === null || value === undefined) {
      return false;
    }
    if (typeof value === "string") {
      return !isNaN(value) && !isNaN(parseFloat(value));
    }
    return !isNaN(value);
  }

  if (isNumber(doc.amount)) {
    doc.amount = doc.amount * 100
  }
  if (isNumber(doc.presales_amount)) {
    doc.presales_amount = doc.presales_amount * 100
  }

  return doc;
}
"""
[[relate]]
namespace = "collection1"
with-namespace = "collection2"
src-field = "_id"
match-field = "owner_id"
keep-src = true

It happens for 5-10 records out of every 100, seemingly at random, exactly as @sachinnagesh reported. Do you have any suggestions?