rwynn / monstache

a go daemon that syncs MongoDB to Elasticsearch in realtime. you know, for search.
https://rwynn.github.io/monstache-site/
MIT License

Indexing multiple databases into single index #433

Open batamar opened 4 years ago

batamar commented 4 years ago

I have multiple databases like

erxes_[organizationId]

They all have collections named customers and companies, and I am syncing all of these databases into single indices (erxes__customers, erxes__companies) with the organizationId field added. Create and update operations sync perfectly, but when I try to delete documents, monstache looks for the wrong index and throws the following error:

TRACE 2020/09/17 10:09:11 POST /_bulk HTTP/1.1
Host: elasticsearch:9200
User-Agent: elastic/7.0.18 (linux-amd64)
Content-Length: 91
Accept: application/json
Content-Type: application/x-ndjson
Accept-Encoding: gzip

{"delete":{"_index":"erxes_5efe0a832bcb0ce0bac291cc.customers","_id":"LZDpfJNmggoeMdihR"}}

TRACE 2020/09/17 10:09:11 HTTP/1.1 200 OK
Content-Length: 429
Content-Type: application/json; charset=UTF-8

{"took":0,"errors":true,"items":[{"delete":{"_index":"erxes_5efe0a832bcb0ce0bac291cc.customers","_type":"_doc","_id":"LZDpfJNmggoeMdihR","status":404,"error":{"type":"index_not_found_exception","reason":"no such index [erxes_5efe0a832bcb0ce0bac291cc.customers]","resource.type":"index_expression","resource.id":"erxes_5efe0a832bcb0ce0bac291cc.customers","index_uuid":"_na_","index":"erxes_5efe0a832bcb0ce0bac291cc.customers"}}}]}
ERROR 2020/09/17 10:09:11 Bulk response item: {"_index":"erxes_5efe0a832bcb0ce0bac291cc.customers","_type":"_doc","_id":"LZDpfJNmggoeMdihR","status":404,"error":{"type":"index_not_found_exception","reason":"no such index [erxes_5efe0a832bcb0ce0bac291cc.customers]","resource.type":"index_expression","resource.id":"erxes_5efe0a832bcb0ce0bac291cc.customers","index":"erxes_5efe0a832bcb0ce0bac291cc.customers"}}

The main error is "reason":"no such index [erxes_5efe0a832bcb0ce0bac291cc.customers]"

Here is my TOML file. Thanks in advance.

    mongo-url="mongodb://mongo:27017"
    elasticsearch-urls=["http://elasticsearch:9200"]
    verbose=true

    index-as-update=true
    prune-invalid-json = true
    direct-read-split-max = 1
    elasticsearch-max-bytes = 2000000
    elasticsearch-max-conns = 2
    direct-read-namespaces=["erxes_5ecba67744c4836593ada2e7.customers","erxes_5ecba67744c4836593ada2e7.companies","erxes_5efe0a832bcb0ce0bac291cc.customers","erxes_5efe0a832bcb0ce0bac291cc.companies"]
    namespace-regex = "^erxes_.+.(customers|companies)$"

    [[script]]
    script = """
    module.exports = function(doc, ns) {
        var organizationId = ns.replace("erxes_", "").replace(".customers", "").replace(".companies", "")
        var index = "erxes__companies";

        if (ns.indexOf("customers") > -1) {
            if (doc.urlVisits) {
                delete doc.urlVisits
            }

            if (doc.trackedDataBackup) {
                delete doc.trackedDataBackup
            }

            if (doc.customFieldsDataBackup) {
                delete doc.customFieldsDataBackup
            }

            if (doc.messengerData) {
                delete doc.messengerData
            }

            index = "erxes__customers";
        }

        doc._meta_monstache = { index: index };

        doc.organizationId = organizationId;

        return doc;
    }
    """
rwynn commented 4 years ago

Hi @batamar,

When you route events outside the defaults you currently have 2 ways to help monstache handle deletes.

If you have a simple 1 to 1 mapping between MongoDB collection and Elasticsearch index then you can specify that like...

[[mapping]]
namespace = "db1.col1"
index = "singleIndex"

[[mapping]]
namespace = "db2.col1"
index = "singleIndex"

If you have a 1 to many mapping between MongoDB collection and Elasticsearch index then you can solve that like ...

routing-namespaces = [ "db1.col1", "db2.col1" ]

This causes a search in Elasticsearch to find the document and delete it when a delete occurs on db1.col1, db2.col1, etc.

In either case, you currently need to know and specify all the MongoDB namespaces up front.

If you need something more dynamic you would need to use a golang plugin and implement the Process function. This lets you handle complex use cases.
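To illustrate the dynamic approach: rather than deleting by exact index name, a custom plugin could issue a delete-by-query against a wildcard index pattern, matching the document by its _id wherever it landed. This sketch only builds the JSON body such a handler could POST to /erxes_*/_delete_by_query; the monstache plugin wiring itself is omitted and should be taken from the plugin docs:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// deleteByQueryBody builds a request body matching a document by its
// Elasticsearch _id, suitable for a delete-by-query across a wildcard
// index pattern when the exact per-org index is not known.
func deleteByQueryBody(id string) (string, error) {
	body := map[string]interface{}{
		"query": map[string]interface{}{
			"ids": map[string]interface{}{
				"values": []string{id},
			},
		},
	}
	b, err := json.Marshal(body)
	return string(b), err
}

func main() {
	b, _ := deleteByQueryBody("LZDpfJNmggoeMdihR")
	fmt.Println(b) // {"query":{"ids":{"values":["LZDpfJNmggoeMdihR"]}}}
}
```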

rwynn commented 4 years ago

Just thought I would add, and you may already be aware of this, but Elasticsearch is flexible in its ability to search multiple indices at once, as an alternative to collapsing many collections into one index.

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html#search-search-api-path-params

Not sure if this is acceptable in your case, but you can search for a wildcard index like erxes_*.customers and that would search across all the customers across all orgs.
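As a quick sketch of that alternative, here is how such a wildcard request could be assembled in Go; the elasticsearch:9200 host comes from the config above, and the firstName:john query string is a made-up example:

```go
package main

import (
	"fmt"
	"net/url"
)

// searchURL builds a request URL that queries every per-organization
// index at once via a wildcard pattern such as "erxes_*.customers".
func searchURL(host, pattern, query string) string {
	return fmt.Sprintf("http://%s/%s/_search?q=%s",
		host, pattern, url.QueryEscape(query))
}

func main() {
	fmt.Println(searchURL("elasticsearch:9200", "erxes_*.customers", "firstName:john"))
	// http://elasticsearch:9200/erxes_*.customers/_search?q=firstName%3Ajohn
}
```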

batamar commented 4 years ago

> Just thought I would add, and you may already be aware of this, but Elasticsearch is flexible in its ability to search multiple indices at once, as an alternative to collapsing many collections into one index.
>
> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html#search-search-api-path-params
>
> Not sure if this is acceptable in your case, but you can search for a wildcard index like erxes_*.customers and that would search across all the customers across all orgs.

@rwynn Thank you for the response. At first I was indexing into multiple indices, but I have over 4k databases, which led to a maximum shard count exceeded error.

batamar commented 4 years ago

> Hi @batamar,
>
> When you route events outside the defaults you currently have 2 ways to help monstache handle deletes.
>
> If you have a simple 1 to 1 mapping between MongoDB collection and Elasticsearch index then you can specify that like...
>
> [[mapping]]
> namespace = "db1.col1"
> index = "singleIndex"
>
> [[mapping]]
> namespace = "db2.col1"
> index = "singleIndex"
>
> If you have a 1 to many mapping between MongoDB collection and Elasticsearch index then you can solve that like ...
>
> routing-namespaces = [ "db1.col1", "db2.col1" ]
>
> This causes a search in Elasticsearch to find the document and delete it when a delete occurs on db1.col1, db2.col1, etc.
>
> In either case, you currently need to know and specify all the MongoDB namespaces up front.
>
> If you need something more dynamic you would need to use a golang plugin and implement the Process function. This lets you handle complex use cases.

@rwynn Thanks.

  1. Unfortunately, it looks like this approach cannot cover newly created organizations.
  2. Is there any example of this Go plugin?
  3. Can I write the plugin in JavaScript?
batamar commented 4 years ago

@rwynn Can I use a regex like the following?

[[mapping]]
namespace-regex = "^erxes_.+.customers$"
index = "singleIndex"
rwynn commented 4 years ago

> @rwynn Can I use regex like following
>
> [[mapping]]
> namespace-regex = "^erxes_.+.customers$"
> index = "singleIndex"

That might be something to consider for the future.

If you need a solution with the existing functionality give this a try:

# on a delete search for document regardless of the original ns in MongoDB
routing-namespaces = [ "" ]
# this is optional but may improve the performance of deletes by scoping the search
delete-index-pattern = "erxes_*"