rwynn / monstache

a go daemon that syncs MongoDB to Elasticsearch in realtime. you know, for search.
https://rwynn.github.io/monstache-site/
MIT License

Not indexing existing collection #251

Open erdincocak opened 5 years ago

erdincocak commented 5 years ago

Hi,

I am using: monstache --elasticsearch-url "http://localhost:9200" --mongo-url 'mongodb+srv://XXXX' --cluster-name cluster1 --direct-read-namespace db.collectionname

Indexing works only when a document in the collection changes. I can't index the whole collection when starting monstache. I did not set any workers, by the way; I don't know if that is necessary.

rwynn commented 5 years ago

Please post your Monstache config, Monstache version, and Mongodb and Elasticsearch versions.

rwynn commented 5 years ago

Also the result of adding -print-config to the command. And the output of Monstache.

erdincocak commented 5 years ago

I don't use a config file. Actually, I could not find one, so I run monstache from the command line with all settings.

Monstache version: 6.0.10
Elasticsearch version: 7.2.0
MongoDB version: 4

Result of -print-config:

INFO 2019/07/05 16:35:25 { "EnableTemplate": false, "EnvDelimiter": ",", "MongoURL": "mongodb+srv://DB", "MongoConfigURL": "", "MongoOpLogDatabaseName": "", "MongoOpLogCollectionName": "", "GtmSettings": { "ChannelSize": 512, "BufferSize": 32, "BufferDuration": "75ms" }, "AWSConnect": { "AccessKey": "", "SecretKey": "", "Region": "" }, "Logs": { "Info": "", "Warn": "", "Error": "", "Trace": "", "Stats": "" }, "GraylogAddr": "", "ElasticUrls": [ "http://localhost:9200" ], "ElasticUser": "", "ElasticPassword": "", "ElasticPemFile": "", "ElasticValidatePemFile": true, "ElasticVersion": "", "ElasticHealth0": 15, "ElasticHealth1": 5, "ResumeName": "default", "NsRegex": "", "NsDropRegex": "", "NsExcludeRegex": "", "NsDropExcludeRegex": "", "ClusterName": "", "Print": true, "Version": false, "Pprof": false, "EnableOplog": false, "DisableChangeEvents": false, "EnableEasyJSON": false, "Stats": false, "IndexStats": false, "StatsDuration": "", "StatsIndexFormat": "monstache.stats.2006-01-02", "Gzip": false, "Verbose": false, "Resume": false, "ResumeWriteUnsafe": false, "ResumeFromTimestamp": 0, "Replay": false, "DroppedDatabases": true, "DroppedCollections": true, "IndexFiles": false, "IndexAsUpdate": false, "FileHighlighting": false, "EnablePatches": false, "FailFast": false, "IndexOplogTime": false, "OplogTsFieldName": "oplog_ts", "OplogDateFieldName": "oplog_date", "OplogDateFieldFormat": "2006/01/02 15:04:05", "ExitAfterDirectReads": false, "MergePatchAttr": "json-merge-patches", "ElasticMaxConns": 4, "ElasticRetry": false, "ElasticMaxDocs": -1, "ElasticMaxBytes": 8388608, "ElasticMaxSeconds": 5, "ElasticClientTimeout": 0, "ElasticMajorVersion": 0, "ElasticMinorVersion": 0, "MaxFileSize": 0, "ConfigFile": "", "Script": null, "Filter": null, "Pipeline": null, "Mapping": null, "Relate": null, "FileNamespaces": null, "PatchNamespaces": null, "Workers": null, "Worker": "", "ChangeStreamNs": [ "" ], "DirectReadNs": [ "portaldb.candidate" ], "DirectReadSplitMax": 0, 
"DirectReadConcur": 0, "DirectReadNoTimeout": false, "MapperPluginPath": "", "EnableHTTPServer": false, "HTTPServerAddr": ":8080", "TimeMachineNamespaces": null, "TimeMachineIndexPrefix": "log", "TimeMachineIndexSuffix": "2006-01-02", "TimeMachineDirectReads": false, "PipeAllowDisk": false, "RoutingNamespaces": null, "DeleteStrategy": 0, "DeleteIndexPattern": "*", "ConfigDatabaseName": "monstache", "FileDownloaders": 0, "RelateThreads": 10, "RelateBuffer": 1000, "PostProcessors": 0, "PruneInvalidJSON": false, "Debug": false }

Output of monstache:

INFO 2019/07/05 16:37:53 Started monstache version 6.0.10
INFO 2019/07/05 16:37:53 Successfully connected to MongoDB version 4.0.10
INFO 2019/07/05 16:37:54 Successfully connected to Elasticsearch version 7.2.0
INFO 2019/07/05 16:37:54 Listening for events
INFO 2019/07/05 16:37:54 Watching changes on the deployment
INFO 2019/07/05 16:37:54 Direct reads completed

rwynn commented 5 years ago

Thanks for the info. The output looks good: it shows the direct reads completed message and no errors.

That assumes docs exist in portaldb.candidate and the MongoDB user has read permission on it.

You could add -stats and -verbose just for testing. That should show all requests.

Docs from direct reads should be going to index named portaldb.candidate.

erdincocak commented 5 years ago

One thing to note: when I removed cluster-name from the command, it immediately indexed over 4K docs, then stopped again. Now it only indexes a document when that document changes.

I added --stats --verbose

Messages like these appear right after every document is indexed:

{"took":8,"errors":false,"items":[{"index":{"_index":"portaldb.candidate","_type":"_doc","_id":"432efc44-269e-40a8-908e-1b295d049ef3","_version":6710180466290327553,"result":"updated","_shards":{"total":1,"successful":1,"failed":0},"_seq_no":17,"_primary_term":3,"status":200}}]}
STATS 2019/07/05 17:08:17 {"Flushed":24,"Committed":3,"Indexed":3,"Created":0,"Updated":0,"Deleted":0,"Succeeded":3,"Failed":0,"Workers":[{"Queued":0,"LastDuration":8000000},{"Queued":0,"LastDuration":0},{"Queued":0,"LastDuration":0},{"Queued":0,"LastDuration":0}]}

rwynn commented 5 years ago

You don't need cluster-name. That option is for high availability and requires that the user have write access to the collection monstache.monstache.

I'm not sure I understand what you mean by stopped again. If it directly indexed 4k docs, then it seems like it is working. Direct reads only copy the collection once to Elasticsearch per run. In addition to the copy it should be listening to all changes on the cluster and syncing them (insert, modify, delete) until the process is stopped.

erdincocak commented 5 years ago

I said stopped because there are 280K docs in the collection. It indexed only 4K, and that happened just once. When I tried again it did not index the collection; it just continued to index changed docs.

Sometimes this error occurs in the log:

ERROR 2019/07/05 17:31:59 Bulk response item: {"_index":"portaldb.candidate","_type":"_doc","_id":"3d706307-d1a8-4edb-aa6f-3679dc1631c3","status":409,"error":{"type":"version_conflict_engine_exception","reason":"[3d706307-d1a8-4edb-aa6f-3679dc1631c3]: version conflict, current version [6710186809957023750] is higher or equal to the one provided [6710186809957023748]","index":"portaldb.candidate"}}

rwynn commented 5 years ago

You can ignore those errors because they mean that you already have a newer version of the doc in Elasticsearch.

https://www.elastic.co/blog/elasticsearch-versioning-support
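
To illustrate why the 409 above is harmless, here is a minimal Go sketch of the comparison Elasticsearch makes for writes that use external versioning: the write is applied only when the supplied version is strictly greater than the stored one. It assumes monstache sends its writes with external versioning, which the wording of the error message suggests. The function name acceptsExternalVersion is made up for this example and is not part of monstache or Elasticsearch.

```go
package main

import "fmt"

// acceptsExternalVersion is a hypothetical helper that mirrors the rule
// Elasticsearch applies to externally versioned writes: the write is applied
// only when the provided version is strictly greater than the stored one.
func acceptsExternalVersion(current, provided int64) bool {
	return provided > current
}

func main() {
	// Version numbers copied from the 409 response quoted in this thread.
	const current int64 = 6710186809957023750
	const provided int64 = 6710186809957023748

	if acceptsExternalVersion(current, provided) {
		fmt.Println("write applied")
	} else {
		fmt.Println("version conflict (409): stored document is already newer")
	}
}
```

With the two versions from the log, the provided version is lower than the stored one, so the write is rejected and the newer document stays in the index.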

As for the collection not entirely syncing: I'm not sure. There is another similar issue, but for 16 million docs. I cannot replicate the partial copy. I just put 500K docs in a test collection and they all synced.

It's hard to say, though, because everyone is on different versions of MongoDB. I currently have 4.0.10 in my VM.

erdincocak commented 5 years ago

"Flushed":126,"Committed":7,"Indexed":9,"Created":0,"Updated":0,"Deleted":0,"Succeeded":6,"Failed":3

Flushed keeps increasing but the other counters do not, and I see no change in the doc count in Kibana. What does that mean? Is it related to the elasticsearch-max-bytes and elasticsearch-max-docs settings?

rwynn commented 5 years ago

Flushed will increase every 5s cause that is the auto flush interval.

My output looks like this...

INFO 2019/07/05 14:46:06 Started monstache version 6.0.10
INFO 2019/07/05 14:46:06 Successfully connected to MongoDB version 4.0.10
INFO 2019/07/05 14:46:06 Successfully connected to Elasticsearch version 7.0.0
INFO 2019/07/05 14:46:06 Listening for events
INFO 2019/07/05 14:46:06 Watching changes on the deployment
INFO 2019/07/05 14:46:20 Direct reads completed
STATS 2019/07/05 14:46:36 {"Flushed":5,"Committed":10,"Indexed":388953,"Created":0,"Updated":0,"Deleted":0,"Succeeded":388953,"Failed":0,"Workers":[{"Queued":0,"LastDuration":555000000},{"Queued":0,"LastDuration":534000000},{"Queued":0,"LastDuration":176000000},{"Queued":0,"LastDuration":1352000000}]}
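
For comparing runs, the counters in a STATS line can be pulled out with a few lines of Go. This is only a sketch: the type bulkStats and the function parseStatsLine are invented for this example and are not monstache code, and the struct lists just the counters discussed in this thread. The Flushed counter is left out of the printout because, as noted above, it rises on the flush timer whether or not anything new was sent.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// bulkStats holds a subset of the counters seen in a monstache STATS line.
// Fields not listed here (Created, Workers, ...) are ignored when decoding.
type bulkStats struct {
	Flushed   int64
	Committed int64
	Indexed   int64
	Succeeded int64
	Failed    int64
}

// parseStatsLine is a hypothetical helper: it finds the JSON object that
// follows the log prefix and decodes it into bulkStats.
func parseStatsLine(line string) (bulkStats, error) {
	var s bulkStats
	start := strings.Index(line, "{")
	if start < 0 {
		return s, fmt.Errorf("no JSON object in line")
	}
	err := json.Unmarshal([]byte(line[start:]), &s)
	return s, err
}

func main() {
	// STATS line copied from the output above.
	line := `STATS 2019/07/05 14:46:36 {"Flushed":5,"Committed":10,"Indexed":388953,"Created":0,"Updated":0,"Deleted":0,"Succeeded":388953,"Failed":0,"Workers":[{"Queued":0,"LastDuration":555000000}]}`

	s, err := parseStatsLine(line)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("indexed=%d succeeded=%d failed=%d\n", s.Indexed, s.Succeeded, s.Failed)
}
```

Comparing Indexed against the expected document count of the collection is the quickest check of whether a direct read copied everything.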

erdincocak commented 5 years ago

Do you give cluster-name in elasticsearch.yml? Are there any other settings in elasticsearch.yml like thread_pool.bulk_size etc.?

rwynn commented 5 years ago

Nope just running with this, both MongoDB and Elasticsearch on localhost in the VM.

monstache -direct-read-namespace test.test -stats

elasticsearch.yml is all commented out. The default settings for 7.0.0.

> Are there any other settings in elasticsearch.yml like thread_pool.bulk_size etc.?

https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking-changes-7.0.html#write-thread-pool-fallback

rwynn commented 5 years ago

Not sure what it could be. If you are comfortable with Golang you can add a print statement or increment a counter at the following line in monstache.

https://github.com/rwynn/monstache/blob/rel6/monstache.go#L4182

With direct reads enabled that line should get hit for every doc in the collection in addition to any changes that you make to MongoDB.

If it is getting hit for every doc (~280k times in your case) then something is going wrong at the indexing step. If it isn’t getting hit for every doc then the problem is reading from MongoDB.
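
A counter like the one suggested could look roughly like the Go sketch below. It does not reproduce the code at the linked line. The docCounter type, the events channel, and the logging interval are all hypothetical stand-ins, and the counter uses sync/atomic only so that it stays correct if the instrumented line turns out to be reached from more than one goroutine.

```go
package main

import (
	"fmt"
	"log"
	"sync/atomic"
)

// docCounter is a hypothetical counter that is safe to increment from
// several goroutines at once.
type docCounter struct{ n int64 }

// Inc adds one and returns the new total.
func (c *docCounter) Inc() int64 { return atomic.AddInt64(&c.n, 1) }

// Load returns the current total.
func (c *docCounter) Load() int64 { return atomic.LoadInt64(&c.n) }

func main() {
	var seen docCounter

	// Stand-in for the stream of documents reaching the instrumented line.
	events := make(chan string)
	go func() {
		defer close(events)
		for i := 0; i < 280000; i++ {
			events <- "doc"
		}
	}()

	for range events {
		// Log occasionally rather than once per document.
		if n := seen.Inc(); n%50000 == 0 {
			log.Printf("processed %d docs", n)
		}
	}
	fmt.Println("total docs seen:", seen.Load())
}
```

If the final total is close to the size of the collection, the read side is fine and the indexing step is the place to look; if it stops near 4K, the direct read itself is ending early.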