Error performing direct read of collection: Cursor already in use

darkbluestring commented 5 years ago

Hi,

I am using direct-read-ns to copy a view of documents from mongo to initialize an index in es. The view itself is a lookup between a collection and another view (itself made of a lookup between 2 other collections).

I get the above error after a few documents got successfully replicated (300 out of 9000)

ERROR 2019/05/23 17:01:14 Error performing direct read of collection saq.v_products_quantities: (CursorInUse) cursor id 5339801235002722837 is already in use

Before the cursor error I also get some document rejected by es due to field formatting issue. I doubt it has anything to do with the cursor issue ( I need to fix some documents obviously). But I mention it just in case.

ERROR 2019/05/23 17:01:18 Bulk response item: {"_index":"saq.v_products_quantities","_type":"_doc","_id":"13945062","status":400,"error":{"type":"mapper_parsing_exception","reason":"failed to parse field [sugar_content] of type [float] in document with id '13945062'","caused_by":{"reason":"For input string: \"\u003c1.2 g/l\"","type":"number_format_exception"}}}

For now I have disabled change events to focus on the initial direct read sync. I am using monstache 5.0.5, mongo 4.0.9 is replicaset mode with a single node, and elastic 6.7.2

Am I missing anything in my TOM settings to get the direct-read right ?

Thanks

My tom looks like this:

direct-read-namespaces = ["saq.v_products_quantities"] # read direct from the view of the collection to seed index

disable-change-events = true

gzip = true

verbose = false

change-stream-namespaces = ["saq.products","saq.quantities"] 

[[mapping]]
namespace = "saq.products"
index = "saq.v_products_quantities"

[[mapping]]
namespace = "saq.v_products_quantities" 
index = "saq.v_products_quantities"

[[relate]]
namespace = "saq.products"  
with-namespace = "saq.v_products_quantities"
keep-src = false 

[[relate]]
namespace = "saq.quantities" 
with-namespace = "saq.products"
src-field = "product_code_SAQ" 
match-field = "saq_code" 
keep-src = false

darkbluestring commented 5 years ago

I continued digging using the mongodb log The cursor is associated with connection conn4: At the end of the log fragment we can see that mongodb has killed the cursor on a timeout apparently (right after conn5 is created).

2019-05-23T17:00:48.359+0000 I NETWORK  [conn4] received client metadata from 172.18.0.5:60190 conn4: { driver: { name: "mongo-go-driver", version: "v1.0.2" }, os: { type: "linux", architecture: "amd64" }, platform: "go1.12.4", application: { name: "monstache" } }
2019-05-23T17:00:48.380+0000 I ACCESS   [conn4] Successfully authenticated as principal restheart on admin
2019-05-23T17:00:52.245+0000 I COMMAND  [conn4] command saq.products appName: "monstache" command: find { find: "v_products_quantities", filter: {}, projection: { id: 1 }, skip: 50000, sort: { _id: 1 }, lsid: { id: UUID("69209cbb-cc42-4606-aa23-6829c9a71aaa") }, $clusterTime: { clusterTime: Timestamp(1558630841, 1), signature: { hash: BinData(0, 64F5F4C5DA2D44A5F9FEEFCC8D7197005B51EAE5), keyId: 6682788118359179266 } }, $db: "saq", $readPreference: { mode: "primary" } } planSummary: COLLSCAN numYields:355 ok:0 errMsg:"Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in." errName:Location16819 errCode:16819 reslen:354 locks:{ Global: { acquireCount: { r: 151496 } }, Database: { acquireCount: { r: 151496 } }, Collection: { acquireCount: { r: 151495 } } } storage:{ data: { bytesRead: 21388690, timeReadingMicros: 369623 } } protocol:op_msg 3831ms
2019-05-23T17:00:53.586+0000 I COMMAND  [conn4] command saq.v_products_quantities appName: "monstache" command: find { find: "v_products_quantities", filter: {}, batchSize: 1000, lsid: { id: UUID("69209cbb-cc42-4606-aa23-6829c9a71aaa") }, $clusterTime: { clusterTime: Timestamp(1558630841, 1), signature: { hash: BinData(0, 64F5F4C5DA2D44A5F9FEEFCC8D7197005B51EAE5), keyId: 6682788118359179266 } }, $db: "saq", $readPreference: { mode: "primary" } } planSummary: COLLSCAN cursorid:5339801235002722837 keysExamined:0 docsExamined:1180 numYields:120 nreturned:43 reslen:16537911 locks:{ Global: { acquireCount: { r: 60915 } }, Database: { acquireCount: { r: 60915 } }, Collection: { acquireCount: { r: 60914 } } } storage:{} protocol:op_msg 1339ms
2019-05-23T17:00:56.148+0000 I COMMAND  [conn4] command saq.v_products_quantities appName: "monstache" command: getMore { getMore: 5339801235002722837, collection: "v_products_quantities", batchSize: 1000, lsid: { id: UUID("69209cbb-cc42-4606-aa23-6829c9a71aaa") }, $clusterTime: { clusterTime: Timestamp(1558630851, 1), signature: { hash: BinData(0, 6F834FFB183AFAC411C892DDEB5DF2A8EB9016E0), keyId: 6682788118359179266 } }, $db: "saq", $readPreference: { mode: "primaryPreferred" } } originatingCommand: { aggregate: "products", pipeline: [ { $lookup: { from: "v_quantities_outlet", localField: "saq_code", foreignField: "product_code_SAQ", as: "quantities" } } ], collation: { locale: "simple" }, cursor: { batchSize: 1000 } } planSummary: COLLSCAN cursorid:5339801235002722837 keysExamined:0 docsExamined:0 numYields:108 nreturned:41 reslen:16535329 locks:{ Global: { acquireCount: { r: 59252 } }, Database: { acquireCount: { r: 59251 } }, Collection: { acquireCount: { r: 59250 } } } storage:{} protocol:op_msg 1405ms
2019-05-23T17:00:59.015+0000 I COMMAND  [conn4] command saq.v_products_quantities appName: "monstache" command: getMore { getMore: 5339801235002722837, collection: "v_products_quantities", batchSize: 1000, lsid: { id: UUID("69209cbb-cc42-4606-aa23-6829c9a71aaa") }, $clusterTime: { clusterTime: Timestamp(1558630851, 1), signature: { hash: BinData(0, 6F834FFB183AFAC411C892DDEB5DF2A8EB9016E0), keyId: 6682788118359179266 } }, $db: "saq", $readPreference: { mode: "primaryPreferred" } } originatingCommand: { aggregate: "products", pipeline: [ { $lookup: { from: "v_quantities_outlet", localField: "saq_code", foreignField: "product_code_SAQ", as: "quantities" } } ], collation: { locale: "simple" }, cursor: { batchSize: 1000 } } planSummary: COLLSCAN cursorid:5339801235002722837 keysExamined:0 docsExamined:0 numYields:108 nreturned:40 reslen:16534015 locks:{ Global: { acquireCount: { r: 59246 } }, Database: { acquireCount: { r: 59245 } }, Collection: { acquireCount: { r: 59244 } } } storage:{ data: { bytesRead: 1661265, timeReadingMicros: 66430 } } protocol:op_msg 1850ms
2019-05-23T17:01:01.112+0000 I COMMAND  [conn4] command saq.v_products_quantities appName: "monstache" command: getMore { getMore: 5339801235002722837, collection: "v_products_quantities", batchSize: 1000, lsid: { id: UUID("69209cbb-cc42-4606-aa23-6829c9a71aaa") }, $clusterTime: { clusterTime: Timestamp(1558630851, 1), signature: { hash: BinData(0, 6F834FFB183AFAC411C892DDEB5DF2A8EB9016E0), keyId: 6682788118359179266 } }, $db: "saq", $readPreference: { mode: "primaryPreferred" } } originatingCommand: { aggregate: "products", pipeline: [ { $lookup: { from: "v_quantities_outlet", localField: "saq_code", foreignField: "product_code_SAQ", as: "quantities" } } ], collation: { locale: "simple" }, cursor: { batchSize: 1000 } } planSummary: COLLSCAN cursorid:5339801235002722837 keysExamined:0 docsExamined:0 numYields:110 nreturned:42 reslen:16537203 locks:{ Global: { acquireCount: { r: 59260 } }, Database: { acquireCount: { r: 59259 } }, Collection: { acquireCount: { r: 59258 } } } storage:{ data: { bytesRead: 4162657, timeReadingMicros: 75016 } } protocol:op_msg 1321ms
2019-05-23T17:01:03.751+0000 I COMMAND  [conn4] command saq.v_products_quantities appName: "monstache" command: getMore { getMore: 5339801235002722837, collection: "v_products_quantities", batchSize: 1000, lsid: { id: UUID("69209cbb-cc42-4606-aa23-6829c9a71aaa") }, $clusterTime: { clusterTime: Timestamp(1558630851, 1), signature: { hash: BinData(0, 6F834FFB183AFAC411C892DDEB5DF2A8EB9016E0), keyId: 6682788118359179266 } }, $db: "saq", $readPreference: { mode: "primaryPreferred" } } originatingCommand: { aggregate: "products", pipeline: [ { $lookup: { from: "v_quantities_outlet", localField: "saq_code", foreignField: "product_code_SAQ", as: "quantities" } } ], collation: { locale: "simple" }, cursor: { batchSize: 1000 } } planSummary: COLLSCAN cursorid:5339801235002722837 keysExamined:0 docsExamined:0 numYields:111 nreturned:43 reslen:16538085 locks:{ Global: { acquireCount: { r: 59267 } }, Database: { acquireCount: { r: 59266 } }, Collection: { acquireCount: { r: 59265 } } } storage:{ data: { bytesRead: 4429383, timeReadingMicros: 85985 } } protocol:op_msg 1904ms
2019-05-23T17:01:05.771+0000 I COMMAND  [conn4] command saq.v_products_quantities appName: "monstache" command: getMore { getMore: 5339801235002722837, collection: "v_products_quantities", batchSize: 1000, lsid: { id: UUID("69209cbb-cc42-4606-aa23-6829c9a71aaa") }, $clusterTime: { clusterTime: Timestamp(1558630861, 1), signature: { hash: BinData(0, F2CAD9EE5C04DEBF7893DF808F0CBEF223F15591), keyId: 6682788118359179266 } }, $db: "saq", $readPreference: { mode: "primaryPreferred" } } originatingCommand: { aggregate: "products", pipeline: [ { $lookup: { from: "v_quantities_outlet", localField: "saq_code", foreignField: "product_code_SAQ", as: "quantities" } } ], collation: { locale: "simple" }, cursor: { batchSize: 1000 } } planSummary: COLLSCAN cursorid:5339801235002722837 keysExamined:0 docsExamined:0 numYields:112 nreturned:43 reslen:16538479 locks:{ Global: { acquireCount: { r: 59268 } }, Database: { acquireCount: { r: 59267 } }, Collection: { acquireCount: { r: 59266 } } } storage:{ data: { bytesRead: 4311920, timeReadingMicros: 72380 } } protocol:op_msg 1381ms
2019-05-23T17:01:08.189+0000 I COMMAND  [conn4] command saq.v_products_quantities appName: "monstache" command: getMore { getMore: 5339801235002722837, collection: "v_products_quantities", batchSize: 1000, lsid: { id: UUID("69209cbb-cc42-4606-aa23-6829c9a71aaa") }, $clusterTime: { clusterTime: Timestamp(1558630861, 1), signature: { hash: BinData(0, F2CAD9EE5C04DEBF7893DF808F0CBEF223F15591), keyId: 6682788118359179266 } }, $db: "saq", $readPreference: { mode: "primaryPreferred" } } originatingCommand: { aggregate: "products", pipeline: [ { $lookup: { from: "v_quantities_outlet", localField: "saq_code", foreignField: "product_code_SAQ", as: "quantities" } } ], collation: { locale: "simple" }, cursor: { batchSize: 1000 } } planSummary: COLLSCAN cursorid:5339801235002722837 keysExamined:0 docsExamined:0 numYields:108 nreturned:41 reslen:16537565 locks:{ Global: { acquireCount: { r: 59252 } }, Database: { acquireCount: { r: 59251 } }, Collection: { acquireCount: { r: 59250 } } } storage:{ data: { bytesRead: 3983166, timeReadingMicros: 65680 } } protocol:op_msg 1644ms
2019-05-23T17:01:10.108+0000 I COMMAND  [conn4] command saq.v_products_quantities appName: "monstache" command: getMore { getMore: 5339801235002722837, collection: "v_products_quantities", batchSize: 1000, lsid: { id: UUID("69209cbb-cc42-4606-aa23-6829c9a71aaa") }, $clusterTime: { clusterTime: Timestamp(1558630861, 1), signature: { hash: BinData(0, F2CAD9EE5C04DEBF7893DF808F0CBEF223F15591), keyId: 6682788118359179266 } }, $db: "saq", $readPreference: { mode: "primaryPreferred" } } originatingCommand: { aggregate: "products", pipeline: [ { $lookup: { from: "v_quantities_outlet", localField: "saq_code", foreignField: "product_code_SAQ", as: "quantities" } } ], collation: { locale: "simple" }, cursor: { batchSize: 1000 } } planSummary: COLLSCAN cursorid:5339801235002722837 keysExamined:0 docsExamined:0 numYields:108 nreturned:44 reslen:16540100 locks:{ Global: { acquireCount: { r: 59270 } }, Database: { acquireCount: { r: 59269 } }, Collection: { acquireCount: { r: 59268 } } } storage:{ data: { bytesRead: 4423552, timeReadingMicros: 94969 } } protocol:op_msg 1243ms
2019-05-23T17:01:12.899+0000 I COMMAND  [conn4] command saq.v_products_quantities appName: "monstache" command: getMore { getMore: 5339801235002722837, collection: "v_products_quantities", batchSize: 1000, lsid: { id: UUID("69209cbb-cc42-4606-aa23-6829c9a71aaa") }, $clusterTime: { clusterTime: Timestamp(1558630869, 1), signature: { hash: BinData(0, 87F1CA25427DAC6116257684EFACA2F767DD6129), keyId: 6682788118359179266 } }, $db: "saq", $readPreference: { mode: "primaryPreferred" } } originatingCommand: { aggregate: "products", pipeline: [ { $lookup: { from: "v_quantities_outlet", localField: "saq_code", foreignField: "product_code_SAQ", as: "quantities" } } ], collation: { locale: "simple" }, cursor: { batchSize: 1000 } } planSummary: COLLSCAN cursorid:5339801235002722837 keysExamined:0 docsExamined:0 numYields:108 nreturned:46 reslen:16544776 locks:{ Global: { acquireCount: { r: 59282 } }, Database: { acquireCount: { r: 59281 } }, Collection: { acquireCount: { r: 59280 } } } storage:{ data: { bytesRead: 4121227, timeReadingMicros: 121862 } } protocol:op_msg 2027ms
2019-05-23T17:01:14.391+0000 I NETWORK  [listener] connection accepted from 172.18.0.5:60216 #5 (3 connections now open)
2019-05-23T17:01:14.391+0000 I NETWORK  [conn5] received client metadata from 172.18.0.5:60216 conn5: { driver: { name: "mongo-go-driver", version: "v1.0.2" }, os: { type: "linux", architecture: "amd64" }, platform: "go1.12.4", application: { name: "monstache" } }
**2019-05-23T17:01:14.418+0000 I ACCESS   [conn5] Successfully authenticated as principal restheart on admin**
**2019-05-23T17:01:14.455+0000 I COMMAND  [conn4] Timed out obtaining lock while trying to gather storage statistics for a slow operation**
2019-05-23T17:01:14.455+0000 I COMMAND  [conn4] command saq.v_products_quantities appName: "monstache" command: getMore { getMore: 5339801235002722837, collection: "v_products_quantities", batchSize: 1000, lsid: { id: UUID("69209cbb-cc42-4606-aa23-6829c9a71aaa") }, $clusterTime: { clusterTime: Timestamp(1558630869, 1), signature: { hash: BinData(0, 87F1CA25427DAC6116257684EFACA2F767DD6129), keyId: 6682788118359179266 } }, $db: "saq", $readPreference: { mode: "primaryPreferred" } } originatingCommand: { aggregate: "products", pipeline: [ { $lookup: { from: "v_quantities_outlet", localField: "saq_code", foreignField: "product_code_SAQ", as: "quantities" } } ], collation: { locale: "simple" }, cursor: { batchSize: 1000 } } planSummary: COLLSCAN cursorid:5339801235002722837 numYields:26 ok:0 **errMsg:"operation was interrupted" errName:CursorKilled errCode:237 reslen:238 locks:{ Global: { acquireCount: { r: 12299 } }, Database: { acquireCount: { r: 12296 } }, Collection: { acquireCount: { r: 12295 } } } protocol:op_msg 578ms**
2019-05-23T17:01:14.456+0000 I NETWORK  [conn4] end connection 172.18.0.5:60190 (2 connections now open)

darkbluestring commented 5 years ago

The best way to solve an issue is to post one ;-) The issue seems to have nothing to do with monstache. I can reproduce the time out on the view by doing a query with the same sort condition monstache is using {_id:1}. I need to fix this slow view ! Sorry for the noise and thanks again for that great software !! I will close this issue.

rwynn commented 5 years ago

Hi @darkbluestring glad you were able to track down the problem. You may want to try the setting:

direct-read-split-max = -1

That should omit the query with the _id sort on it and instead just try to use 1 cursor no sort over the entire view.

The split stuff is where monstache first attempts to break the collection/view into segments. It does this with an _id projection and _id sort and skip. The result of these upfront queries is a series of segments which are basically { startId, [endId] }. With those segments it then reads each segment concurrently using a $gt and $lt query over _id with no sort or skip.

If you set the above it disables the splitting and you just get a single cursor that reads over the entire view.

darkbluestring commented 5 years ago

Thanks I will try that. BTW, do you have any way in monstache to set the MaxTimeMS value to increase the timeout on the query monstache use to read the view. It is a driver side setting and it is my understanding that there is no default server side value in mongo that can be set and neither did I find a way to change the default value on the globalsign/mgo driver that monstache use. It appears it must be set on a per request basis using https://godoc.org/github.com/globalsign/mgo#Query.SetMaxTime Curently the driver seems to default to 5s as does mongodb compass.

darkbluestring commented 5 years ago

I tried the setting you suggested but I get the same error

direct-read-split-max = -1

2019-05-23T19:19:29.247+0000 I COMMAND  [conn36] Timed out obtaining lock while trying to gather storage statistics for a slow operation
2019-05-23T19:19:29.247+0000 I COMMAND  [conn36] command saq.v_products_quantities appName: "monstache" command: getMore { getMore: 4638572615318218049, collection: "v_products_quantities", batchSize: 1000, lsid: { id: UUID("4d65d28e-b088-47e4-af04-b9c4c7a5722e") }, $clusterTime: { clusterTime: Timestamp(1558639162, 1), signature: { hash: BinData(0, 951F7B7756054A4E693238F721A842F8EB04E6FD), keyId: 6682788118359179266 } }, $db: "saq", $readPreference: { mode: "primaryPreferred" } } originatingCommand: { aggregate: "products", pipeline: [ { $lookup: { from: "v_quantities_outlet", localField: "saq_code", foreignField: "product_code_SAQ", as: "quantities" } } ], collation: { locale: "simple" }, cursor: { batchSize: 1000 } } planSummary: COLLSCAN cursorid:4638572615318218049 numYields:24 ok:0 errMsg:"operation was interrupted" errName:CursorKilled errCode:237 reslen:238 locks:{ Global: { acquireCount: { r: 11835 } }, Database: { acquireCount: { r: 11832 } }, Collection: { acquireCount: { r: 11831 } } } protocol:op_msg 407ms

rwynn commented 5 years ago

@darkbluestring with monstache v5 and v6 I moved off of the mgo driver and to the new golang driver from MongoDB. The driver is configured completely via the URL and you can check all the things you can set here: https://github.com/mongodb/mongo-go-driver/blob/v1.0.2/x/network/connstring/connstring.go

I'm not sure if there is a better place to link to where these are all documented.

darkbluestring commented 5 years ago

Thanks for the information. It appears the driver supports all the mongo connexion URI options which is great. Unfortunately the mongodb uri format does not support to set the default cursor time out. It has to be provided as part of the call to collection.find (it looks like it is the MaxTime parameter in the FindOptions struct that is provided as parameter to the collection.find method) https://github.com/mongodb/mongo-go-driver/blob/v1.0.2/mongo/options/findoptions.go https://github.com/mongodb/mongo-go-driver/blob/v1.0.2/mongo/collection.go

rwynn commented 5 years ago

For SetMaxTime specifically I think you would need to use https://github.com/mongodb/mongo-go-driver/blob/v1.0.2/mongo/options/findoptions.go#L101 https://github.com/mongodb/mongo-go-driver/blob/v1.0.2/mongo/options/findoptions.go#L310

And with monstache that would need to be set across a bunch of queries in the following place to affect direct reads: https://github.com/rwynn/gtm/blob/leafny/gtm.go https://github.com/rwynn/gtm/blob/leafny/gtm.go#L1345 https://github.com/rwynn/gtm/blob/leafny/gtm.go#L1041

But like you said I don't see a default for this and looks like it needs to be set on each query.

darkbluestring commented 5 years ago

Yes we are on the same line.

darkbluestring commented 5 years ago

I suppose it could be added as a new config parameter. Something like 'direct-read-default-timeout'.

rwynn commented 5 years ago

Yeah, I can look into that. It's weird that the error message is "cursor in use". I think that might be related to how I am using a go context timeout on the call to Next() on the cursor. If that timeout is reached (10s) then Next() unblocks and I try to call Next() again unless something external is trying to stop operation (like someone trying to shut down the process).

https://github.com/rwynn/gtm/blob/leafny/gtm.go#L1060 https://github.com/rwynn/gtm/blob/leafny/gtm.go#L1090

I just use that context timeout as a way to get back control instead of having Next() block forever waiting for data. But maybe you cannot use the cursor properly after that context times out.

rwynn commented 5 years ago

@darkbluestring any chance you could try a local build of monstache from the rel6 branch? I just made some changes there to remove a context timeout for direct reads. I just want to check if maybe that fixes this problem without having to add a new maxTime setting.

Let me know if you need any help building it. Should be as easy at install golang 1.12 and go run monstache.go.

darkbluestring commented 5 years ago

Hi @rwynn,

I sure will! somewhere today as soon as I have a bit of free time. Also do you want me to reopen the issue to track the change ? Thanks

rwynn commented 5 years ago

@darkbluestring I'll reopen the issue. I want to try to understand the "Cursor already in use" error. I can see it in the list of MongoDB error codes but I can't find any documentation about what exactly that means. Trying to understand if the client is causing the issue by issuing more than one command at a time.

darkbluestring commented 5 years ago

I've tested the rel6 locally.

Not seeing the "cursor already in use" error in monstache logs anymore. No more time out in mongodb logs neither !

But we are getting another weird behavior. The elastic server appears to freeze after a bunch of bulk upload. Initially I got an "out of memory" error so I increased the es heap size to 4G instead of 1G. No more out of memory error afterwards but now elastic does not respond on port 9200 after some time. (from kibana, curl, or any other http client). Obviously it is an issue on the elasticsearch server side but monstache does not seems to report any error on its side and appears to get frozen as well.

I must do more digging on ES side to understand why ES refuse connection on 9200...

Also, but not related, I have tried to remove the direct-read-split-max = -1 to restore the {_id:1} sort attempt. In that case, I get the exact same error I have from the mongo shell test script: "Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in". Definitely an issue with the performance of the view and a limitation of mongodb view config that does not support allowDiskUse setting. One must wrap the call to the view in an aggregate([]) to work around that issue: https://stackoverflow.com/questions/41149573/use-createview-with-allowdiskuse

But what troubles me is that I get that message only on the mongo server log, nothing in monstache log . It appears the error is non fatal to monstache which continue to sync the view (and eventually freeze as described above). Maybe a warning in the monstache logs would make sense ?

2019-05-24T17:27:43.137+0000 I COMMAND  [conn3] command saq.products appName: "monstache" command: find { find: "v_products_quantities", filter: {}, projection: { id: 1 }, skip: 50000, sort: { _id: 1 }, lsid: { id: UUID("702cdaa6-c2d0-4a82-87e7-19a0d2b44862") }, $clusterTime: { clusterTime: Timestamp(1558718853, 1), signature: { hash: BinData(0, B35B4C954D8E8E5AF569CC9A3D8C79E2DD48CDE0), keyId: 6682788118359179266 } }, $db: "saq", $readPreference: { mode: "primary" } } planSummary: COLLSCAN numYields:351 ok:0 errMsg:"Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in." errName:Location16819 errCode:16819 reslen:354 locks:{ Global: { acquireCount: { r: 151492 } }, Database: { acquireCount: { r: 151492 } }, Collection: { acquireCount: { r: 151491 } } } storage:{ data: { bytesRead: 21478585, timeReadingMicros: 343384 } } protocol:op_msg 3481ms

darkbluestring commented 5 years ago

I fixed my 9200 connection issue. nothing to do with monstache. I am resuming my testing of rel6 with setting

direct-read-split-max = -1

darkbluestring commented 5 years ago

Ok now monstache runs fine until it brutally terminate after 3mn of work without any error msg in the log.

mongodb-log.txt

Maybe an out of memory issue... The process reached 5.6Gb (out of 10.5Gb available, the rest being allocated to ES and mongo)

INFO 2019/05/24 18:40:43 Started monstache version 6.0.5
INFO 2019/05/24 18:40:43 Successfully connected to MongoDB version 4.0.9
INFO 2019/05/24 18:40:43 Successfully connected to Elasticsearch version 6.7.2
INFO 2019/05/24 18:40:43 Listening for events

rwynn commented 5 years ago

I'm not sure what to make of that memory usage of monstache. By default it has 4 go routines pushing to Elasticsearch and they buffer up to 8mb each. Once they get flushed that memory should be getting garbage collected.

If you run with stats=true and stats-duration="5s", do you notice if the queued count keeps rising? Or does that remain constant or up/down?

darkbluestring commented 5 years ago

Thanks I will try that.

In the meantime, something intrigues me but I could be wrong: in function DirectReadPaged, line 1297, I would have expected the projection to be on field "_id" instead of "id" as monstache does all the segmentation and operation construction with the "_id" field. And also I don't think there is any guarantee that the collection has a "id" field.

https://github.com/rwynn/gtm/blob/leafny/gtm.go#L1297

pro := bson.M{"id": 1}

Not sure it has anything to do with my issue though.

rwynn commented 5 years ago

@darkbluestring good catch on id. I think I messed that up when porting from https://github.com/rwynn/gtm/blob/development/gtm.go by hand. Original version did use _id correctly.

darkbluestring commented 5 years ago

So the queued count keep rising from 20 to 132 until the process reaches the fatal 5.6G threshold.

One stupid question: I 've realized rel6 is compiled against elastic 7.0. I'am using 6.7.2. Could that explain why everything looks like the data never reach elastic. I cannot see the index being created in elastic.

rwynn commented 5 years ago

@darkbluestring I've fixed it if you want to pull rel5/rel6 again.

rwynn commented 5 years ago

yes, thats probably it. you need rel5 branch for es6. rel6 branch for es7. they are the same for the most part, except for API changes in es. And for es7 monstache no longer sends the type field which was required in all previous versions of Elasticsearch.

https://www.elastic.co/guide/en/elasticsearch/reference/master/removal-of-types.html

I need to get some time to update the docs with the compat chart. Sorry.

darkbluestring commented 5 years ago

nothing to be sorry about. My bad I used the wrong rel. With rel5 the memory is now fully back to normal level (below 1Gb ) without any twists in the setting. I am resuming my testing.

darkbluestring commented 5 years ago

OK conclusion of the testing: It all works fine, NO MORE TIME OUT in mongodb log and no more "cursor already in use'" in the monstache log (With and without direct-read-split-max = -1) The whole process takes 5mn on my testing machine and the memory used by monstache remains below 1G. When not setting direct-read-split-max = -1 is still get the no fatal error on mongodb log because of the sort {_id:1}. But monstache continue its processing defaulting to a single segment apparently.

If you want we could investigate how to solve that as well to enable segmentation and improve performance but it is another issue. The idea would be to detect that it is a view and not a collection and wrap it with aggregate([],AllowDiskUse:true) Let me know if you are interested. Thanks again for your help and swift fixing on this issue !

2019-05-24T22:02:36.619+0000 I COMMAND  [conn31] command saq.products appName: "monstache" command: find { find: "v_products_quantities", filter: {}, projection: { _id: 1 }, skip: 50000, sort: { _id: 1 }, lsid: { id: UUID("7610460f-552c-4bc2-928b-c0dec6074364") }, $clusterTime: { clusterTime: Timestamp(1558735350, 1), signature: { hash: BinData(0, 591F14551E1F09942DA54466D874C854A0638C6D), keyId: 6682788118359179266 } }, $db: "saq", $readPreference: { mode: "primary" } } planSummary: COLLSCAN numYields:358 ok:0 errMsg:"Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in." errName:Location16819 errCode:16819 reslen:354 locks:{ Global: { acquireCount: { r: 151499 } }, Database: { acquireCount: { r: 151499 } }, Collection: { acquireCount: { r: 151498 } } } storage:{} protocol:op_msg 4416ms

rwynn commented 5 years ago

To get around sort issue for views maybe it would be best to change the findOne call to an aggregate query. That should work for both views and collections and remove the need to detect view.

Monstache already has config option for allowDisk. So if that is true then it would be added to the aggregate options.

rwynn commented 5 years ago

@darkbluestring whenever you get around to it, if you could do a pull and then put the following in your config ...

pipe-allow-disk = true
direct-read-split-max=9

That should hopefully get the sort working on your view and then maybe better performance on the direct reads by splitting.

darkbluestring commented 5 years ago

Hi @Rwynn/Monstache,

I did the test with the fresh pull and the suggested settings. It works as we don't get the error in the log due to the sort {id:1} taking too much memory.

Thanks for that !!!

However, with my view, the performance is worst: 12mn instead of 5mn ( I ran the test multiple times to have stable perf metrics)

I was curious to understand why:

the new aggregate query in DirectReadPages takes now 5mn - 6mn itself,
plus the 5mn-6mn needed by the DirectReadSegments to get the docs like in the previous version.

I took a look at the code to try to better understand what was going on and there are a few things about my understanding that I would like to validate with you:

Since collStats command is not supported on a view (see the screenshot from mongo shell) , segmentSize will always be 50 000 for a view. So if the view contains less document than 50 000, the split-max setting has no effect on such view. it would be equivalent to split-max=1 or split-max=-1. But since we set the split-max setting to a positive value we still execute the aggregate query DirectReadPaged and therefore pay the cost of the aggregation which, in my case, appears to be very huge.

So I suppose, in my use case, Split-Max = -1 is optimal because I have less that 50 000 docs in my view and the cost of the aggregation is prohibitive.

Do you think my understanding is correct ?

rwynn commented 5 years ago

hi @darkbluestring thanks for the feedback after your testing. Based on your comments I took another look and made some more changes. I actually forgot to put the $sort stage into the aggregate call so I corrected that. At this point I think I'm probably done making changes unless you see a major issue. I'll probably just make a new monstache release with the progress we've made thus far.

Anyway, I think maybe not splitting might (like you said) be the most optimal way to read this particular view. I looked over the pipeline that monstache is generating and re-read the aggregate docs and I think it should be using the _id index and thus be pretty fast. However, an aggregate query on a view is complicated because a view is an aggregation pipeline to begin with and there may be some optimizations that might be applied to the view itself that could speed things up.

https://docs.mongodb.com/manual/core/aggregation-pipeline-optimization/

rwynn commented 5 years ago

You may want to look into adding indexes on any lookup fields in your view definition if you haven’t already. It sometimes helps to run an explain on the view just to make sure it is using the indexes.

darkbluestring commented 5 years ago

Hi @rwynn , Thanks for the last update and the advices. I indeed have work to do to optimize that view.

Looking forward to release 5.0.6 with the recent changes.

rwynn commented 5 years ago

Hi @darkbluestring just released a new version. For the optimization you might try the following:

use saq;
db.v_products_quantities.aggregate( [ { $match: {} }, { $sort: {_id: 1} }, { $skip: 1000 }, { $limit: 1 } ], {explain: true} )

This mimics what monstache would do when trying to find segments. In the explain results you want to see IXSCAN and not COLLSCAN. The latter is expensive.

If you see COLLSCAN you might be able to add indexes on the underlying collections of the view until you see only IXSCAN in the explain results winning plan.

sakthiskv commented 5 years ago

ERROR 2019/07/02 18:31:25 Bulk response item: {"_index":"vendor_prod_mgmt.motor_restriction","_type":"_doc","_id":"5cd40327c86a97d6a045d86f","status":400,"error":{"type":"mapper_parsing_exception","reason":"failed to parse field [YEAR] of type [long]","caused_by":{"reason":"For input string: \"2,01,72,018\"","type":"illegal_argument_exception"}}} TRACE 2019/07/02 18:31:31 POST /_bulk HTTP/1.1

when i going to transfer bulk data mongo to elasticsearch am getting "illegal_argument_exception". the data type in mongodb is string,here elasticsearch the type was automatically change to [long]

sakthiskv commented 5 years ago

how to resolve this

rwynn commented 5 years ago

Can you create a separate issue instead of adding comments to this one?

This error usually happens when you have a mix of strings and numbers in your mongodb data for the same field. Elasticsearch gives the field the type of the first value sent and subsequent values are expected to be the same type. You need to map each field to a single data type in Elasticsearch.

sakthiskv commented 5 years ago

i have bulk amount of data. how to map each field??

rwynn / monstache

Error performing direct read of collection: Cursor already in use #225