rwynn / monstache

a go daemon that syncs MongoDB to Elasticsearch in realtime. you know, for search.
https://rwynn.github.io/monstache-site/
MIT License

[BUG] Monstache stops tailing after a period of time. #374

Open asanepal opened 4 years ago

asanepal commented 4 years ago

Describe the bug I have been using Monstache on a collection in my database. Monstache stops tailing after some period, with no error in the Monstache logs. This collection has an average document size of 20KB and 50k documents in total. The oplog generated for this collection is high in volume. When I look at the logs on my mongos server, I see a "BSONObjectTooLarge" error corresponding to the aggregation query made by Monstache. Some time later I see a "Resume point not in the oplog" error. I suspect the default batch size of 512 is causing the issue in this scenario.

To Reproduce

Monstache configuration:

exit-after-direct-reads = false
dropped-collections = true
replay = false
stats = true
elasticsearch-retry = true
direct-read-no-timeout = true

I tried using the resume and replay configs too; that did not work either.

Additional context I think the default batch size of 512 is generating a BSON object that exceeds 16MB. Providing a config option to adjust the batch size might help. Please help me with any workarounds.

rwynn commented 4 years ago

Hi @asanepal, are you able to reproduce the same error with monstache v5.5.3?

Also, can you point me to the 512 batch size you are referring to? I think there is currently a batch size applied to direct reads, but not to change streams, as far as I am aware.

rwynn commented 4 years ago

@asanepal I think you're right... I have the gtm channel size linked to the batch size set on the change stream. You can reduce that value by adding something like this to the end of your config file...

[gtm-settings]
channel-size = 20
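For reference, a sketch of how that could look appended to the settings already quoted in this thread (the other values are the reporter's, kept only for illustration):

exit-after-direct-reads = false
dropped-collections = true
replay = false
stats = true
elasticsearch-retry = true
direct-read-no-timeout = true

[gtm-settings]
channel-size = 20
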
asanepal commented 4 years ago

Thanks for the quick response. I haven't tried 5.5.3 yet. The 512 batch size is what I could see in the mongos logs:

"monstache" command: getMore { getMore: 4010580237408541484, collection: “XXXXXX”, lsid: { id: UUID("9c12a30b-b40e-4dec-87a7-7b96fafd4475") }, $clusterTime: { clusterTime: Timestamp(1586476800, 73), signature: { hash: BinData(0, XXXXXX), keyId: 6766134630207717466 } }, $db: “XXXX” } originatingCommand: { aggregate: “XXXXXX”, pipeline: [ { $changeStream: { fullDocument: "updateLookup", resumeAfter: { _data: "825E8F7651000000462B022C0100296E5A100434E8D62137264830BE95F5C3CA31F58A463C5F6964003C6376652D323031382D31313530000004" } } } ], cursor: { batchSize: 512 }, readConcern: { afterClusterTime: Timestamp(1586476754, 224) }, lsid: { id: UUID("9c12a30b-b40e-4dec-87a7-7b96fafd4475") }, $clusterTime: { clusterTime: Timestamp(1586476754, 224), signature: { hash: BinData(0, XXXXXX), keyId: 6766134630207717466 } }, $db: “XXXX” } nShards:2 cursorid:4010580237408541484 numYields:0 ok:0 errMsg:"BSONObj size: 18645786 (0x11C831A) is invalid. Size must be between 0 and 16793600(16MB) First element: _id: { _data: \"825E8F7653000000DF2B022C0100296E5A100434E8D62137264830BE95F5C3CA31F58A463C5F6964003C6376652D323031372D38373539000004\" }" errName:BSONObjectTooLarge errCode:10334 reslen:457 protocol:op_msg 6560ms Thanks for the work around. I will try it out.

asanepal commented 4 years ago

Hi Ryan, reducing the gtm channel size by half, and thereby the batch size to 250, did not solve the issue. Could you suggest what else I can try in this scenario?

rwynn commented 4 years ago

Hi @asanepal, are you now running monstache v5.5.3, which includes the latest Go driver from MongoDB, Inc.? If not, can you give that a try?

asanepal commented 4 years ago

Hi Ryan, I will upgrade to v5.5.3 and give it a try. Thank you for the suggestion.

Zjcompt commented 4 years ago

@rwynn I am running into the same issue with Monstache v6.5.1.

MongoDB running on version 4.0.18

ERROR 2020/04/27 13:25:09 Error starting change stream. Will retry: (ChangeStreamFatalError) resume of change stream was not possible, as the resume token was not found. {_data: "825EA341260000003B2B022C0100296E5A1004AD1D1DBAD91E41AD805C506873ABD164463C5F6964003C476F463267545947597576586179616A6F000004"}
ERROR 2020/04/27 13:25:09 Error starting change stream. Will retry: (BSONObjectTooLarge) BSONObj size: 17178834 (0x10620D2) is invalid. Size must be between 0 and 16793600(16MB) First element: _id: { _data: "825EA319DD000000072B022C0100296E5A1004AD1D1DBAD91E41AD805C506873ABD164463C5F6964003C533155654271374B7A6B4A6C3478624E35516959000004" }
ERROR 2020/04/27 13:25:09 Error starting change stream. Will retry: (ChangeStreamFatalError) resume of change stream was not possible, as the resume token was not found. {_data: "825EA341260000003B2B022C0100296E5A1004AD1D1DBAD91E41AD805C506873ABD164463C5F6964003C476F463267545947597576586179616A6F000004"}

This error gets spammed endlessly in the error log. I am unsure of when it began, as it generated very large log files. This server had been running well for a long period of time, but it seems this started over the weekend for us.

Here are some of our config settings:

gzip = true
stats = false
index-stats = false
replay = false
resume = true
resume-write-unsafe = false
index-files = false
direct-read-no-timeout = true
direct-read-split-max = 5
exit-after-direct-reads = false
prune-invalid-json = true

[gtm-settings]
channel-size = 20000

A restart and resync made the issue go away for now.

rwynn commented 4 years ago

@Zjcompt thanks for the info. I have some ideas on how to address this in the next release.

The batch size of the change stream will not be tied to the channel-size setting going forward. Also, I have added more error handling so hopefully it can recover from the errors you posted.

The code has been pushed; I just need to cut a new release.

rwynn commented 4 years ago

@Zjcompt @asanepal Please try monstache v6.5.4 (or v5.5.4). I put some changes in place to help with this issue. Let me know if the change stream stops and doesn't recover. Thanks!

asanepal commented 4 years ago

Hello Ryan, I tried with 6.5.4. The issue persists. I am a bit confused as to whether this is actually related to the batch size or is due to some other setting.

rwynn commented 4 years ago

@asanepal do you see any errors reported in monstache or mongodb logs since switching to 6.5.4?

asanepal commented 4 years ago

No errors reported by Monstache, but in the mongos logs I see the same BSONObjTooLarge error as posted above. I noticed we do have documents that are about 8-10 MB. Maybe requesting the full document in the change stream (updateLookup) is causing the BSONObjTooLarge error when these big documents are present in the batch.

george-xu-hs commented 4 years ago

Hi, any update on this? Our Monstache is also failing to recover after pausing, even on v6.5.4. It was working fine for a month before suddenly pausing and not recovering a couple of weeks ago. It works after restarting, but loses the resume point and stalls again several hours later.

Not seeing any errors in Monstache or Mongo other than the Pausing work for cluster foo message.

rwynn commented 4 years ago

Hi @george-xu-hs, would you be able to try the latest version v6.5.5 with cluster-name set to empty string and let me know if monstache fails to recover?

When you notice monstache get stuck can you check the MongoDB logs again to see if there are any queries periodically being sent with the appName field set to monstache (or if, when it gets stuck, there are no queries coming from monstache at all)?
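For reference, one quick way to spot-check this from the mongo shell is to filter current operations by appName; a sketch, assuming monstache reports "monstache" as its application name on the driver connection as described above:

// in-progress operations whose client reported the application name "monstache"
db.currentOp({ appName: "monstache" })

// equivalent check via the $currentOp aggregation stage (run against the admin database)
db.getSiblingDB("admin").aggregate([
    { $currentOp: {} },
    { $match: { appName: "monstache" } }
])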

george-xu-hs commented 4 years ago

Thanks @rwynn for the quick turnaround. I'm trying it now and I'll let you know how it goes.

george-xu-hs commented 4 years ago

It's been running for more than a day now with no problems so far; I'll check again over the weekend. It used to fail and not recover within several hours of running, so this is promising.

Is this potentially related to running in a cluster? I removed the cluster-name as you asked and only have a single instance of Monstache running now, whereas before I had two instances running in a cluster.

rwynn commented 4 years ago

Hi @george-xu-hs, I suspected the cluster feature may have been failing when you mentioned the message about Pausing work.

The way it's supposed to work is that if the active process in the cluster goes down for some reason, one of the hot standby processes will enable itself. The log looks like this when the standby process realizes the active has stopped and kicks in to compensate...

INFO 2020/05/22 18:07:55 Joined cluster foo
INFO 2020/05/22 18:07:55 Pausing work for cluster foo
INFO 2020/05/22 18:08:45 Resuming work for cluster foo
INFO 2020/05/22 18:08:45 Watching changes on the deployment
INFO 2020/05/22 18:08:45 Listening for events
INFO 2020/05/22 18:08:45 Resuming from timestamp {T:1590185325 I:2}

The whole cluster feature is predicated upon the TTL index feature of MongoDB. I didn't notice this feature failing for me in a quick test. But there may be some issue you're hitting that I'm not able to reproduce.

I'm not sure how necessary this feature (clustering) is, considering there are so many ways to monitor processes for health and restart or replace them. The enable-http-server flag can be used to start an internal web server in monstache to answer health check requests, for example.
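As an example of that alternative, a minimal config sketch (enable-http-server is the flag mentioned above; http-server-addr and the exact endpoints served are assumptions here, so check the monstache documentation for your version):

enable-http-server = true
# address for the internal web server to listen on (assumed value for illustration)
http-server-addr = ":8080"

A process supervisor or load balancer can then probe that address for liveness instead of relying on the cluster hand-off.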

exentrich commented 4 years ago

After an hour of research I finally found the cause of this error. In my case, BSONObjectTooLarge happens because one of the documents is very, very large.

I found it by running this aggregation:

db.getCollection("raindrops").aggregate([
{
    $changeStream: {
        fullDocument: "updateLookup",
        startAtOperationTime: Timestamp(1596624257, 16)
    }
}
])

In startAtOperationTime you need to put a timestamp starting from when you first saw the BSONObjectTooLarge error. Run this aggregation again and again, incrementing startAtOperationTime, until you get the error (bson object too large...). Now comment out the fullDocument field in the aggregation and run it again. At this point you will get a list of documents and their IDs; one of those documents is enormously large, so just traverse each of them and you will find it.

Bottom line: this error is not related to Monstache itself and cannot be fixed by tuning Monstache configs.
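As an alternative to walking the change stream by hand, on MongoDB 4.4 or newer the $bsonSize operator can rank documents by size directly; a sketch using the collection name from the example above:

// report the five largest documents in the collection by raw BSON size
db.getCollection("raindrops").aggregate([
    { $project: { size: { $bsonSize: "$$ROOT" } } },
    { $sort: { size: -1 } },
    { $limit: 5 }
])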

asanepal commented 2 years ago

@rwynn @exentrich updateLookup is causing the issue in my case. We have a field that is an array of embedded documents; that field alone has a BSON size above 10 MB. When a new embedded document is added to that field, the corresponding change stream event by default contains the value of the entire field under the updateDescription field. When updateLookup is passed, the change stream event also includes the full document under the fullDocument field. So we now have two fields in the change stream event that contain the updated field: updateDescription and fullDocument. 10 MB + 10 MB > 16 MB, hence the error.
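For illustration, change streams accept trimming stages such as $unset after $changeStream, so the duplicate copy can be dropped from each event; whether that avoids the server-side error in every case is not guaranteed, and on MongoDB 7.0 or newer the $changeStreamSplitLargeEvent stage is the purpose-built option. A shell sketch against a hypothetical collection (this shows the event shape, not a monstache setting):

// drop updateDescription so only the fullDocument copy of the large field remains in each event
db.getCollection("mycollection").watch(
    [ { $unset: "updateDescription" } ],
    { fullDocument: "updateLookup" }
)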