Closed: dmcquay closed this issue 10 months ago
Hello @dmcquay,
Thank you for the detailed explanation. Could you please assist us in reviewing the code change in pull request #110?
@sgandhi1311 That's great! Thank you for the quick action. I'll take a look.
@sgandhi1311 are you affiliated with Stitch? Can you provide an estimate of when this enhancement will be available in Stitch?
The changes are deployed in prod. Thanks!
Quick summary of problem
In the oplog strategy, when fetching the oplog entries, this for loop will trigger a cursor timeout if processing a batch of oplog entries takes too long. The timeout can come from the cursor idle timeout (default 10 minutes) or from the session timeout (default 30 minutes), which causes all cursors in that session to be removed.
Proposed solution
We need to refresh the session periodically. There is no "refresh_session" call, so we just need to issue something lightweight, such as a ping. We also need to create an explicit session and pass it into both the oplog find call AND the ping call so that they use the same session; otherwise they will each get separate sessions and the ping will not keep the find call's session alive.
We should also pass no_cursor_timeout=True to the find call while we're at it, to ensure we are only limited by the session timeout (default 30 minutes) and not the cursor timeout (default 10 minutes). A rough sketch of this approach is below.
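For illustration only, this is roughly what that could look like with pymongo. The connection string, the starting query, the refresh interval, and the per-row print are assumptions for the sketch, not tap-mongodb's actual code.

import time
from bson.timestamp import Timestamp
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

# Start from "now" for illustration; the real tap would use its saved resume timestamp.
oplog_query = {"ts": {"$gte": Timestamp(int(time.time()), 0)}}

REFRESH_INTERVAL_SECONDS = 5 * 60  # ping well before the 30-minute session timeout

with client.start_session() as session:
    cursor = client.local["oplog.rs"].find(
        oplog_query,
        session=session,          # explicit session, shared with the ping below
        no_cursor_timeout=True,   # only the session timeout applies now
    )
    last_refresh = time.time()
    for row in cursor:
        print(row["op"], row.get("ns"))  # stand-in for the tap's per-row processing
        if time.time() - last_refresh > REFRESH_INTERVAL_SECONDS:
            # Any lightweight command issued on the same session refreshes it.
            client.admin.command("ping", session=session)
            last_refresh = time.time()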
Reproduce
I have reproduced the problem in this repo. Just follow the README.
Explanation of why the cursor times out
By default, a session is created for each query. Cursors belong to a session. When a session times out, the session becomes eligible for deletion. However, there seems to be a delay until it is actually killed. I have observed this delay to be around 7-10 minutes in my local environment. I'm going to refer to this combined session timeout + delay until session is actually killed as EFFECTIVE_SESSION_TIMEOUT.
In a production environment, SESSION_TIMEOUT will typically be 30 minutes (the docs warn against changing it). Add the kill delay, let's assume around 10 minutes, and a production EFFECTIVE_SESSION_TIMEOUT is around 40 minutes, give or take.
In this local project, the SESSION_TIMEOUT is set to 1 minute and the delay is around 8 minutes, so the EFFECTIVE_SESSION_TIMEOUT is ~9 mins.
When iterating on a cursor, pymongo does not communicate with the server on each iteration of the loop. Rather, it fetches data in batches, and a session is only "refreshed" when the next batch is fetched. A batch_size (a document count) can be set on a query; when specified, each batch will contain exactly that many documents, provided there are enough docs remaining to be fetched and the total batch does not exceed 16 MB. If batch_size is not specified (tap-mongodb does not specify batch_size), then the first batch will be exactly 101 docs and all remaining batches will be a full 16 MB. A small illustration of batch_size follows.
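For illustration only (an assumption for this sketch, not something tap-mongodb currently does), capping batch_size forces a server round trip, and therefore a session refresh, more often:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string

# With batch_size set, pymongo issues a getMore (which touches the session)
# every 1000 documents instead of only once per ~16 MB batch.
cursor = client.local["oplog.rs"].find({}, batch_size=1000)
for doc in cursor:
    pass  # process the oplog entry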
I was able to observe the actual max batch size when fetching oplog entries. I did this in a controlled environment where the oplog contained insert operations only. I inserted about a million docs. I increased logging by running
db.setProfilingLevel(0,-1)
in mongosh. I ran tap-mongodb in log mode. I then searched the mongo logs with
dc logs | grep 'oplog' | grep 'nreturned'
where I was able to see that the first batch was 101 and the second was 153529.
Therefore, to cause a timeout, it must take longer than EFFECTIVE_SESSION_TIMEOUT to process a batch of 153529 oplog entries (that number may of course vary based on the size of your oplog docs, but this serves as a baseline). In my local, controlled environment, it is exactly that number.
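In other words, with the local EFFECTIVE_SESSION_TIMEOUT of ~9 minutes (roughly 540 seconds), processing only needs to average more than about 540 / 153529 ≈ 3.5 ms per oplog entry for the session to expire mid-batch.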
tap-mongodb can process 153529 oplog docs very fast, but it writes to stdout, which is piped into another process. UNIX pipe buffers are typically 16 KB, and if the consumer of the pipe isn't reading as fast as the tap writes, the buffer fills up. When the pipe buffer is full, Python flush calls block, and singer.write_message calls flush with every print call.
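For illustration, a hypothetical downstream consumer roughly like this sketch is enough to create that backpressure:

import sys
import time

# Read the tap's stdout slowly so the pipe buffer fills and the tap blocks on flush.
for line in sys.stdin:
    time.sleep(0.05)  # simulate an expensive per-record load into the target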
Therefore, if the target is slow (as it can be when the source mongo docs are huge, nested, etc. and the target has to map them into a table/columnar structure), the pipe fills up and tap-mongodb stalls in the middle of a batch. In my production environment, this is the case. I have emulated it in this local proof of concept with slow-cosumer.py.
See Also