Hey @clarkmcc, the driver does buffer cursor documents so that a server round-trip is not required for each call to Stream::next. The default server behavior for cursors is to return 101 documents in the initial response from aggregate (or any other cursor-returning command), and then to return 16 MiB worth of documents from each subsequent call to getMore, which is the server command that drivers use to pull in further batches of documents. This means that you may have up to 16 MiB of buffered cursor documents loaded in memory at a time.
This behavior, however, is configurable with the batch_size option, which controls the number of documents returned in the aggregate and getMore responses. I recommend setting this option to a low number to reduce memory usage. Note that this may come with a performance penalty, as more round-trips will be required to retrieve batches from the server.
Excellent, thank you for helping me understand!
I'm seeing extremely high memory usage in my application when I try to stream data out of MongoDB and into another database. Now I'm 99% sure this isn't a problem with this library per se, so I've opened this issue because I want to better understand the semantics of the cursor as it implements the Stream trait.

Essentially, my application runs an aggregate to produce a Cursor<T>, then passes that to a function that accepts a Stream<Item=Result<T, E>>, and that function calls stream.chunks(1000), processing the documents in chunks.

Now, I would have expected the cursor to buffer batches of documents to some extent, but the behavior I'm seeing is that essentially all the data is read from the cursor and held in memory faster than I can chunk and drop it. The memory usage graph basically jumps from 0 to 1 GB and then slowly trails off over time. This means that my application doesn't really see the benefits of streaming, because it seems to read everything into memory first.
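To give a concrete picture of the shape of that consumer, here is a rough sketch (the function name and the destination write are illustrative, not my exact code):

```rust
use futures::stream::{Stream, StreamExt};

// Consumes any Stream<Item = Result<T, E>> in chunks of 1000 items.
async fn copy_in_chunks<T, E, S>(stream: S) -> Result<(), E>
where
    S: Stream<Item = Result<T, E>>,
{
    // chunks(1000) yields Vec<Result<T, E>> of up to 1000 items each.
    let mut chunked = Box::pin(stream.chunks(1000));
    while let Some(batch) = chunked.next().await {
        // Fail fast on the first error in the batch.
        let docs: Vec<T> = batch.into_iter().collect::<Result<Vec<T>, E>>()?;
        // ...write `docs` to the destination database here...
        let _ = docs;
    }
    Ok(())
}
```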
So back to my question: what are the stream semantics of the cursor (i.e., does each .next call on the stream hit the db and load the next document, or is there some sort of batching, etc.), and are there any knobs that I can fiddle with to tune this?