spring-projects / spring-data-mongodb

Provides support to increase developer productivity in Java when using MongoDB. Uses familiar Spring concepts such as template classes for core API usage and lightweight repository-style data access.
https://spring.io/projects/spring-data-mongodb/
Apache License 2.0

Add an option to specify the cursor.batchSize() for repository methods returning streams. [DATAMONGO-1311] #2225

Closed. spring-projects-issues closed this issue 6 years ago.

spring-projects-issues commented 8 years ago

Christian Schneider opened DATAMONGO-1311 and commented

It would be great if you provided an option to set cursor.batchSize() for streaming query results.

For ETL jobs that process many gigabytes of data, streaming results is already heaven on earth compared to paging. The default MongoDB cursor implementation sets batchSize to 0, which means the database chooses it.

In my configuration the batch size appears to be very small; I observed this when fetching data from a remote database.

Java MongoDB Driver BatchSize Option

Sidenote

I couldn't verify that overriding the batchSize gives the expected performance boost.
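To see why a too-small batch size hurts against a remote database: every batch costs one network round trip (the first via the initial query, the rest via getMore), so latency is multiplied by the number of batches. A back-of-the-envelope sketch in pure Java; the document counts, batch sizes, and 20 ms round-trip time below are illustrative assumptions, not measurements from this issue:

```java
// Estimate cursor round trips and pure network-latency cost for a full scan.
// Round trips = ceil(totalDocs / batchSize); each one pays the link's RTT.
public class BatchSizeCost {

    static long roundTrips(long totalDocs, int batchSize) {
        return (totalDocs + batchSize - 1) / batchSize; // ceiling division
    }

    static long latencyMillis(long totalDocs, int batchSize, long rttMillis) {
        return roundTrips(totalDocs, batchSize) * rttMillis;
    }

    public static void main(String[] args) {
        long docs = 1_000_000;
        long rtt = 20; // assumed ms round trip to a remote database

        // A small (server-chosen) batch means many getMore calls.
        System.out.println(roundTrips(docs, 101));          // 9901
        System.out.println(latencyMillis(docs, 101, rtt));  // 198020 ms of latency

        // An explicitly larger batch amortizes that latency.
        System.out.println(roundTrips(docs, 5_000));         // 200
        System.out.println(latencyMillis(docs, 5_000, rtt)); // 4000 ms
    }
}
```

The roughly 50x difference in round trips is what an exposed cursor.batchSize() option would let callers tune.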


Issue Links:

Referenced from: pull request https://github.com/spring-projects/spring-data-mongodb/pull/575

2 votes, 6 watchers

spring-projects-issues commented 8 years ago

Oliver Drotbohm commented

Wondering whether it makes sense to expose a configurable CursorPreparer (which is already used internally) to allow tweaking the cursor setup. Do you think these settings should be applied globally (per MongoTemplate), per domain type, per query, or per query execution?

spring-projects-issues commented 8 years ago

Sylvain LAURENT commented

Hello,

I'm also interested in support for cursor batch size, for the following case: I retrieve documents as a stream and iterate over them. For various reasons I keep only between 15 and 40 documents (more or less), depending on the content of each document (and on some external data, which is why I cannot filter directly in MongoDB). Then I close the cursor.

I noticed poor performance because the documents are quite big and, by default, the first batch retrieves 100 (actually 101) documents, which is a waste in my case. To answer Oliver's question, I think such a setting should be exposed at least at the query level, and maybe at the entity level. In any case, this would mainly be useful for queries that stream, since queries that return a List retrieve all matching documents anyway.

spring-projects-issues commented 8 years ago

Rob Moore commented

I am using streaming and am running into an issue currently that seems to be related to the batch size. The operation being performed for each stream result takes some time and I'm seeing errors like the following:

java.util.concurrent.CompletionException: org.springframework.dao.DataAccessResourceFailureException: Query failed with error code -5 and error message 'Cursor 43827425629 not found on server xxx:11001' on server xxx:11001; nested exception is com.mongodb.MongoCursorNotFoundException: Query failed with error code -5 and error message 'Cursor 43827425629 not found on server xxx:11001' on server xxx:11001

I believe the problem might be resolved if we could make the batch size smaller than the server default (100, I believe), as it would keep the connection active. This thinking is motivated by a suggestion made on the mongodb-user group: https://groups.google.com/forum/#!msg/mongodb-user/n1OAHPJ5FNA/oBIxevjA2ewJ

spring-projects-issues commented 8 years ago

Mark Paluch commented

Rob Moore, your issue could probably be solved by using smaller batch sizes or by disabling cursor timeouts; see DATAMONGO-1480.

spring-projects-issues commented 8 years ago

Rob Moore commented

@mpaluch I'm not sure I follow you, but I think we're on the same page. I was hoping for a batch size option on repository methods so I could configure it in an attempt to address the issue I'm seeing.

spring-projects-issues commented 8 years ago

Mark Paluch commented

The issue you described can be solved in two ways: either by decreasing the batch size, so you interact with the cursor more often, or by disabling the cursor timeout. The former keeps the cursor alive through more frequent fetches; the latter makes the cursor stay available no matter how long your process remains active, but requires explicit resource cleanup.
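The first option can be made concrete: the server reaps a cursor that receives no getMore within its idle timeout (10 minutes by default), so the batch size must be small enough that processing one whole batch finishes inside that window. A rough budget calculator in pure Java; the 10-minute default and the 8-second per-document cost are illustrative assumptions:

```java
// Largest batch size that still issues the next getMore before the
// server-side cursor idle timeout expires, given per-document work.
public class CursorTimeoutBudget {

    static final long DEFAULT_CURSOR_TIMEOUT_MS = 10 * 60 * 1000; // 10 minutes

    static int maxSafeBatchSize(long perDocMillis, long timeoutMillis) {
        // Processing an entire batch must fit inside the timeout window;
        // never go below a batch of 1.
        return (int) Math.max(1, timeoutMillis / perDocMillis);
    }

    public static void main(String[] args) {
        // Assume 8 seconds of work per streamed document:
        int safe = maxSafeBatchSize(8_000, DEFAULT_CURSOR_TIMEOUT_MS);
        System.out.println(safe); // 75 -> the ~100-document server default
                                  // would overrun the timeout mid-batch and
                                  // produce a CursorNotFound error
    }
}
```

This is exactly the failure mode in Rob's stack trace above: slow per-document processing plus a default-sized batch starves the cursor of getMore calls.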

spring-projects-issues commented 8 years ago

Rob Moore commented

Agreed but are you suggesting this ticket is unnecessary? Are you suggesting that we break the query into smaller queries instead (that is, manage the batching outside of the repository method)?

spring-projects-issues commented 6 years ago

Mark Paluch commented

My message is a different one: Controlling the batch size allows fine grained control over chunks and fetching when consuming results through a stream. Especially when looking at reactive APIs, the batch size is derived from the subscriber demand and this can lead to a lot of getMore calls.

I'm not sure whether it makes sense to tweak defaults per repository, per query, or per execution. Setting the batch size per query would be a first step, and for Template API usage we could leverage Query metadata as we already do for e.g. execution time. A possible usage could look like:

interface PersonRepository extends CrudRepository<Person, String> {

  @Meta(batchSize = 512)
  Stream<Person> findAllBy();
}

Additionally, we could consider accepting org.springframework.data.mongodb.core.query.Meta arguments in query methods to apply query hints per execution.
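For illustration, the repository infrastructure could consume such an annotation via plain reflection when preparing the cursor. A self-contained sketch, assuming the annotation shape proposed above; the @Meta annotation and PersonRepository below are hypothetical stand-ins defined locally, not the real Spring Data types:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.stream.Stream;

// Hypothetical stand-in for the proposed @Meta(batchSize = ...) annotation.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Meta {
    int batchSize() default 0; // 0 = let the server choose
}

// Stand-in repository interface mirroring the usage sketched above.
interface PersonRepository {
    @Meta(batchSize = 512)
    Stream<Object> findAllBy();
}

public class MetaLookup {

    // What a repository proxy could do when setting up the stream:
    // read the declared batch size and hand it to cursor.batchSize(...).
    static int batchSizeFor(Class<?> repo, String methodName) throws Exception {
        Method method = repo.getMethod(methodName);
        Meta meta = method.getAnnotation(Meta.class);
        return meta != null ? meta.batchSize() : 0;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(batchSizeFor(PersonRepository.class, "findAllBy")); // 512
    }
}
```

A per-method annotation like this keeps the setting next to the query declaration, which matches Sylvain's request for query-level control while leaving the global default untouched.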