googleapis / nodejs-firestore

Node.js client for Google Cloud Firestore: a NoSQL document database built for automatic scaling, high performance, and ease of application development.
https://cloud.google.com/firestore/
Apache License 2.0

Reads don't complete faster with parallelism #2215

Open michaelAtCoalesce opened 1 week ago

michaelAtCoalesce commented 1 week ago

I have a collection with ~2000 documents, each ~10-20 KB, so roughly 20-40 MB in total.

When I submit a single get() of all documents in this collection via the Node.js SDK, I get a great response time: 11.54 seconds.

However, when I submit 4 at nearly the same time so that they run in parallel, the reads finish barely faster than if I had submitted them sequentially: ~44 seconds.

I would expect the concurrent get() case to be slightly slower, but not to scale nearly linearly like this.


Steps to reproduce

Create a Firestore collection with a sufficient number of sufficiently large documents. Execute a get() on that collection and see that it completes fine as a single request. Then submit multiple concurrent get() read operations and notice that the total time scales almost linearly with the number of requests.
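A minimal reproduction sketch along these lines (the collection name 'docs' and the timing harness are illustrative assumptions, not taken from the issue):

```ts
import { Firestore } from '@google-cloud/firestore';

const firestore = new Firestore();

// Fetch the whole collection once and log how long it took.
async function timedGet(label: string): Promise<void> {
  const start = Date.now();
  const snapshot = await firestore.collection('docs').get();
  console.log(`${label}: ${snapshot.size} docs in ${(Date.now() - start) / 1000}s`);
}

async function main(): Promise<void> {
  // Single request: ~11.5s in the reporter's environment.
  await timedGet('single');

  // Four identical requests in parallel: total time scales almost linearly (~44s reported).
  await Promise.all([1, 2, 3, 4].map((i) => timedGet(`parallel #${i}`)));
}

main().catch(console.error);
```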

michaelAtCoalesce commented 1 week ago

I also tried using readOnly: true on a transaction, which didn't seem to help, and I also tried pagination. Neither was faster.
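For reference, a hedged sketch of what the pagination attempt might have looked like (the page size, the orderBy field, and the collection name are assumptions, not details from the issue):

```ts
import { Firestore, QueryDocumentSnapshot } from '@google-cloud/firestore';

const firestore = new Firestore();

// Fetch the collection in fixed-size pages using a cursor (startAfter).
async function paginatedGet(pageSize = 200): Promise<QueryDocumentSnapshot[]> {
  const docs: QueryDocumentSnapshot[] = [];
  let last: QueryDocumentSnapshot | undefined;

  while (true) {
    let query = firestore.collection('docs').orderBy('__name__').limit(pageSize);
    if (last) query = query.startAfter(last);

    const page = await query.get();
    docs.push(...page.docs);
    if (page.size < pageSize) break; // last page reached
    last = page.docs[page.docs.length - 1];
  }
  return docs;
}

paginatedGet().then((docs) => console.log(`fetched ${docs.length} documents`));
```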

tom-andersen commented 1 week ago

You may want to try stream(). You should receive documents as they arrive, thereby avoiding the delay. Please let us know if you experience a performance improvement by doing this.
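A sketch of the stream() approach being suggested, which processes documents incrementally instead of buffering a full QuerySnapshot (the collection name is an illustrative assumption):

```ts
import { Firestore, QueryDocumentSnapshot } from '@google-cloud/firestore';

const firestore = new Firestore();

async function streamCollection(): Promise<number> {
  let count = 0;
  // Query.stream() emits QueryDocumentSnapshot objects as the backend returns them.
  for await (const doc of firestore.collection('docs').stream()) {
    const snapshot = doc as QueryDocumentSnapshot;
    count += 1;
    // Handle each document here instead of buffering the whole collection in memory.
    if (count === 1) console.log(`first document arrived: ${snapshot.id}`);
  }
  return count;
}

streamCollection().then((n) => console.log(`streamed ${n} documents`));
```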

tom-andersen commented 1 week ago

@michaelAtCoalesce Are the 4 requests for the same query as the single request? If so, you are requesting 4 times as much data, and your network might be the bottleneck. In fact, the linear scaling likely indicates that bandwidth is being saturated.

Firestore might also be the bottleneck, in which case you should also understand scaling traffic. Firestore will dynamically add more capacity as required, but this takes time.

https://firebase.google.com/docs/firestore/best-practices#ramping_up_traffic
https://firebase.google.com/docs/firestore/understand-reads-writes-scale#avoid_hotspots

michaelAtCoalesce commented 1 week ago

Yes, it's the same request. It's not that much data (20 megabytes), so I don't think it's a matter of bandwidth. I'm on very fast gigabit internet on a beefy machine; I think it's something related to the Firestore backend.

Is there something potentially going on with how reads occur in the backend? Doesn’t Firestore do some kind of optimistic locking on reads that might cause this kind of behavior if multiple readers of a collection are executing?

In this case, I’d be okay with an older snapshot of the data or one from a cache, as long as it was consistent. Is there a way to do that? I tried a readOnly transaction and it didn’t appear to help performance either.

tom-andersen commented 1 week ago

@michaelAtCoalesce I just noticed that you have localhost in your log output. Are you running against the emulator?

michaelAtCoalesce commented 1 week ago

> @michaelAtCoalesce I just noticed that you have localhost in your log output. Are you running against the emulator?

No, live Firestore

tom-andersen commented 1 week ago

> Is there something potentially going on with how reads occur in the backend? Doesn’t Firestore do some kind of optimistic locking on reads that might cause this kind of behavior if multiple readers of a collection are executing?

Your queries won't lock anything in read-only transactions, nor outside of transactions.

Optimistic concurrency doesn't use locks at all. This is what some of the other Firestore SDKs use.

This SDK only takes locks within a transaction, and much of that has been optimized away. Since your test doesn't use transactions, locks should not be a concern.

tom-andersen commented 1 week ago

> In this case, I’d be okay with an older snapshot of the data or one from a cache, as long as it was consistent. Is there a way to do that? I tried a readOnly transaction and it didn’t appear to help performance either.

There is an optimization where you specify a read time. By doing so, Firestore can serve the data from the closest replica.

See: https://firebase.google.com/docs/firestore/understand-reads-writes-scale#stale_reads
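A hedged sketch of such a stale read, using a read-only transaction with an explicit readTime (this option is available in recent @google-cloud/firestore releases; the 15-second staleness offset and the collection name are illustrative assumptions, and the backend imposes limits on how stale a readTime can be):

```ts
import { Firestore, Timestamp } from '@google-cloud/firestore';

const firestore = new Firestore();

async function staleRead(): Promise<void> {
  // Read the collection as it looked ~15 seconds ago, which allows Firestore to
  // serve the data from the closest replica instead of the primary.
  const readTime = Timestamp.fromMillis(Date.now() - 15_000);

  const snapshot = await firestore.runTransaction(
    async (tx) => tx.get(firestore.collection('docs')),
    { readOnly: true, readTime },
  );

  console.log(`read ${snapshot.size} documents as of ${readTime.toDate().toISOString()}`);
}

staleRead().catch(console.error);
```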

michaelAtCoalesce commented 3 days ago

Update: I did another test with two separate processes, submitting a request through each process in parallel. They complete in parallel just fine, so it appears that something specific to executing parallel reads within a single Node process is causing this.
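A sketch of that two-process experiment (forking the same script is just one way to run it; the collection name and process count are assumptions):

```ts
import { fork } from 'node:child_process';
import { Firestore } from '@google-cloud/firestore';

// Each process creates its own Firestore client, so it gets its own gRPC channel
// and event loop, unlike parallel gets within a single Node process.
async function runQuery(): Promise<void> {
  const start = Date.now();
  const snapshot = await new Firestore().collection('docs').get();
  console.log(`pid ${process.pid}: ${snapshot.size} docs in ${(Date.now() - start) / 1000}s`);
}

if (process.env.CHILD === '1') {
  // Child process: execute a single get() and exit.
  runQuery().then(() => process.exit(0));
} else {
  // Parent process: fork two children so each query runs in its own process.
  for (let i = 0; i < 2; i++) {
    fork(process.argv[1], [], { env: { ...process.env, CHILD: '1' } });
  }
}
```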

I'm also noticing that for a ~20 MB payload, memory usage goes up by about 800 MB (this is with the preferRest option). It may be related to how quickly memory usage grows for this test case, to the point where it becomes a problem. It might be worth investigating why a single get() of ~20 MB worth of data causes a spike of ~800 MB in memory.
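A hedged sketch of how that spike could be measured, sampling process.memoryUsage() around a single get() (preferRest comes from the issue; the collection name and measurement harness are assumptions):

```ts
import { Firestore } from '@google-cloud/firestore';

// preferRest makes the client use the REST transport for reads where possible.
const firestore = new Firestore({ preferRest: true });

function mb(bytes: number): string {
  return `${(bytes / 1024 / 1024).toFixed(1)} MB`;
}

async function measure(): Promise<void> {
  const before = process.memoryUsage();
  const snapshot = await firestore.collection('docs').get();
  const after = process.memoryUsage();

  console.log(`fetched ${snapshot.size} docs`);
  console.log(`rss delta: ${mb(after.rss - before.rss)}, heap delta: ${mb(after.heapUsed - before.heapUsed)}`);
}

measure().catch(console.error);
```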