Need guidance on use of preferRest; FIRESTORE_PREFER_REST is now being set to true by default in firebase functions causing preferRest to be default rather than opt-in

sbhvt commented 10 months ago

After our last round of deployments to firestore functions, we started to see a big increase in execution times in our functions (about double) and started to get frequent ECONNRESET errors in all newly deployed functions.

I did a lot of troubleshooting including rolling back our code to the code from prior deployments and nothing fixed our issue. I saw this issue: https://github.com/firebase/firebase-admin-node/issues/2243 about someone having ECONNRESET issues after enabling preferRest, and I turned on setLogFunction for firestore to see if I could see what was going on. I saw this in the logs:

Firestore (6.7.0) 2023-11-13T12:18:33.460Z epLhF [ClientPool.acquire]: Creating a new client (requiresGrpc: false) 
Firestore (6.7.0) 2023-11-13T12:18:33.559Z ##### [clientFactory]: Initialized Firestore GAPIC Client (useFallback: rest)

I redeployed with {preferRest:false} and immediately the execution times returned to their previous times. Obviously our code performs better with this setting as false, so explicitly setting it to false is at least a short-term resolution for our needs.

I looked through release notes and noticed this: https://github.com/googleapis/nodejs-firestore/pull/1848; so I added a console.log for the FIRESTORE_PREFER_REST env variable and, sure enough, it's explicitly set to true in our environments. But we have set this nowhere in our code or in any configuration.

My questions are twofold:

1) when did the functions environments start defaulting FIRESTORE_PREFER_REST environment variable? I will log a support ticket to Google for this, since it seems not related to this package, but I wanted to note it here for awareness, especially if anyone else is wasting cycles on tracking this down like I was. As of a few weeks ago, this setting must not have been defaulting to true because we only saw the performance issues arise after our most recent deployment (Nov 7); as I mentioned, I tried redeploying the exact same code (including all sdk library versions) from a prior deployment and saw performance issues that we did not see with last deployment. In both cases the entire code was the same, we were using version 6.7.0 of this library. If this environment variable was turned on in all new deployments, I would have expected to see this in release notes somewhere. I cannot find any record of this. This functionality is documented as opt-in behavior, but setting the env variable by default makes it opt-out, so surely that should be communicated. A lot of time was spent on this; it's not a normal expectation to redeploy the same exact code and have a performance hit without any clues about infrastructure changes that could be the reason behind this. 2) what is the recommendation for use of preferRest moving forward? As I mentioned the performance degradation was really large for us. Do you have information on what types of workloads you'd expect performance improvements versus which would expect performance degradation?

davidkychen commented 10 months ago

We aer encountering a similar issue.

In the recent update to version 12.8.0 of w9jds/firebase-action, which is using the 12.8.0 version of firebase-tool, the environment variable FIRESTORE_PREFER_REST is being automatically set to true. We've observed a critical issue where this change is causing some cloud functions (interact with Firestore with multiple reads), which are hosted through Firebase hosting, to get triggered multiple times (observed at least 10 times more than expected).

This behavior of multiple triggers does not occur when the function is invoked directly through its original Cloud Function trigger URL. However, the issue arises when the function is called via the Firebase hosting route.

A temporary workaround we found is to unset the FIRESTORE_PREFER_REST variable or using an older version to deploy, which resolves the issue of multiple triggers.

MarkDuckworth commented 10 months ago

@sbhvt, Sorry to hear this change has caused performance degradation for you. preferRest was introduced to reduce function cold start times associated with loading and initializing the grpc stack. There's nothing wrong with choosing to set preferRest: false in the long run. We've had reports of perferRest helping customers, but I can't recall one that has show serious performance degradation. If possible, we'd like to learn more about your workloads to help us understand why you might be experiencing this.

Also, I forwarded this issue with the cloud functions team so they can follow up on questions in item 1.

ehsannas commented 10 months ago

We've had reports of perferRest helping customers

Interesting data point: https://github.com/googleapis/nodejs-firestore/issues/1862 suggested that they were getting significantly better performance when using preferRest.

sbhvt commented 10 months ago

@MarkDuckworth thanks so much for forwarding that along.

@sbhvt, Sorry to hear this change has caused performance degradation for you. preferRest was introduced to reduce function cold start times associated with loading and initializing the grpc stack. There's nothing wrong with choosing to set preferRest: false in the long run.

Ok, that's what we will plan to do for now to get our performance times back to the previous.

We've had reports of perferRest helping customers, but I can't recall one that has show serious performance degradation. If possible, we'd like to learn more about your workloads to help us understand why you might be experiencing this.

Here's a breakdown of background on what we've seen:

Cold starts aren't really a concern for us; most of the functions in question have provisioned minimum instances and consistently have multiple instances running, those which do not have minimum instances are those we already deemed cold starts to be acceptable for (i.e. where execution time is not concern). Our bigger concern is ongoing execution time.
we don't have load-testing in our lower environments and obviously don't have the same level of steady traffic in dev/test, so I could only really see concrete measurable impacts against historical memory and execution times in our production environment, but even in the lower environments, we have started to frequently get the ECONNRESET errors in functions that were deployed in the last week. ECONNRESET issues were found for functions across ALL types of triggers (http, chron, firestore triggers, pubsub).
for all functions there was a visible impact to memory utilization: much higher and in most cases more variable. Incidentally, I upgraded node 16 to node 18 and saw an even worse impact on memory, but this was before I had identified the root cause so that's something I'd want to repeat again with preferRest off to be more confident with saying it was an issue.
some of our functions have pretty high variation of execution times while others have pretty steady execution times; for the functions that had historically steady execution times, it was very easy to see exactly when the impacts began and when they ended (when I set preferRest to false); for those it with variable times, I couldn't necessarily correlate the measurable difference in execution times.
the functions with the measurable execution time difference are all ones where we have multiple read queries to Firestore, pulling back only 1-2 records; the document size is pretty large but the number of records in the response is not.
I can't say whether we have seen an impact in functions which were not read-heavy workloads simply because this particular deployment happened to touch mostly functions that were reads only or functions with writes which already had highly variable execution times.

For reference here's a visual of the difference in execution times for one of our functions between when we first encountered the issue to when I finally set preferRest to false. This particular function had pretty steady request volume and, as you can see, steady execution times (showing the 50 and 95 percentiles). As you can see it more than doubled execution time during the period it was running with FIRESTORE_PREFER_REST enabled. Other functions saw similar execution time changes that were introduced exactly when the most recent deployment was made.

The function above is an http function that averages between 2-3 active and 2-3 idle instances consistently.

Is there any specific guidance you'd recommend based on what I've shared above? Or should we just proceed with disabling rest?

sbhvt commented 10 months ago

@MarkDuckworth additional info I forgot to mention above: when I enabled firestore logging, I could clearly see that the issue was with the increase in time from send request to receive response. The time itself differed depending on the query but for the same query, the time was much higher with preferRest, sometimes more than double. I did, however, indeed see an improvement in time to initialize the first client (105 ms without rest, 40 with rest), but that was more than offset by the increase in each request time, plus since we have active instances running clients are reused most requests.

taeold commented 10 months ago

@sbhvt Engineer from Cloud Functions for Firebase here.

Thanks for detailing your setup and pointing out the issue with the recent update. Our intention with changing the default value for preferRest to improve the performance of initializing and using the Firestore client for majority of our users, but clearly, it didn't pan out that way for everyone. Sorry about the hassle it caused – that wasn't our plan.

We're pulling back on making preferRest: true the default. We're going to dig deeper into the issue you've surfaced and, if it makes sense, we might reintroduce it in a future update – but only through a major bump in the firebase functions SDK.

Your feedback was incredible helpful. We're on track to fix this and will let you know once it's sorted.

sbhvt commented 10 months ago

@taeold Thanks, appreciate the reply. Let me know if there's any more info you need.

IchordeDionysos commented 10 months ago

We are also facing a similar issue.

Our usage is an API that reads data from Cloud Firestore and returns it to the client. We are using Cloud Functions V2 (so running on Cloud Run with concurrency enabled).

Most of the requests will hit a running instance.

inlined commented 10 months ago

Hey all, I used to be the Firebase tech lead for Cloud Functions and this feels like the ECONNRESET errors that plagued GCF 1st gen back in the beginning. Lots of SDKs in the world weren't ready for serverless environments and Google's SDKs were no better.

The fundamental problem was due to connection pooling/keepalives. While connection pools are great at reducing latency by reducing TLS handshakes, each connection must be kept alive. During hibernation, there is sometimes not enough CPU for the connection pool to send or process the keep alive and the server closes the connection. Unfortunately, part of Google infrastructure swallows the FIN packet to avoid a RST storm attack and the client doesn't know the connection is dead until next use. I would recommend two mitigations to these issues:

Wrap all idempotent (E.g. GET/LIST/PUT) operations in a single retry in the case of ECONNRESET. Unfortunately you cannot know if any bytes were sent in a POST request so you cannot retry those.
Clean up idle connections after a timeout rather than just relying on TCP keepalive.

DevSchmidtchen commented 5 months ago

Is there any update on this? I'm also facing this issue.

googleapis / nodejs-firestore

Need guidance on use of preferRest; FIRESTORE_PREFER_REST is now being set to true by default in firebase functions causing preferRest to be default rather than opt-in #1943