open-telemetry / opentelemetry-js

OpenTelemetry JavaScript Client
https://opentelemetry.io
Apache License 2.0

How to properly use OpenTelemetry in Google Cloud Functions #1739

Open sk- opened 3 years ago

sk- commented 3 years ago

We have been using OpenTelemetry in Firebase Functions for a while now. However, there's still an issue that is quite problematic.

On Google Cloud Functions (Firebase Functions), the instance is not killed once the request ends (globals are still available). However, if code then tries to use resources such as the network, an error is raised and the instance is terminated, which loses the cache and causes a cold start on the next request.

I saw https://github.com/open-telemetry/opentelemetry-js/pull/1296 added forceFlush and shutdown, but I'm under the impression that neither is really a great fit for Google Cloud Functions.

shutdown is ruled out, as subsequent requests then won't generate any traces.

Force flush may work, at the expense of adding extra latency to the endpoint. However, I think it's still not guaranteed that nothing will trigger a trace afterwards.
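To make the force-flush option concrete, here is a minimal sketch of wrapping a handler so that pending spans are flushed before it resolves. The names `Flushable`, `delay`, and `withFlush` are hypothetical and not part of any library; the only assumption about the SDK is that the tracer provider exposes `forceFlush(): Promise<void>`. The timeout is one way to cap the extra latency, and for HTTP functions the flush would presumably have to finish before the response is sent.

```typescript
// Minimal structural type (assumption): anything exposing forceFlush(),
// such as the SDK's tracer provider.
interface Flushable {
  forceFlush(): Promise<void>;
}

// Hypothetical helper: a cancellable timer exposed as a promise.
function delay(ms: number): { promise: Promise<void>; cancel: () => void } {
  let cancel: () => void = () => {};
  const promise = new Promise<void>((resolve) => {
    const t = setTimeout(resolve, ms);
    cancel = () => clearTimeout(t);
  });
  return { promise, cancel };
}

// Hypothetical helper: runs the handler, then waits for a flush (bounded
// by timeoutMs) before letting the returned promise settle.
function withFlush<A extends unknown[], R>(
  provider: Flushable,
  handler: (...args: A) => Promise<R>,
  timeoutMs = 2000,
): (...args: A) => Promise<R> {
  return async (...args: A): Promise<R> => {
    try {
      return await handler(...args);
    } finally {
      const timer = delay(timeoutMs);
      try {
        // Stop waiting once timeoutMs has elapsed, whichever comes first.
        await Promise.race([provider.forceFlush(), timer.promise]);
      } catch {
        // A failed export should not fail the request itself.
      } finally {
        timer.cancel();
      }
    }
  };
}
```

This does not address spans that are created after the flush completes; it only narrows the window.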

One alternative would be to have a pause/unpause mechanism: at the end of the request we pause OpenTelemetry, so that it makes no network calls, and then resume processing when the next request starts. That could miss some spans at the beginning of a new request, though. This could alternatively be done by registering an isPaused or isActive function.
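A rough sketch of what the pause/unpause idea could look like as a wrapper around a span processor. Nothing here exists in opentelemetry-js: `EndProcessor` is a local stand-in for the `onEnd` part of the SDK's span processor interface, and `PausableProcessor`, `pause`, `resume`, `isPaused`, and `maxHeld` are hypothetical names. While paused, ended spans are held in memory instead of being passed to the delegate, then handed over on resume.

```typescript
// Local stand-in (assumption) for the part of a span processor used here.
interface EndProcessor<S> {
  onEnd(span: S): void;
}

// Hypothetical wrapper: while paused, ended spans are kept in memory and
// never reach the delegate, which is the component that may use the network.
class PausableProcessor<S> implements EndProcessor<S> {
  private paused = false;
  private held: S[] = [];

  constructor(
    private readonly delegate: EndProcessor<S>,
    private readonly maxHeld = 512,
  ) {}

  isPaused(): boolean {
    return this.paused;
  }

  // Call at the end of a request.
  pause(): void {
    this.paused = true;
  }

  // Call at the start of the next request; forwards anything held meanwhile.
  resume(): void {
    this.paused = false;
    const pending = this.held;
    this.held = [];
    for (const span of pending) {
      this.delegate.onEnd(span);
    }
  }

  onEnd(span: S): void {
    if (!this.paused) {
      this.delegate.onEnd(span);
      return;
    }
    // Bound memory use: drop the oldest held span once the limit is reached.
    if (this.held.length >= this.maxHeld) {
      this.held.shift();
    }
    this.held.push(span);
  }
}
```

Note that this only gates spans that end while paused. Spans the delegate had already queued before the pause would still need to be flushed first, otherwise a batch timer could fire between requests.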

Another alternative would be to have a forceDiscard method that would be similar to shutdown but would leave the state as _isShutdown = false, so that further requests can still be traced.
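The difference between the proposed forceDiscard and shutdown can be illustrated on a toy span buffer. This is not the SDK's processor: `DiscardableBuffer` and its methods are hypothetical, and only the `_isShutdown` flag name is borrowed from the description above.

```typescript
// Toy span buffer (hypothetical), used only to contrast the two operations.
class DiscardableBuffer<S> {
  private queue: S[] = [];
  private _isShutdown = false;

  // Queue a finished span; ignored after shutdown.
  add(span: S): void {
    if (!this._isShutdown) {
      this.queue.push(span);
    }
  }

  size(): number {
    return this.queue.length;
  }

  // Drop everything queued without exporting, but stay usable.
  // Returns how many spans were thrown away.
  forceDiscard(): number {
    const dropped = this.queue.length;
    this.queue = [];
    return dropped;
  }

  // Drop everything and refuse further spans.
  shutdown(): void {
    this.queue = [];
    this._isShutdown = true;
  }
}
```

After forceDiscard the buffer accepts new spans on the next request, whereas after shutdown every later add is silently ignored.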

What do you guys think? Is there something I'm missing? Or do you have any recommendations on how to tackle this?

dyladan commented 3 years ago

That's a tricky one. I've not worked with Firebase, but in Lambda there is a similar problem. In Lambda, the runtime may (or may not) be suspended and there's no real way to know. Also, a suspended runtime may never wake again. Any async work prevents the function from returning, but also prevents the function from suspending. We decided to force flush on every call, even though it was a slight performance penalty, in order to guarantee every span was exported.

I suppose one of the things you're worried about is that spans may be created after the function completes? If that's the case you may want to create a custom pausable tracer. I would be hesitant to introduce a pausable tracer to this repo without spec approval, but I don't see any reason an alternative tracer couldn't be added to the contrib repo. This sounds like the type of thing the spec would be very interested in though.

Flarna commented 1 year ago

I think this is more a call towards the cloud providers. If they provide lifecycle hooks, tools like OTel can be tuned for them.