heroku / roadmap

This is the public roadmap for Salesforce Heroku services.

Heroku Telemetry [OpenTelemetry] #214

Open elimchaysengSF opened 1 year ago

elimchaysengSF commented 1 year ago

Required Terms

What service(s) is this request for?

Heroku Platform

Tell us about what you're trying to solve. What challenges are you facing?

At Heroku, we have a long history of creating fantastic developer experiences by simplifying development down to an opinionated standardized set of tooling that “just works.” Among that tooling is our logging infrastructure, with its ability to quickly access logs via live Tailing sessions and easily export logs via Log Drains to Logging Add-on Partners.

We already use the OpenTelemetry framework inside Heroku for internal metrics, but OpenTelemetry is more than just metrics. It is the trifecta of Traces, Metrics and Logs.

We should build on our developer experience and logging infrastructure to deliver telemetry for the Heroku platform and Heroku customers, while embracing OpenTelemetry and joining the growing list of contributors to and supporters of the standard.

Please comment below for any specific use-cases, metrics, or suggestions related to upgrading our Telemetry and Observability features to utilize the OpenTelemetry standard!

stevenharman commented 1 year ago

We also leverage OpenTelemetry (currently for Traces, and some day for Logs and Metrics). One big hole is that our Traces only start once a request is "inside" the Dyno, i.e., once code that we control is handling it. In the case of a Rails app, this means at the Rails internals, or maybe at the web server (Puma) level.

That leaves a pretty big blind spot in terms of how the request/response progressed from the Heroku infra on to our code. We'd love to close that hole.
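
A rough way to put a number on part of that blind spot today is to compare the router's arrival timestamp with the time the app first sees the request. The sketch below assumes the request carries the `X-Request-Start` header that Heroku's HTTP routing documentation describes (a Unix timestamp in milliseconds); the function name and the plain-dict header lookup are illustrative only. It yields a single duration, not a real span, so it narrows the gap rather than closing it.

```python
import time
from typing import Mapping, Optional


def router_to_app_ms(
    headers: Mapping[str, str],
    now_ms: Optional[float] = None,
) -> Optional[float]:
    """Estimate milliseconds between router arrival and app handling.

    Assumes `headers` uses the exact key "X-Request-Start" (a plain dict is
    case-sensitive; real frameworks usually normalise header names).
    Returns None when the header is missing or not numeric.
    """
    raw = headers.get("X-Request-Start")
    if raw is None:
        return None
    try:
        started_ms = float(raw)
    except ValueError:
        return None
    if now_ms is None:
        now_ms = time.time() * 1000.0
    # Clock skew between hosts can make the difference negative; clamp to 0.
    return max(0.0, now_ms - started_ms)
```

The result could be attached as an attribute on the first span the app creates, so the pre-dyno delay at least shows up alongside the trace.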

adamlogic commented 5 months ago

As an add-on provider, here are our two big requests related to logging:

stevenharman commented 5 months ago
  • We'd love to support our customers who are using Private Space Logging, but currently add-on providers cannot create a log drain for these customers.

This one is tricky; the point of Private Space Logging is that ALL log lines for ALL Apps in the PS go to a single drain. It's then that drain's responsibility to figure out if, where, and how to send each log line. Basically, it's up to you to build your own log multiplexer. This is how we handled things internally at Heroku, anyhow.
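
To make the "build your own log multiplexer" idea concrete, here is a minimal sketch. It assumes each incoming line has already been parsed into an app name, a source, and a message; the `LogLine` fields and the `LogMultiplexer` class are hypothetical names for illustration, not Heroku's internal implementation or drain format.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple, Optional


class LogLine(NamedTuple):
    app: str      # hypothetical: app that emitted the line
    source: str   # hypothetical: e.g. "router" or "web.1"
    message: str


Sink = Callable[[LogLine], None]


class LogMultiplexer:
    """Fan one space-wide stream of log lines out to per-app sinks."""

    def __init__(self, default_sink: Optional[Sink] = None) -> None:
        self._sinks: Dict[str, List[Sink]] = defaultdict(list)
        self._default = default_sink

    def subscribe(self, app: str, sink: Sink) -> None:
        """Register a sink that receives every line emitted by `app`."""
        self._sinks[app].append(sink)

    def dispatch(self, line: LogLine) -> int:
        """Deliver a line and return how many sinks received it."""
        sinks = self._sinks.get(line.app)
        if not sinks:
            # No subscriber for this app: use the fallback, or drop the line.
            if self._default is None:
                return 0
            self._default(line)
            return 1
        for sink in sinks:
            sink(line)
        return len(sinks)
```

In practice a sink would forward to an add-on's endpoint; here any callable works, for example `mux.subscribe("billing", billing_lines.append)`.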

  • We'd love to only subscribe to Heroku router logs and runtime metrics logs…

Yes, agreed! Being able to specify a set of "sources" for a drain would be 👨‍🍳💋.

elimchaysengSF commented 5 months ago

Hey Adam!

Thanks for raising these points and giving us a good conversation as we get into more detail on the buildout of our OTel plans.

We'd love to support our customers who are using Private Space Logging, but currently add-on providers cannot create a log drain for these customers.

We're planning to support multiple space level drains as part of this effort. We're removing this limitation, so when we get to the point of integrating 3rd party vendors, Space Drain support should be a thing we include.

We'd love to only subscribe to Heroku router logs and runtime metrics logs. App-level logs often contain sensitive info we'd rather not ever see. It would also reduce the burden on our infrastructure if we didn't have all of this unwanted log data flowing in.

We've tangentially discussed this one related to how we segment and categorize the sources for the drains in a more detailed way than we do today.

We'd have something like the hierarchy of telemetry drains: Space -> App -> Source

Where the Source is something like an individual Formation type (web, worker, release, etc.) or another stream of telemetry, such as the router, the Heroku API, or third- and first-party systems (Heroku Postgres, etc.). This source list would be where you could configure and subscribe to only the drains you want and need.
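
As a purely illustrative sketch of the Space -> App -> Source hierarchy (not a committed design or API), a subscription could narrow at each level, with an unset level meaning "everything below". All class and field names here are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class TelemetryOrigin:
    """Where a piece of telemetry came from (hypothetical model)."""
    space: str
    app: str
    source: str  # e.g. "web", "worker", "router"


@dataclass(frozen=True)
class DrainSubscription:
    """A drain scoped to a space, optionally narrowed to an app and sources."""
    space: str
    app: Optional[str] = None                 # None = every app in the space
    sources: Optional[FrozenSet[str]] = None  # None = every source

    def accepts(self, origin: TelemetryOrigin) -> bool:
        """Return True if telemetry from `origin` belongs on this drain."""
        if origin.space != self.space:
            return False
        if self.app is not None and origin.app != self.app:
            return False
        return self.sources is None or origin.source in self.sources
```

Under this model, an add-on that only wants router logs for one app would use something like `DrainSubscription(space="prod", app="shop", sources=frozenset({"router"}))` and would never receive app-level log lines.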

teddy-materio commented 3 months ago

Our Heroku app exports its own custom OpenTelemetry metrics. We would love help running the recommended sidecar collector alongside our app. Our current approach is to use a 3rd party buildpack like this one. Heroku could help with this in the following ways:

  1. Spin up an OpenTelemetry collector automatically if our app has a valid config file - no extra buildpack needed.
  2. Help us monitor that the OpenTelemetry collector is itself healthy, working well, and not losing data. Alert us if not.
  3. Help make sure the OpenTelemetry collector stays running even while our app is shutting down. (With our current approach I'm not sure which process shuts down first: our app or the collector, so we may be dropping some metrics at shutdown time).
  4. Still let us define our own standard otelcol config file that configures which backend the OpenTelemetry collector should be forwarding the metrics to.
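
The shutdown-ordering concern in item 3 can be sketched as follows, assuming a hypothetical wrapper script that launched both the app and the collector and therefore holds a handle to each. Whether a wrapper can actually control the order on a dyno depends on how the platform delivers signals to processes, which this thread does not settle; `stop_in_order` and `spawn_sleeper` are illustrative names only.

```python
import subprocess
import sys
from typing import List, Sequence, Tuple


def stop_in_order(
    procs: Sequence[Tuple[str, subprocess.Popen]],
    grace_seconds: float = 10.0,
) -> List[str]:
    """Stop processes one at a time, in the order given.

    Each process is asked to terminate and gets up to `grace_seconds` to
    exit before being killed. The next process is only signalled once the
    previous one has exited. Returns the names in the order they stopped.
    """
    stopped = []
    for name, proc in procs:
        proc.terminate()
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
        stopped.append(name)
    return stopped


def spawn_sleeper() -> subprocess.Popen:
    """Stand-in for a real app or collector process (illustration only)."""
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
```

Listing the app first, as in `stop_in_order([("app", app), ("collector", collector)])`, keeps the collector alive until the app has exited, so metrics emitted during the app's shutdown still have somewhere to go.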

danielfromCL commented 3 months ago

Hi! Our Heroku app currently has a collector running in its web dynos alongside our app. However, the setup was rather tedious and involved a lot of trial and error. We'd love for the architecture setup to be simpler, so that we only have to focus on the config.yaml file. For example, we currently can't send traces from other dynos to our collector, since it's running on different dynos, which means our setup is currently lacking in coverage. It would be really helpful if setting up a collector for multiple dynos was simpler. Also open to any ideas as to how to make this setup work for many dynos in the meantime :) Cheers, and thanks for the help!

adudiapis commented 2 months ago

@danielfromCL What APM platform are you using?

yesmarket commented 1 week ago

Hi,

We use Elastic Cloud as our general observability platform (for logs, metrics, and APM) and we also have a managed SIEM (which is managed by a third party and is Splunk SIEM under the hood).

We need to send all logs (application, platform and activity), as well as metrics and traces to Elastic, and a subset of security related logs to Splunk.

In our AU private space, we've been running the OTEL agent as a sidecar container for our single Heroku app, but now we're looking to roll out our telemetry solution to some of our other regions that have multiple Heroku apps in their private spaces. We're thinking we'll move the OTEL agent into its own Heroku app, e.g.

Image

Anyways, interested to see where this goes...