OpenTelemetry tracing of waypoint lifecycles

Vlaaaaaaad commented 3 years ago

Is your feature request related to a problem? Please describe.

I'd love more visibility into CI/CD pipelines, and traces are the perfect way to visualize that!

A trace not only adds a different visualization but also enables insights into the CI/CD process. With waypoint exporting traces for each run, I could easily answer questions such as "did the build time increase for our application over the last 6 months?", "where could I best optimize my delivery pipeline?", or "do we have a higher failure rate for deploys when we also have to run the database migration step?".

Describe the solution you'd like

Ideally, I'd love to have waypoint export an OpenTelemetry trace for each waypoint up and all the underlying operations. The trace could be exported to a local cache, a file, or a remote endpoint (say DataDog or any other OpenTelemetry-compatible vendor).

Each span would have relevant details like provider_name, provider_version, values, return_status, and so on. Logs for each span would either be in-line or the field would have a link to the relevant section in the Waypoint UI.
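To make the shape of such a trace concrete, here is a minimal sketch using only the Python standard library. It is not the OpenTelemetry SDK and not Waypoint code: the `Span` class, the `span` helper, the step names, and the attribute values are all hypothetical, chosen only to mirror the fields named above. The ID sizes (16-byte trace ID, 8-byte span ID) follow what OpenTelemetry uses.

```python
import contextlib
import json
import secrets
import time
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Span:
    """Hypothetical span record; not the OpenTelemetry SDK type."""
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    attributes: dict = field(default_factory=dict)
    start_ns: int = 0
    end_ns: int = 0


finished = []  # spans, in the order they finish
_stack = []    # currently open spans, innermost last


@contextlib.contextmanager
def span(name, **attributes):
    """Open a span as a child of the innermost open span, if any."""
    parent = _stack[-1] if _stack else None
    s = Span(
        name=name,
        trace_id=parent.trace_id if parent else secrets.token_hex(16),
        span_id=secrets.token_hex(8),
        parent_id=parent.span_id if parent else None,
        attributes=dict(attributes),
        start_ns=time.time_ns(),
    )
    _stack.append(s)
    try:
        yield s
        s.attributes.setdefault("return_status", "ok")
    except Exception:
        s.attributes["return_status"] = "error"
        raise
    finally:
        s.end_ns = time.time_ns()
        _stack.pop()
        finished.append(s)


# One root span per `waypoint up`, one child span per underlying operation.
# Provider name and version are placeholder values.
with span("waypoint up") as root:
    with span("build", provider_name="docker", provider_version="0.0.0"):
        pass
    with span("deploy", provider_name="docker", provider_version="0.0.0"):
        pass
    with span("release", provider_name="docker", provider_version="0.0.0"):
        pass

# "Export to a file" reduced to its simplest form: one JSON object per span.
trace_lines = [json.dumps(asdict(s)) for s in finished]
```

Every child shares the root's `trace_id` and points at it through `parent_id`, which is what lets a backend draw the run as a single waterfall.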

Describe alternatives you've considered

Logs could be used, but that is a worse user experience, as pipelines have to be built to convert them into relevant metrics or traces. Having structured logs with a unique job_id is pretty close to having an actual trace with spans, but loses all the advantages (visualization, ingest, and reuse of data).
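For a sense of what that conversion pipeline involves, here is a rough sketch that pairs start and end events by job_id to recover step durations. The log schema (`job_id`, `step`, `event`, `ts`) is made up for illustration; Waypoint's actual log format is not shown in this thread.

```python
import json
from collections import defaultdict

# Hypothetical structured log lines, one JSON object per line.
log_lines = [
    '{"job_id": "job-1", "step": "build", "event": "start", "ts": 100.0}',
    '{"job_id": "job-1", "step": "build", "event": "end", "ts": 160.0}',
    '{"job_id": "job-1", "step": "deploy", "event": "start", "ts": 160.0}',
    '{"job_id": "job-1", "step": "deploy", "event": "end", "ts": 175.5}',
]


def durations_by_job(lines):
    """Return {job_id: {step: seconds}} by matching start/end events."""
    starts = {}
    durations = defaultdict(dict)
    for line in lines:
        rec = json.loads(line)
        key = (rec["job_id"], rec["step"])
        if rec["event"] == "start":
            starts[key] = rec["ts"]
        elif rec["event"] == "end" and key in starts:
            durations[rec["job_id"]][rec["step"]] = rec["ts"] - starts.pop(key)
    return dict(durations)


result = durations_by_job(log_lines)
```

Even this small version has to invent its own pairing rules and silently drops unmatched events, and it still knows nothing about parent/child relationships between steps. A span carries all of that natively.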

Explain any additional use-cases

Adding tracing comes with the extra advantage of easier development and debugging of Waypoint itself! Both users and developers want to know if, say, the deploy step using Docker failed because the Docker API timed out. Both users and developers want to know if the aws plugin failed in its 25th call to the AWS API due to a rate limit.

Additional context

This feature request was already considered for Terraform, but the decision was made to wait until opentelemetry-go adds support for logs. Waypoint is different in 2 ways: it's early-stage and has its own UI. Being such an early project, without many external plugins and hooks, means the implementation impact is lower. Having a UI makes traces even more valuable: users could see them right in the UI!

As inspiration, we can look at honeycombio/buildevents which does exactly that! It's a binary that can be used in CI/CD pipelines (with support for Travis, Circle, GitHub Actions, and more) to wrap commands. During the pipeline run, data is sent to Honeycomb, leading to a final trace looking like this: [screenshot of a buildevents trace in Honeycomb]

Other trace examples can be seen on Twitter here and here.

evanphx commented 3 years ago

Hi @Vlaaaaaaad,

Thanks for the very detailed proposal! More observability around the actions being taken by waypoint is an important goal. You call out doing it in OpenTelemetry format. Is there a particular reason you feel it needs to be in that format? I ask because we're in the phase of gathering information about this as a larger feature.

Thanks!

Vlaaaaaaad commented 3 years ago

Hey @evanphx!

I mentioned OpenTelemetry because it's the main open standard. I think the alternatives would be worse. Let's go through the options:

- logs/metrics exposed by themselves. Maybe a Prometheus-compatible /metrics endpoint appears. I don't think this will provide enough value. As a user I'd get "average time for builds" but no other visibility.
- tracing done inside waypoint. This would improve the UI but would be pretty limiting. As a user, I couldn't get those traces in my favorite tool or export them to get any insights. How would I store and archive those traces? How will I see if the CI/CD deploy times have been steadily increasing by 1% every week for the last year? This option would add a lot of complexity, without any extra value.
- tracing done through adding a trace_id in logs was discussed in the first post
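As an example of the kind of question exported data makes easy, here is a small sketch that computes week-over-week growth in average deploy time. The input format (ISO week label plus duration in seconds) and the numbers are hypothetical; in practice they would come from whatever backend stores the exported spans.

```python
from statistics import mean

# Hypothetical (ISO week, deploy duration in seconds) samples taken from spans.
deploys = [
    ("2021-W01", 100.0), ("2021-W01", 100.0),
    ("2021-W02", 101.0), ("2021-W02", 101.0),
    ("2021-W03", 102.01),
]


def weekly_growth(samples):
    """Return [(week, percent change in mean duration vs. previous week)]."""
    by_week = {}
    for week, seconds in samples:
        by_week.setdefault(week, []).append(seconds)
    weeks = sorted(by_week)  # zero-padded ISO week labels sort chronologically
    averages = [mean(by_week[w]) for w in weeks]
    return [
        (weeks[i], (averages[i] - averages[i - 1]) / averages[i - 1] * 100)
        for i in range(1, len(weeks))
    ]


growth = weekly_growth(deploys)
```

With the sample data above, each week comes out roughly 1% slower than the one before, which is exactly the slow drift that is hard to notice without the history.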

Having Waypoint use OpenTelemetry would ensure people can consume the data however they want, in whatever tool they want, without adding extra load on the Waypoint team! I am sure some PMs will take the data to make a case for investing in the CI/CD pipeline. Using OpenTelemetry comes with the added advantage of using their SDKs which enables trace context and baggage propagation to the Waypoint Plugins too! Waypoint will be leading an ecosystem of observable CI/CD ❤️
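To illustrate what trace context propagation means in practice: the W3C Trace Context standard, which OpenTelemetry uses by default, encodes the context as a single `traceparent` string. The sketch below builds and parses that string with the standard library only. The function names are hypothetical, and how Waypoint would actually hand the value to a plugin is not specified anywhere in this thread.

```python
import re

# version "-" trace-id "-" parent-id "-" trace-flags, all lowercase hex
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def make_traceparent(trace_id, span_id, sampled=True):
    """Build a version-00 traceparent value for the current span."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header):
    """Return the parent context as a dict, or None if the value is invalid."""
    m = _TRACEPARENT.match(header)
    if m is None:
        return None
    version, trace_id, parent_id, flags = m.groups()
    # Version ff and all-zero IDs are invalid per the W3C spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


# The caller side would create the value from its current span...
header = make_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
# ...and the plugin side would parse it and start child spans under it.
parent_ctx = parse_traceparent(header)
```

A plugin that starts its spans with `parent_ctx["trace_id"]` and `parent_ctx["parent_id"]` shows up in the same trace as the `waypoint up` that invoked it.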

mitchellh commented 3 years ago

@evanphx We can also separate general application telemetry from Waypoint internal telemetry. I think the latter is much easier and is definitely something we should try to support. We use OpenTelemetry in HCP (actually, I think the precursor cause it wasn't ready when we started) so would prob make sense to keep following that for our internal stuff...

Vlaaaaaaad commented 3 years ago

As an extra data point, Jenkins also released an OpenTelemetry plugin that looks great!

hashicorp / waypoint

OpenTelemetry tracing of waypoint lifecycles #1527