open-telemetry / semantic-conventions

Defines standards for generating consistent, accessible telemetry across a variety of domains
Apache License 2.0

Proposed Attribute Conventions for CI/CD #915

Open adrielp opened 2 months ago

adrielp commented 2 months ago

Area(s)

area:new, area:cloudevents, area:deployment

Relates to #832 #833

Is your change request related to a problem? Please describe.

This is an issue being opened for broader discussion within the CI/CD Working Group and Semantic Conventions WG to gauge direction on the addition of conventions for CI and CD. This proposal outlines, at a moderate level of detail, the direction to evolve our support of CloudEvents and leverage extensions in order to define the exact attributes we support, where they come from in the community, etc.

Describe the solution you'd like

Below is an incomplete set of extended attributes for subject.type for CI/CD. These attributes primarily come from v0.3.0 of CDEvents.

The CDEvents specification is broken down into multiple sections:

An example of what CDEvents bound to CloudEvents looks like can be found here and is copied below.

POST /sink HTTP/1.1
Host: cdevents.example.com
ce-specversion: 1.0
ce-type: dev.cdevents.taskrun.started.0.1-draft
ce-time: 2018-04-05T17:31:00Z
ce-id: A234-1234-1234
ce-source: /staging/tekton/
ce-subject: /namespace/taskrun-123
Content-Type: application/json; charset=utf-8
Content-Length: nnnn

{
   "context": {
      "version": "0.3.0",
      "id" : "A234-1234-1234",
      "source" : "/staging/tekton/",
      "type" : "dev.cdevents.taskrun.started",
      "timestamp" : "2018-04-05T17:31:00Z",
   }
   "subject" : {
      "id": "/namespace/taskrun-123",
      "type": "taskRun",
      "content": {
         "task": "my-task",
         "url": "/apis/tekton.dev/v1beta1/namespaces/default/taskruns/my-taskrun-123"
         "pipelineRun": {
            "id": "/somewherelse/pipelinerun-123",
            "source": "/staging/jenkins/"
         }
      }
   }
}

Each one of these subjects would be associated with a predicate, which is what happens to the subject in an occurrence. For example, taskRun would be followed by started. This does need more conversation around timestamps. Based on one of the WG discussions, one of the key questions was surrounding start & stop times. Because of the nature of event predicates in CDEvents and the event definitions for Eiffel, events denote what type they are (i.e. start / finished) and have corresponding timestamps for when the event was created. Due to the nature of distributed tracing with regards to the CloudEvents specification, this shouldn't conflict with the current tracing specification.
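
To make the timestamp question concrete, here is a minimal sketch (not part of the proposal) of pairing a started/finished CDEvent for the same subject into a single OTel span, using the event timestamps as the span's start and end times. The finished timestamp and the cdevents.subject.id attribute name are assumptions for illustration.

from datetime import datetime

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("cicd-example")

def to_ns(ts: str) -> int:
    # Convert an RFC 3339 timestamp (as used in CDEvents) to epoch nanoseconds.
    return int(datetime.fromisoformat(ts.replace("Z", "+00:00")).timestamp() * 1e9)

started = {"type": "dev.cdevents.taskrun.started", "timestamp": "2018-04-05T17:31:00Z",
           "subject_id": "/namespace/taskrun-123"}
finished = {"type": "dev.cdevents.taskrun.finished", "timestamp": "2018-04-05T17:33:10Z",  # assumed
            "subject_id": "/namespace/taskrun-123"}

# The span is only created once the "finished" event arrives, so both timestamps
# are known and the "missing end event" problem does not come up in this sketch.
span = tracer.start_span("taskRun /namespace/taskrun-123", start_time=to_ns(started["timestamp"]))
span.set_attribute("cdevents.subject.id", started["subject_id"])
span.end(end_time=to_ns(finished["timestamp"]))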

An example event workflow within a CI system may look like this: [workflow diagram attached to the original issue]

The example CI system above would send event data over OTLP with the attribute examples listed above. I've leaned towards these attributes for these reasons:

Describe alternatives you've considered

Eiffel could be made to extend CloudEvents just like CDEvents, which would enable choice and interoperability between conventions. Trace propagation would then occur as per the CloudEvents spec defined in OpenTelemetry, with the addition of attributes aligning with CI/CD.

Additional context

The one currently identified divergence between CloudEvents Distributed Tracing and CI/CD systems is the method of propagation. This is for the traceparent, which can be propagated within CI/CD systems to provide inter-process context propagation. Using environment variables as carriers for the context and baggage propagators is going to be key for batch systems like CI to be able to emit events with correct lineage.
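
As a rough illustration of the environment-variable carrier idea (the TRACEPARENT/BAGGAGE variable names and the pipeline.step.name attribute are assumptions, not a settled convention), a CI step could recover the pipeline's context like this:

import os

from opentelemetry import trace
from opentelemetry.propagate import extract

# W3C propagators expect lower-case, header-style keys, so the environment is
# lower-cased before being used as the carrier (TRACEPARENT -> traceparent).
carrier = {k.lower(): v for k, v in os.environ.items()}
ctx = extract(carrier)  # Context holding the remote span context and baggage

tracer = trace.get_tracer("ci-step")
with tracer.start_as_current_span("go build", context=ctx) as span:
    span.set_attribute("pipeline.step.name", "go build")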

Current outstanding thoughts and concerns:

jundai-godaddy commented 2 months ago

Is the idea that these are just for describing the CD events themselves? or is there a way that we'll be able to use these to also describe the built artefact in, say, production logs? I would love to have semantics for things like branch, sha, build-time for everything relating to the deployed artefact, and it strikes me that there's a lot of overlap.

trask commented 2 months ago

a possible tie-in to the OpenTelemetry profiling work could be to define CI/CD semantic conventions for sending symbols for native frames, which backends could then use when rendering profiles

adrielp commented 2 months ago

Is the idea that these are just for describing the CD events themselves?

@jundai-godaddy That's the current state of CDEvents as I understand them, yes. The intent here is to leverage the current state, and iterate/add such that this metadata would be included in metrics, logs, and traces within OpenTelemetry so that you would have those semantics in place.
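
As a loose sketch of what that could look like for metrics (attribute names borrowed from the proposal later in this thread; the counter name is an assumption):

from opentelemetry import metrics

meter = metrics.get_meter("cicd-example")
deployments = meter.create_counter("deployments", description="Count of finished deployments")

# The same CI/CD metadata attached to a metric data point.
deployments.add(
    1,
    {
        "deployment.name": "My Deployment",
        "deployment.environment.name": "production",
        "deployment.status": "success",
    },
)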

a possible tie-in to the OpenTelemetry profiling work could be to define CI/CD semantic conventions for sending symbols for native frames, which backends could then use when rendering profiles

@trask that's good to know! thanks for that callout.

christophe-kamphaus-jemmic commented 2 months ago

What is the strategy for existing tools as listed on the CICD WG project page (GitHub Actions Receiver, Jenkins Opentelemetry plugin, …)? Can their attributes be covered by those from CDEvents or would additional attributes need to be defined? Which parts of their attributes should be covered by OTel SemConv, while the remaining attributes would be specific to a given tool?

In any case when OTel SemConv defines CICD conventions, tools that want to follow OTel SemConv will most likely have to adapt. The question is how much effort this will be.

adrielp commented 2 months ago

@christophe-kamphaus-jemmic that's a great question that I just don't have a complete answer to. I would think that part of that answer would have to come from the maintainers of each of the tools based on their decision to support the convention. I'm sure this migration wouldn't be trivial, and the CDEvents conventions aren't all encompassing. Questions like yours are the reason I wanted to open this as an issue first instead of just pushing a pull request with attributes.

For example, the githubactionsreceiver's attributes mirror what GitHub provides in terms of event metadata, which doesn't match CDEvents. Most of those attributes would be translatable, but CDEvents doesn't have an attribute for sha or login from what I see. However, anything beyond CDEvents, like its parent CloudEvents, would be addable through the customData field.

Another example would be the Jenkins work. From a cursory glance at their code base, their semantic attributes could be updated to reflect these attributes. jenkins.pipeline.step.id would translate to cdevents.taskRun.id and then things like jenkins.computer.name could show up in the customData field.

I think adoption of existing things isn't going to be easy, but I don't think it's going to be ridiculously hard either. What I'd like to see is an easy way to translate/map, add missing attributes (either via extension or directly in OTEL), and extend (I think both CDEvents & CloudEvents data/customData fields already hit the mark).
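
A purely hypothetical translation sketch (not an agreed mapping) of the idea above: rename the Jenkins plugin attributes that have a CDEvents counterpart and push everything else into a customData-style bucket. The cdevents.* attribute names here are assumptions.

JENKINS_TO_CDEVENTS = {
    "jenkins.pipeline.step.id": "cdevents.taskRun.id",
}

def translate(attributes: dict) -> dict:
    translated, custom = {}, {}
    for key, value in attributes.items():
        if key in JENKINS_TO_CDEVENTS:
            translated[JENKINS_TO_CDEVENTS[key]] = value
        else:
            custom[key] = value  # e.g. jenkins.computer.name
    if custom:
        translated["cdevents.customData"] = custom  # attribute name is an assumption
    return translated

print(translate({"jenkins.pipeline.step.id": "42", "jenkins.computer.name": "agent-7"}))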

On the note of translation, I'm concerned about explicitly adopting CDEvents and excluding Eiffel. From a mapping perspective, I'd love to make interoperability and support between the two easy so a common language can be found.

I'd be curious to hear your thoughts on that as well @e-backmark-ericsson @afrittoli

e-backmark-ericsson commented 2 months ago

@adrielp , I'm thrilled to read this issue and the conversation in it! I'm a maintainer both of the Eiffel event protocol and of the CDEvents event protocol. We've been looking for a way to interact/connect between those event protocols and OTel for some time now.

Both Eiffel and CDEvents aim at solving mostly the same needs, and those include both interoperability and observability. I believe that an event-driven architecture, with a distributed and broadcasting event system with publish/subscribe functionality, is superior to achieve interoperability between components/services in a CI/CD setup compared to a point-to-point oriented architecture. For the observability use case on the other hand, I believe that solutions like OTel could play a crucial role. And that's the use case that I think most people involved in OTel are primarily concerned about.

Eiffel and CDEvents can provide observability of a CI/CD system, but as we know that many of the tools involved in CI/CD also have the ability to propagate data over OTLP, it would be great to find a way to connect these two worlds. I envision that Eiffel and/or CDEvents would provide observability of the top level of a CI/CD pipeline, or the full SDLC, which could include events notifying about new requirements or bugs, through source change commits, PRs, pre-merge builds & tests, production builds, component tests, system tests, deployments, rolling upgrades and beyond. And for the tools/systems involved in this process that are capable of emitting OTel data, there should be a way to relate that OTel data to Eiffel/CDEvents data and vice versa.

One important aspect of observability is the ability to observe the full pipeline live. One use case is that developers pushing code to CI/CD would like to know how far their changes have come in the full CI/CD flow, for example using a live 'follow-your-commit' visualization. To handle that I believe it's crucial to be able to notify about activities (pipeline steps) being put in queue or started, and not just about finished activities. That also enables the overall system to take action on certain steps that take longer than expected, even if the service executing those steps has no timeout functionality in itself. I currently don't see that OTel can handle that use case, but I'd be glad to be informed otherwise.

My knowledge and understanding of OTel is still too limited to clearly see the right way forward when it comes to how to relate OTel data with Eiffel/CDEvents data, but I hope we'll manage to sort that out as part of this OTel CI/CD observability initiative.

kuisathaverat commented 2 months ago

Some of the attributes, IMO, need to be more generic. They are too tied to how Tekton defines a CI/CD workflow; if you are not familiar with Tekton, that naming is weird/odd.

For example, Jenkins uses job/pipeline, stage, and step; GitHub uses workflow, job, and step; GitLab uses pipeline, stage, job, and steps.

The attributes are plain. I mean, all are at the same level. There is no grouping (pipeline, test, scm, ...). We saw in other Opentelemetry conventions the benefits of grouping the attributes by categories; it makes the search and the implementation of transformers to other kinds of data, graphs, and so on easier. If the attributes are not well categorized at the beginning, it causes continuous refactoring and breaking changes.

I suggest changing to something like:

Finally, I'm missing attributes referencing the Agent/runner/worker where the build takes place. Most of that info would go as system info not related to CI/CD, but we need a way to relate an Agent/runner/worker with a build.

afrittoli commented 2 months ago

Some of the attributes, IMO, need to be more generic. They are too tied to how Tekton defines a CI/CD workflow; if you are not familiar with Tekton, that naming is weird/odd.

  • taskRun # from Core Events
  • pipelineRun # from Core Events

For example, Jenkins uses job/pipeline, stage, and step; GitHub uses workflow, job, and step; GitLab uses pipeline, stage, job, and steps.

Names are indeed different all over the place; we collected a lot of different names from different tools as part of the CDF SIG Interoperability work.

The TaskRun and PipelineRun names are identical to the Tekton ones, but they are meant to be generic. We picked those because they let us distinguish between execution and definition. Many CI systems let users define, update, and delete their task/steps/pipelines/workflow definitions through a certain process, and execute them through a different one. We don't have events today for the process of creating/editing/deleting pipelines but we might have them in future.

We took a similar approach for tests and test suites - events for their executions are TestRun and TestSuiteRun.

Regardless of the name we pick for the standard, to help with adoption we could document how that name maps to the tool-specific name.

The attributes are plain. I mean, all are at the same level. There is no grouping (pipeline, test, scm, ...). We saw in other Opentelemetry conventions the benefits of grouping the attributes by categories; it makes the search and the implementation of transformers to other kinds of data, graphs, and so on easier. If the attributes are not well categorized at the beginning, it causes continuous refactoring and breaking changes.

I'm not sure I understand this, but I may be lacking OpenTelemetry context.

We have groups of events in CDEvents, each group hosts several subjects, each subject has several predicates, each subject/predicate combination has several attributes - however those attributes are also grouped at subject level and can sometimes be referenced across events too.

For example change.id from the content of an artifact.packaged matches the subject.id in change.* events. And the user attribute from artifact.downloaded matches the user attribute from the artifact.deleted event.

I suggest changing to something like:

  • pipeline

    • step
  • build

    • artifact
  • test

    • suite

    • case

      • output
  • scm

    • branch
    • repository
    • change
  • deployment

    • environment
    • service
  • incident

Finally, I'm missing attributes referencing the Agent/runner/worker where the build takes place. Most of that info would go as system info not related to CI/CD, but we need a way to relate an Agent/runner/worker with a build.

There is definitely space for adding more attributes to the existing schema. We took the minimalistic approach of adding new attributes as needed, when we have a use case for them. I can definitely see info about the agent/runner/worker belonging to the build events. Feel free to create an issue about that; we could include the new field in the next release if we agree on the format.

afrittoli commented 2 months ago

Is the idea that these are just for describing the CD events themselves? or is there a way that we'll be able to use these to also describe the built artefact in, say, production logs? I would love to have semantics for things like branch, sha, build-time for everything relating to the deployed artefact, and it strikes me that there's a lot of overlap.

CDEvents includes several of what we call "subjects", such as change (for SCM events), build, artifact, testRun, outputs, tickets, incidents, environment, service and more. Each subject has several attributes, see artifact for instance. The current list of attributes is not meant to be comprehensive, we may add more attributes (and subjects) as needed for interoperability purposes.

We can have individual discussions about each specific proposed attribute. From your list:

afrittoli commented 2 months ago

@christophe-kamphaus-jemmic that's a great question that I just don't have a complete answer to. I would think that part of that answer would have to come from the maintainers of each of the tools based on their decision to support the convention. I'm sure this migration wouldn't be trivial, and the CDEvents conventions aren't all encompassing. Questions like yours are the reason I wanted to open this as an issue first instead of just pushing a pull request with attributes.

For example, the githubactionsreceiver's attributes mirror what GitHub provides in terms of event metadata, which doesn't match CDEvents. Most of those attributes would be translatable, but CDEvents doesn't have an attribute for sha or login from what I see. However, anything beyond CDEvents, like its parent CloudEvents, would be addable through the customData field.

CustomData is certainly an option. The field is meant mainly for tool/vendor-specific data, but it can also be used as an interim home for attributes that are in the process of being added to CDEvents. We took a minimalist approach and several events only have the bare minimum needed to get started, and the project is open to having new fields added as long as they are meaningful across a number of tools.

Another example would be the Jenkins work. From a cursory glance at their code base, their semantic attributes could be updated to reflect these attributes. jenkins.pipeline.step.id would translate to cdevents.taskRun.id and then things like jenkins.computer.name could show up in the customData field.

I think adoption of existing things isn't going to be easy, but I don't think it's going to be ridiculously hard either. What I'd like to see is an easy way to translate/map, add missing attributes (either via extension or directly in OTEL), and extend (I think both CDEvents & CloudEvents data/customData fields already hit the mark).

+1

On the note of translation, I'm concerned about explicitly adopting CDEvents and excluding Eiffel. From a mapping perspective, I'd love to make interoperability and support between the two easy so a common language can be found.

It may be possible to map CDEvents to corresponding Eiffel events (and vice-versa); maybe an adapter SDK would be sufficient to solve the dilemma? The CDEvents community has been working in the past few years on driving the adoption of CDEvents in tools - building on top of that and combining efforts with the OTEL community may be a good recipe to further foster adoption.

I'd be curious to hear your thoughts on that as well @e-backmark-ericsson @afrittoli

v1v commented 2 months ago

I'm not very familiar with the cost of storing attribute names, but if reducing these 3 bytes in the attribute names implies some space saving, I wonder whether the Run suffix could also be removed to be even more tool-agnostic:

Furthermore, IIUC, testOutput won't distinguish stderr/stdout, so I might suggest something like:

That's similar to what was identified here.

kuisathaverat commented 2 months ago

The TaskRun and PipelineRun names are identical to the Tekton ones, but they are meant to be generic. We picked those because they let us distinguish between execution and definition. Many CI systems let users define, update, and delete their task/steps/pipelines/workflow definitions through a certain process, and execute them through a different one. We don't have events today for the process of creating/editing/deleting pipelines but we might have them in future.

They are not really generic to me. The Run part does not add any value; I'd accept pipeline and task as more generic. The same goes for the Run part in test, which does not add any value; we are defining the execution of something in the CI/CD, and we know it is a Run already.

The attributes are plain. I mean, all are at the same level. There is no grouping (pipeline, test, scm, ...). We saw in other Opentelemetry conventions the benefits of grouping the attributes by categories; it makes the search and the implementation of transformers to other kinds of data, graphs, and so on easier. If the attributes are not well categorized at the beginning, it causes continuous refactoring and breaking changes.

I'm not sure I understand this, but I may be lacking OpenTelemetry context.

We have groups of events in CDEvents, each group hosts several subjects, each subject has several predicates, each subject/predicate combination has several attributes - however those attributes are also grouped at subject level and can sometimes be referenced across events too.

For example change.id from the content of an artifact.packaged matches the subject.id in change.* events. And the user attribute from artifact.downloaded matches the user attribute from the artifact.deleted event.

I am talking about changes in fields; these changes usually break all the historical data you have in some way. As an example, here you have one of the latest changes in the JVM fields for Java instrumentation. I am trying to say that choosing the right fields in the right hierarchy is critical; on those decisions, people will build all their apps, graphs, UI, and processes, ...

afrittoli commented 2 months ago

The TaskRun and PipelineRun names are identical to the Tekton ones, but they are meant to be generic. We picked those because they let us distinguish between execution and definition. Many CI systems let users define, update, and delete their task/steps/pipelines/workflow definitions through a certain process, and execute them through a different one. We don't have events today for the process of creating/editing/deleting pipelines but we might have them in future.

They are not really generic to me. The Run part does not add any value; I'd accept pipeline and task as more generic. The same goes for the Run part in test, which does not add any value; we are defining the execution of something in the CI/CD, and we know it is a Run already.

One might want to build automation associated with changes to the definitions of Pipelines and Tasks, and the Run part allows one to distinguish that case. That said, I agree that the most common use case by far is the execution, so the Run part could be considered redundant. I will open an issue on CDEvents to discuss dropping the Run from both test and pipeline/task events.

The attributes are plain. I mean, all are at the same level. There is no grouping (pipeline, test, scm, ...). We saw in other Opentelemetry conventions the benefits of grouping the attributes by categories; it makes the search and the implementation of transformers to other kinds of data, graphs, and so on easier. If the attributes are not well categorized at the beginning, it causes continuous refactoring and breaking changes.

I'm not sure I understand this, but I may be lacking OpenTelemetry context.

We have groups of events in CDEvents, each group hosts several subjects, each subject has several predicates, each subject/predicate combination has several attributes - however those attributes are also grouped at subject level and can sometimes be referenced across events too.

For example change.id from the content of an artifact.packaged matches the subject.id in change.* events. And the user attribute from artifact.downloaded matches the user attribute from the artifact.deleted event.

I am talking about changes in fields; these changes usually break all the historical data you have in some way. As an example, here you have one of the latest changes in the JVM fields for Java instrumentation. I am trying to say that choosing the right fields in the right hierarchy is critical; on those decisions, people will build all their apps, graphs, UI, and processes, ...

Yeah, I totally agree, changes to fields may break historical data. We've been doing our best to pick the right fields and hierarchy on the CDEvents side; however, there is no perfect one, as some structures might fit better for certain tools than others. We expect there'll be a bit of churn in the beginning and plan to switch to 1.x releases once initial adoption gives us confidence that we won't have to make backwards-incompatible changes.

magnusbaeck commented 2 months ago

I feel I need to go back to the issue description that started all this.

The example CI system above would send event data over OTLP with the attribute examples listed above.

This, together with a few other things being said, sounds like OTel spans are basically made into carriers of CDEvents events. If so, why?

choosing one specification over the other, excluding a potential portion of the user base

I'm not sure we'd have to make that choice. Shouldn't we aim at defining attributes that stand on their own, taking mere inspiration from prior art? Regardless of what choice we make, mappings to e.g. event protocols will have to be made and they will be imperfect.

jsuereth commented 2 months ago

A few thoughts here:

  1. Distributed Tracing (in OTEL) is mostly designed around understanding microservice architecture. That is, Spans are defined with a start/end time known for a few reasons:
    • Measuring latency of tasks/operations is one of the key use cases
    • You never have to deal with "missing end event" use cases, so you can always calculate latency/timing info.
    • We assume "short lived" tasks/operations where keeping knowledge of a span in memory and sending on complete is reasonable.
  2. Given CI/CD may have a long time to complete, it's understandable why CDEvents went a different direction. There was talk WAY back of allowing otel to operate this way.
  3. You are correct that context propagation is the main common component you need here.

As such my recommendation would be to do the following in order:

  1. Make sure you can use OpenTelemetry API/SDK propagation to generate CDEvents.
    • make sure the context propagation component is solid (you can continue your ENV variable work).
    • For now, you can likely encode CDEvents in our "event API"/protocol as a prototype/experiment.
  2. Define bidirectional compatibility between OTEL (Events/Spans) <-> CDEvents.
    • I would preserve the semantics of CDEvents as defined. I think the start/stop event solution is likely more amenable to CI/CD workflow use case.
    • You could provide a stateful-ish collector that can convert from CDEvents <-> Otel spans here (see the sketch below).
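
A very rough sketch of such a stateful converter (names, event shapes, and the matching key are assumptions, not a defined component): buffer "started" CDEvents in memory and emit an OTel span once the matching "finished" event for the same subject arrives.

from datetime import datetime

from opentelemetry import trace

def _to_ns(ts: str) -> int:
    # RFC 3339 timestamp -> epoch nanoseconds.
    return int(datetime.fromisoformat(ts.replace("Z", "+00:00")).timestamp() * 1e9)

class CdEventsToSpans:
    def __init__(self, tracer=None):
        self._tracer = tracer or trace.get_tracer("cdevents-converter")
        self._pending = {}  # subject id -> buffered "started" event

    def consume(self, event: dict) -> None:
        subject = event["subject"]["id"]
        event_type = event["context"]["type"]
        if event_type.endswith(".started"):
            self._pending[subject] = event
        elif event_type.endswith(".finished"):
            started = self._pending.pop(subject, None)
            if started is None:
                return  # no start seen: the "missing event" case spans normally avoid
            span = self._tracer.start_span(
                f'{event["subject"]["type"]} {subject}',
                start_time=_to_ns(started["context"]["timestamp"]),
            )
            span.end(end_time=_to_ns(event["context"]["timestamp"]))
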
kuisathaverat commented 2 months ago

Distributed Tracing (in OTEL) is mostly designed around understanding microservice architecture. That is, Spans are defined with a start/end time known for a few reasons:

  • Measuring latency of tasks/operations is one of the key use cases
  • You never have to deal with "missing end event" use cases, so you can always calculate latency/timing info.
  • We assume "short lived" tasks/operations where keeping knowledge of a span in memory and sending on complete is reasonable.

From the experience we have using the OpenTelemetry Jenkins plugin for the last three years or so, distributed tracing fits well to represent the execution of CI/CD pipelines; it helps to chain spans from different tools (CI, Maven, pytest, Ansible, ...) and see the whole picture of the execution in a CI/CD context.

Given CI/CD may have a long time to complete, it's understandable why CDEvents went a different direction. There was talk WAY back of allowing otel to operate this way.

I am unsure if it is relevant; we are talking about CI/CD in general, not a particular implementation of how to represent the execution of a CI pipeline. CDEvents is nice as a starting point for getting ideas, but what matters is the representation of the information received more than how it is sent.

You are correct that context propagation is the main common component you need here.

There is a long discussion about how to propagate the context. Most of the implementations use an environment variable to pass the context between applications; that context is used to configure distributed tracing in OTel in each tool.
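
For the hand-off itself, a sketch (assuming a TRACEPARENT-style variable name, which is exactly the part still under discussion) could inject the current context into the environment of the next tool in the pipeline:

import os
import subprocess

from opentelemetry.propagate import inject

carrier: dict = {}
inject(carrier)  # fills e.g. "traceparent" (and "baggage") from the current context

env = dict(os.environ)
env.update({k.upper(): v for k, v in carrier.items()})  # TRACEPARENT=..., BAGGAGE=...

# Hypothetical next pipeline step; it would extract the context on its side.
subprocess.run(["make", "test"], env=env, check=False)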

afrittoli commented 2 months ago

Distributed Tracing (in OTEL) is mostly designed around understanding microservice architecture. That is, Spans are defined with a start/end time known for a few reasons:

  • Measuring latency of tasks/operations is one of the key use cases
  • You never have to deal with "missing end event" use cases, so you can always calculate latency/timing info.
  • We assume "short lived" tasks/operations where keeping knowledge of a span in memory and sending on complete is reasonable.

From the experience we have using the OpenTelemetry Jenkins plugin for the last three years or so, distributed tracing fits well to represent the execution of CI/CD pipelines; it helps to chain spans from different tools (CI, Maven, pytest, Ansible, ...) and see the whole picture of the execution in a CI/CD context.

Given CI/CD may have a long time to complete, it's understandable why CDEvents went a different direction. There was talk WAY back of allowing otel to operate this way.

I am unsure if it is relevant; we are talking about CI/CD in general, not a particular implementation of how to represent the execution of a CI pipeline. CDEvents is nice as a starting point for getting ideas, but what matters is the representation of the information received more than how it is sent.

The representation of the information is the key aspect of CDEvents as well. Interoperability is achieved with CDEvents through consistency of data representations across tools.

The CDEvent spec defines how CDEvents are transported over the network in the CloudEvents binding, which describes how a CDEvent can be transported in a CloudEvents payload. The CloudEvent binding is separate from the core CDEvent spec by design. CDEvents could in future define an OTEL binding that describes how a CDEvent can be sent using OTEL SDKs.

You are correct that context propagation is the main common component you need here.

There is a long discussion about how to propagate the context. Most of the implementations use an environment variable to pass the context between applications; that context is used to configure distributed tracing in OTel in each tool.

kuisathaverat commented 2 months ago

The representation of the information is the key aspect of CDEvents as well. Interoperability is achieved with CDEvents through consistency of data representations across tools. The CDEvent spec defines how CDEvents are transported over the network in the CloudEvents binding, which describes how a CDEvent can be transported in a CloudEvents payload. The CloudEvent binding is separate from the core CDEvent spec by design. CDEvents could in future define an OTEL binding that describes how a CDEvent can be sent using OTEL SDKs.

Semantic Conventions define a common set of (semantic) attributes which provide meaning to data when collecting, producing and consuming it.

At this point, we are not talking about how information is transported or the rules for interoperating between systems.

pablochacin commented 2 months ago

The TaskRun and PipelineRun names are identical to the Tekton ones, but they are meant to be generic. We picked those because they let us distinguish between execution and definition. Many CI systems let users define, update, and delete their task/steps/pipelines/workflow definitions through a certain process, and execute them through a different one. We don't have events today for the process of creating/editing/deleting pipelines but we might have them in future.

They are not really generic to me. The Run part does not add any value. I'd accept pipeline and task as more generic. The same is for the Run part in the test, which does not apport any value; we are defining the execution of something in the CI/CD, and we know it is a Run already.

@kuisathaverat Maybe I'm missing something here, but I would expect the specification to allow me to associate an event in the CI/CD with the execution of a pipeline (a pipelineRun) and task (a taskRun). That way, I can list all executions of the same pipeline or all executions of the same task (including across multiple pipelines). This is what I see reflected in the cdevents core spec

kuisathaverat commented 2 months ago

@kuisathaverat Maybe I'm missing something here, but I would expect the specification to allow me to associate an event in the CI/CD with the execution of a pipeline (a pipelineRun) and task (a taskRun). That way, I can list all executions of the same pipeline or all executions of the same task (including across multiple pipelines). This is what I see reflected in the cdevents core spec

Recently, OpenTelemetry added the concept of an event; it is experimental. The specification to map an event already exists; we have to enrich the CI/CD part. To trigger CDEvents you must define how to fill the event fields, but I think that is something particular to the CDEvents mapping/implementation in OpenTelemetry, more than semantic conventions about entities and information related to CI/CD in general. OpenTelemetry is a vendor/implementation-agnostic solution, or at least tries to be agnostic.

adrielp commented 1 month ago

Thanks for all the feedback here, this has been a really good conversation. Based on the feedback I think the following things should be true:

I think there's going to be a balance between the use of events -> converting to spans vs emitting spans with SpanEvents. Some attributes defined may more appropriately show up in Events instead of Spans.

I think a good example of this would be the incident attribute. I see two main places this attribute could be emitted.

In both scenarios, it would be impractical to try to build a span across what might be multiple systems, especially since there would be an unknown amount of time between the incident created and closed events. In cases like this, I see Events being ideal.

I'm thinking the general attributes themselves should not be beholden to the Signal or means of propagation, though it's certainly important to think about.

Based on the above conversation, I think the new set of common attributes might look like this, if we want them to be OpenTelemetry defined and specific, yet mappable.

| Attribute | Type | Description | Example |
| --- | --- | --- | --- |
| pipeline.name | string | The name of the pipeline | build_go_project |
| pipeline.id | string | The id of the pipeline | 1220987 |
| pipeline.step.name | string | The name of the step within a pipeline | golang lint, go build, go test |
| pipeline.step.type | enum | build / test / deployment | build, test |
| pipeline.step.id | string | The id of the step | 1029907097 |
| pipeline.step.runner.name | string | Name of the runner | ubuntu runner |
| pipeline.step.runner.id | string | Id of the runner | 12987 |
| pipeline.step.runner.system() | | System attributes according to the system SemConv | system.os.name etc. (these would be the semconv for system in OTEL) |
| build.artifact.name | string | The name of the build artifact | myprojects-go-binary |
| build.artifact.id | string | The id of the built artifact | 1280742109 |
| build.artifact.version | string | A hopefully semantic version of the build artifact | v0.1.0 |
| build.artifact.sha | string | The sha of the artifact | 38090de7003fca23ae70365623071808ea073ff6657521a54952d87189d5d092 |
| deployment.name | string | The name of the deployment | My Deployment |
| deployment.id | string | The id of the deployment | 10927140197 |
| deployment.environment.name | string | The name of the environment deploying to | production |
| deployment.status | string | Success / Failure | success |
| test.suite.name | string | The name of the test suite | go tests |
| test.suite.id | string | The id of the test suite | 124124 |
| test.suite.case.name | string | The name of the test case | TestMyFunction |
| test.suite.case.id | string | The id of the test case | 124091284 |
| test.suite.case.status | string | The status of the test case (success / failure) | success |
| scm.repository.name | string | The name of the repository | x-wing-design-plans |
| scm.repository.ref.name | string | The name of the ref (branch) | main, new-lazers |
| scm.repository.change.name | string | The name of the change (like pull request / merge request) | pull-request, merge-request |
| scm.repository.ref.commit.sha | string | The sha of the commit for the repository ref (ref can be trunk) | 5418eb7892214450a40b129e06c5ecd308884cd9023c681917602474ee6498e1 |
| incident.name | string | The name of the incident | Rebels attacking the death star |
| incident.id | string | The ID of the incident | 03982 |
| incident.severity | string | The severity of the incident | Critical |
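
For illustration only, here is how a CI system might attach a few of these proposed attributes to a span for a single step (values mirror the examples in the table; none of this is final):

from opentelemetry import trace

tracer = trace.get_tracer("ci-system")

with tracer.start_as_current_span(
    "go build",
    attributes={
        "pipeline.name": "build_go_project",
        "pipeline.id": "1220987",
        "pipeline.step.name": "go build",
        "pipeline.step.type": "build",
        "pipeline.step.id": "1029907097",
        "pipeline.step.runner.name": "ubuntu runner",
    },
) as span:
    pass  # the step's work would run here; the span's timing covers its duration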

Underneath this set of attributes, others could come. For example, an incident event may have an additional set of attributes that make the event look like this:

{
  "body": {
    "issue_id": 1243,
    "issue_name": "Rebels attacking the death star",
    "issue_url": "https://example.com/issue/1243"
  },
  "attributes": {
    "incident.name": "Rebels attacking the death star",
    "incident.id": 1243,
    "incident.severity": "Critical",
    "incident.created": "<created timestamp>",
    "incident.closed": "<resolved timestamp>"
  }
  ...
}

In this case the created and closed attributes fall under the Event attributes and not the common attributes.

Additionally, we'll be able to identify where these attributes map to CDEvents/Eiffel attributes, as well as pointing directly to some of them as supplementary in the context of Events, etc. That could in some cases be events within Spans, or Events emitted outside of spans due to the nature of the system.

This leaves me with three questions:

Also, if you're curious, I'm already leveraging Events from GitHub for DORA metrics. The events come into the WebHook OTEL receiver, run through the transform processor and end up looking like this (for deployments).

{
  "body": {
    "deployment": {
      "created_at": "2024-05-16T17:00:36Z",
      "environment": "development",
      "id": 1518950571,
      "ref": "my-feature-branch",
      "sha": "f29f5f4b306dbc961ea3ce76ff884931471ec4b6",
      "task": "deploy",
      "updated_at": "2024-05-16T17:02:37Z",
      "url": "https://api.github.com/repos/org/repo/deployments/1518950571"
    },
    "deployment_status": {
      "environment": "development",
      "state": "success",
      "url": "https://api.github.com/repos/org/repo/deployments/1518950571/statuses/3744913701"
    },
    "repository": {
      "full_name": "org/repo",
      "name": "repo",
      "owner": {
        "login": "org"
      }
    },
    "workflow": {
      "name": "Build/Push/Test Repo Docker Image",
      "path": ".github/workflows/build-test-push.yml",
      "url": "https://api.github.com/repos/org/repo/actions/workflows/96982907"
    }
  },
  "attributes": {
    "repository.name": "repo",
    "repository.owner": "org"
  },
  "instrumentation_scope": {
    "name": "otlp/webhookevent",
    "version": "1.0.0",
    "attributes": {
      "receiver": "webhookevent",
      "source": "webhookevent"
    }
  }
}

Right now, the attributes you're not seeing are turned into labels (because of Loki conventions), but I think the byproduct of having a common convention, with signal specifics, is going to enable wider adoption with methods already in play.

jundai-godaddy commented 1 month ago

At first pass I assumed that build.artifact.sha was the (e.g., git) sha that the artifact was built from, but the existence of scm.repository.ref.commit.sha makes me think that it might be something different? Is it meant to be a checksum of the built artifact? Or is it sometimes the same and sometimes different (up to the interpretation of the person implementing it)?

adrielp commented 1 month ago

@jundai-godaddy - My intent was the checksum of the artifact. scm.repository.ref.commit.sha would also be included alongside that build.artifact metadata.

cyrille-leclerc commented 1 month ago

Hello, this is Cyrille; I co-maintain the Jenkins OTel Plugin with @kuisathaverat and I also maintain the OTel Maven Extension. I'm very excited by the proposal. Regarding adapting what is in place in the Jenkins OTel Plugin and the OTel Maven Extension to those emerging semantic conventions, @kuisathaverat and I would have to check with the Maven and Jenkins communities, but I'm not worried because:

  1. we have already experienced a few changes of the OTel Semantic Conventions, like the HTTP ones; we know how to include feature flags,
  2. we didn't get complaints during our previous changes, I guess mostly because attributes in traces are mostly read by humans

Regarding the question of modeling pipeline executions as events or as traces, @jsuereth is right that OTel traces are not designed for long-running processes, and it causes CI/CD pipeline traces to sometimes look a bit weird. But the pros & cons of using traces for CI/CD have proven to be extremely positive for both the Jenkins and Maven use cases, and we see a growing number of CI/CD tools that embrace OTel traces, so I think we should continue in this direction.

christophe-kamphaus-jemmic commented 1 month ago

Thanks @adrielp, this looks very good. Should attributes for URLs be added? E.g. pipeline.url would link to the CI page of the pipeline. For grouping pipelines (e.g. Jenkins folders or multibranch jobs), should there be an attribute like pipeline.group or pipeline.parent? Or would this be something better represented by the coming Entity model?

Edit: I posted these questions as part of the PR review

dailyherold commented 2 weeks ago

Great discussion, thanks for starting it @adrielp. Some ideas on fields, using the draft taxonomy, based on past implementations of pipeline-emitted events that I found helpful:

thompson-tomo commented 2 weeks ago

A couple of thoughts from my side would be:

Obviously I wouldn't do a one-to-one mapping of the events but instead leverage a span to record the entire action and use attributes to signify the result, i.e. deployment.action, which could be rollback, upgrade, or new.

Happy to go through and do more of a mapping of these CD events to attributes.