Spec format for spans in build logs and bash script output #2773

Closed abitrolly closed 2 years ago

What are you trying to achieve?

Given a build log https://gitlab.com/gitlab-org/gitlab-docs/-/jobs/2983094851 I want to quickly see what took the most time in build process. I can modify GitLab code to render spans, but I need a uniform way to report span timings in output stream so that tools can parse this in a universal way.

What did you expect to see?

A human readable and machine readable one liners to output span info.

One possibility is breadcrumb syntax.

I <span name> start |
| <span name> end | <time>s |

And an example.

| minifying js start |
| minifying js end | 144.18s |

Additional context.

Add any other context about the problem here. If you followed an existing documentation, please share the link to it.

Could not find any diagrams, and/or examples of spans in any representation format.

Hi @abitrolly! Can you please elaborate on why you think this should be handled by the OpenTelemetry specification and what you would expect to have specified? Is your question about an OpenTelemetry SDK in a particular language or the OpenTelemetry Collector?

@arminru you are the expert in this repo, so I don't know where it belong. I just came here with the problem that I need to extract telemetry from build logs. This belongs to specification, because without the spec it is hard to persuade people to format the messages in a suitable way for a wider range of tools, rather than my personal script.

@abitrolly So you're not looking for a particular format to be used by the logging/console/stdout exporters (already) built into most OpenTelemetry SDK (e.g. the ConsoleSpanExporter for OTel JS) and the OpenTelemetry Collector but rather for a general format to be suggested which third party tools could follow?

@arminru if there is a particular standardized format that I can reuse, I am all interested in that. But I could not find anything here.

(e.g. the ConsoleSpanExporter for OTel JS)

I see that it is a class https://open-telemetry.github.io/opentelemetry-js/classes/_opentelemetry_sdk_trace_base.ConsoleSpanExporter.html and not a format, so no examples, "prints spans", but no stated goal that the spans are actually machine readable. So I don't see it is a suitable lead.

@abitrolly Indeed. I think all of the logging/console/stdout exporters in OTel are intended to be human readable for development and debugging and not intended for machine-readability since there are plenty of machine-readable export formats available (first and foremost OTLP). There is a file exporter in the OTel collector which uses the OTLP/JSON format but this does not guarantee stable field names either. Do you think you'd be able to build an integration that consumes OTLP, our well-structured, machine-readable format? If not, can you explain, why having a text-based format is the only reasonable approach for you? Maybe you can find more users to support this idea, so we can consider it for the specification.

@arminru from the link it looks like OTLP is a protocol (as described in its title), and while it describes that it uses "protobuf payloads in JSON format", I don't see examples to figure out how it looks like and if it is suitable to produces from build logs. You say there a plenty of machine-readable formats available. I would prefer the one standardized, but if there are links to a better examples for that specific use case, I am interested to study them.

Do you think you'd be able to build an integration that consumes OTLP, our well-structured, machine-readable format?

I can not see that format. It assumes that the reader can read protobuf protocol, but I can't, so diagrams and example would be beneficial to be on the same wavelength.

If not, can you explain, why having a text-based format is the only reasonable approach for you?

I am open to proposals. My use case if explained in the issue description. I've got a plenty of build logs, and spending time instrumenting every tools in pipeline to produce protobuf output to a specified server it not effective. All information about timings is already present in build logs. It is hard to collect, because there is no standard.

@abitrolly

I can not see that format. It assumes that the reader can read protobuf protocol, but I can't, so diagrams and example would be beneficial to be on the same wavelength.

The protocol is defined in https://github.com/open-telemetry/opentelemetry-proto with the definition for traces being in https://github.com/open-telemetry/opentelemetry-proto/blob/main/opentelemetry/proto/trace/v1/trace.proto. It would mean that your app would need a protobuf client for it, which can be generated for a wide number of languages based on the .proto definitions in that repo.

I've got a plenty of build logs, and spending time instrumenting every tools in pipeline to produce protobuf output to a specified server it not effective.

So the tools in your pipeline are already instrumented with OTel and have the respective languages logging exporter in their pipeline to print to those build logs? If that's the case, then there should be ways to configure them to export OTLP to a tracing backend or - if you decide to do so - your custom app that takes those span timings to produce an analysis for you. This might not even require you to change any code but based on how the respective SDKs are set up, it might be sufficient to just set the respective env vars (first and foremost OTEL_EXPORTER_OTLP_ENDPOINT or OTEL_EXPORTER_OTLP_TRACES_ENDPOINT) as described in https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/exporter.md. You might even be able to just expose the ingest endpoint of your app on the default port.

It would mean that your app would need a protobuf client for it, which can be generated for a wide number of languages based on the .proto definitions in that repo.

@arminru build logs are produced by build scripts, which are often written in bash or in .yaml that calls bash of powerscript, or Python. How do you propose to generate protobuf for this output in the log that I need to process? Not only this, but all timestamps and turn them into spans to see which tasks are nested.

Done in 154.21s.

I've got a plenty of build logs, and spending time instrumenting every tools in pipeline to produce protobuf output to a specified server it not effective.

So the tools in your pipeline are already instrumented with OTel and have the respective languages logging exporter in their pipeline to print to those build logs?

No. If they were instrumented, I wouldn't have to spend the time to instrument them.

Even if I had the time, I don't think that developers of tools that are used in build pipelines, such as C compilers or Python linters agree that they need to support OLTP introducing additional externals dependencies and adding all the complexity and overhead of compiling protobufs in their development and release processes. In my practice releasing Python lib with protobuf was one hell of experience.

@abitrolly So the pipelines and build tools you're trying to analyze are not instrumented with OpenTelemetry at all and you're also not looking to instrument them with OpenTelemetry? How is your question related to OpenTelemetry and the OTel spec then?

Tigran will pursue further to see if this is a duplicate.

It seems to me that the use case you are looking for would traditionally be solved with an OpenTelemetry client implementation for bash, aka a CLI interface for OpenTelemetry.

You don't need a Protobuf library to produce OTLP/JSON. JSON is just a text that you can output from bash as well. Yes, it may be cumbersome, so that's why the question: is OTLP/JSON standard good enough for this use case? In your build tool you can structure the output like this:

{"resource_spans": [{ "scope_spans": [{ "spans": [{ "trace_id": "1", "span_id": "1", "name": "minifying", "start_time_unix_nano": 1544712660300000000,  "end_time_unix_nano": 1544712660600000000}}}}

This is valid OTLP/JSON.

Otherwise, generally speaking I don't see how the standardization of this particular use case belongs to OpenTelemetry. I have not seen in the past the need to express build-time process durations as spans. It may be a totally valid use case for you, but I do not see the widespread need for it that can warrant coming up with a new standardized text format for spans that becomes part of the OpenTelemetry spec.

I am going to mark this as "won't do" for now, however please feel free to comment. If you have additional arguments in favour of your proposal, we will reconsider and if there is enough interest in the community we can work on it.

I am going to keep this open for now to allow time to respond. If we don't hear new supporting arguments we will close the issue.

It seems to me that the use case you are looking for would traditionally be solved with an OpenTelemetry client implementation for bash, aka a CLI interface for OpenTelemetry.

There are even some experimental OSS CLI tools for creating OTel traces: https://github.com/equinix-labs/otel-cli https://github.com/krzko/opentelemetry-shell

You don't need a Protobuf library to produce OTLP/JSON. JSON is just a text that you can output from bash as well. Yes, it may be cumbersome, so that's why the question: is OTLP/JSON standard good enough for this use case? In your build tool you can structure the output like this:
{"resource_spans": [{ "scope_spans": [{ "spans": [{ "trace_id": "1", "span_id": "1", "name": "minifying", "start_time_unix_nano": 1544712660300000000,  "end_time_unix_nano": 1544712660600000000}}}}

Thanks. That's the example that I could not find in entire spec. If the spec could contain a TL;DR chapter with the example of how spans desciption looks like and how it is finally rendered, it would save a lot of hours. I remember Zipkin had a very inspring web site for newcomers.

I had to insert a couple of ]]] to make the example a valid JSON.

{
  "resource_spans": [
    {
      "scope_spans": [
        {
          "spans": [
            {
              "trace_id": "1",
              "span_id": "1",
              "name": "minifying",
              "start_time_unix_nano": 1544712660300000000,
              "end_time_unix_nano": 1544712660600000000
            }
          ]
        }
      ]
    }
  ]
}

So only the full span is sent? I need to trace a CI/CD process that can never complete, and I'd like trace it in real time, so that span start is rendered even if the end is not received. Is that possible with OTEL/JSON?

This is valid OTLP/JSON.

Otherwise, generally speaking I don't see how the standardization of this particular use case belongs to OpenTelemetry. I have not seen in the past the need to express build-time process durations as spans. It may be a totally valid use case for you, but I do not see the widespread need for it that can warrant coming up with a new standardized text format for spans that becomes part of the OpenTelemetry spec.

I am going to mark this as "won't do" for now, however please feel free to comment. If you have additional arguments in favour of your proposal, we will reconsider and if there is enough interest in the community we can work on it.

If the spec could contain a TL;DR chapter with the example of how spans desciption looks like and how it is finally rendered, it would save a lot of hours.

That's a good point. I created an issue to add some examples.

So only the full span is sent? I need to trace a CI/CD process that can never complete, and I'd like trace it in real time, so that span start is rendered even if the end is not received. Is that possible with OTEL/JSON?

No, this is not possible in OTLP/JSON. Generally speaking, this is not possible in OpenTelemetry. Only complete spans can be exported from OpenTelemetry SDK, so there is no support for incomplete spans in OTLP either.

No, this is not possible in OTLP/JSON. Generally speaking, this is not possible in OpenTelemetry. Only complete spans can be exported from OpenTelemetry SDK, so there is no support for incomplete spans in OTLP either.

Thanks. I can't imagine how much I would need to read to get that answer.

That makes it hard to debug long running jobs. I would use this Jsonl as a minimal format to track duration and dependency of various jobs in real-time.

{"span": "minifying", "start": "09:07:00"}
{"span": "minifying", "duration": "02:00"}

This may become a replacement or an addition to formats that CI/CD systems currently use to make build logs more readable.

https://docs.gitlab.com/ee/ci/jobs/index.html#custom-collapsible-sections https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions#grouping-log-lines

With uniform format for embedding span info into web page content, it will even be possible to view the trace from F12 browser development tools. Which means avoid installing additional Jaeger server, authenticating, and searching for the trace that is produced by a current page in a separate tab.

Generally speaking, this is not possible in OpenTelemetry. Only complete spans can be exported from OpenTelemetry SDK

Small correction on this. In theory I think OpenTelemetry API could support this if the destination format allowed this (since Span creation API only requires the start timestamp), but in practice I don't think we have any implementations like this. And it is definitely not possible in OTLP.

That makes it hard to debug long running jobs. I would use this Jsonl as a minimal format to track duration and dependency of various jobs in real-time.

Why not record this data as Otel LogRecords? Why does it have to be Spans?

To me this sounds like a topic of semantic conventions for instrumenting build pipelines. See the following sample payload (the timestamps are not adjusted) - we can model the build process using a combination of Spans and standalone Events. The build job would be the parent span and each step in the build would be a child span. And while a span is in progress, you can emit standalone Events (with an association to the span) to indicate progress (this is not in the sample below).

It is possible even today that parent spans don't get reported, in which case you will have a partial trace. You can use the partial trace for visualization purpose. We can address the gap of a parent span not getting reported, via a standalone Event. That is, you can emit an event every time you start a span.

We intend to follow this approach in browser instrumentation for the page views. The PageView span takes some time to build and so there is a risk of losing it if the user navigates away, so we intend to emit an Event at the earliest opportunity and send the Span later.

Here's the visualization and sample payload.

 |------ job a-------------------------|
    |-- build --|
                 |--test--|
                            |--package--|

And the event timeline:

.A..B............C..........D......

where
Event A - job a started
Event B - build step started
Event C - test step started
Event D - package step started

{
    "resourceSpans": [
      {
        "resource": {
          "attributes": [
          ],
          "droppedAttributesCount": 0
        },
        "scopeSpans": [
          {
            "spans": [
              {
                "traceId": "b5dd69fce623bfcecfa6747ac6c7a247",
                "spanId": "a06a1e805c399a37",
                "parentSpanId": "2a167ab1c1b04756",
                "name": "Foo",
                "startTimeUnixNano": 1659114423246200000,
                "endTimeUnixNano": 1659114423262200000,
                "attributes": [
                  {
                    "key": "build.step.id",
                    "value": {
                      "stringValue": "build"
                    }
                  }
                ],
                "droppedAttributesCount": 0
               },
               {
                "traceId": "b5dd69fce623bfcecfa6747ac6c7a247",
                "spanId": "a06a1e805c399a37",
                "parentSpanId": "2a167ab1c1b04756",
                "name": "Foo",
                "startTimeUnixNano": 1659114423246200000,
                "endTimeUnixNano": 1659114423262200000,
                "attributes": [
                  {
                    "key": "build.step.id",
                    "value": {
                      "stringValue": "test"
                    }
                  }
                ],
                "droppedAttributesCount": 0
               },
               {
                "traceId": "b5dd69fce623bfcecfa6747ac6c7a247",
                "spanId": "a06a1e805c399a37",
                "parentSpanId": "2a167ab1c1b04756",
                "name": "Foo",
                "startTimeUnixNano": 1659114423246200000,
                "endTimeUnixNano": 1659114423262200000,
                "attributes": [
                  {
                    "key": "build.step.id",
                    "value": {
                      "stringValue": "package"
                    }
                  }
                ],
                "droppedAttributesCount": 0
               },
               {
                "traceId": "b5dd69fce623bfcecfa6747ac6c7a247",
                "spanId": "2a167ab1c1b04756",
                "name": "Foo",
                "startTimeUnixNano": 1659114423246200000,
                "endTimeUnixNano": 1659114423262200000,
                "attributes": [
                  {
                    "key": "build.job.name",
                    "value": {
                      "stringValue": "job-a"
                    }
                  },
                  {
                    "key": "build.job.id",
                    "value": {
                      "stringValue": "123"
                    }
                  }

                ],
                "droppedAttributesCount": 0
               },
            ],    
            "scope": {
              "name": "@opentelemetry/instrumentation-build",
              "version": "0.0.1"
            }
          }
        ]
      }
    ]
  }

Why not record this data as Otel LogRecords? Why does it have to be Spans?

@tigrannajaryan build log already contains log records. There is no need to add another interface to read them. But identifying bottlenecks where the processing stalls and takes a lot of time is the problem that is not solved programmatically, and for that real-time spans are needed.

It is possible even today that parent spans don't get reported, in which case you will have a partial trace. You can use the partial trace for visualization purpose. We can address the gap of a parent span not getting reported, via a standalone Event. That is, you can emit an event every time you start a span.

@scheler yes. Maybe there is no point in separate span payload then. If everything can be encoded as an event.

@abitrolly the reason for using spans would be to leverage the tooling built for spans by the telemetry backends (visualization, analytics, correlation, etc). Otherwise, I agree that everything can be encoded in the LogRecord data model (which is the basis for Events too).

Prioritizing representation over real world use cases seems like an omission when designing telemetry collection framework and its underlying data formats. Is there a list of use cases that had been considered when designing the spec? To see if my story of recording incomplete spans matches any of them?

Regardless of historic decisions, I am still interested to see some solution I can adapt.

Buildkite seems to use timestamp parsing to detect operation durations, and while start of operation is marked by some sequence like +++, there are no signals when operations end. So, for example, if the build process is wrapped into Dockerfile, it is no longer possible to trace span nesting and durations. Some standard on these hacks is needed to speed up the progress.

https://buildkite.com/sorbet/sorbet/builds/24799#01837d5f-70eb-43e9-8100-11c5fb92cb16

@tigrannajaryan

Small correction on this. In theory I think OpenTelemetry API could support this if the destination format allowed this (since Span creation API only requires the start timestamp), but in practice I don't think we have any implementations like this.

We do have the Z-Pages feature: https://github.com/open-telemetry/opentelemetry-specification/blob/main/experimental/trace/zpages.md which is implemented in Java: https://github.com/open-telemetry/opentelemetry-java/tree/main/sdk-extensions/incubator/src/main/java/io/opentelemetry/sdk/extension/incubator/trace/zpages

We also export spans before they are ended in Dynatrace.

But this discussion maybe belongs to https://github.com/open-telemetry/opentelemetry-specification/issues/373

Closing since this item has been inactive for a couple weeks. Please re-open if there is further interest.

I do have interest in retracing operations from debug logs, using incomplete spans. However, I can not reopen this issue.

zPages from open-telemetry/opentelemetry-specification#373 and https://github.com/open-telemetry/opentelemetry-specification/blob/v1.16.0/experimental/trace/zpages.md propose to expose current incomplete/inprocess span metrics, which is a great idea, but it doesn't really solve my problem of recording and telemetry about build process in human readable form in build logs.

Like installing GitLab Development Kit takes from 10 to 30 minutes. Until the installation is finished, no services (including Jaeger) is available.

So far I have these requirements for recording span data in build logs.

Span info should be human readable
Span info should be machine readable
Incomplete spans span indefinitely
Span info takes one line

Variant 1. JSON Lines. Start/duration.

{"span": "minifying", "start": "09:07:00"}
{"span": "minifying", "duration": "02:00"}

Pros - easy to parse. Cons - hard to read, hard to scan for visually, can get lost among other JSON output.

Variant 2. Star Wars. Duration.

|-- build --|
|-- build --| 02:00

Pros - readable, easy to search for, parser is \|-- (\w+) --\|( ([:0-9]+))? Cons - looks quite alien.

Parsing logic

If first match is without duration, open span with that name
If the first match is with duration, open span from the beginning
If the next match is without duration, open new nested span with that name (even if the name matches, but notify about that if possible)
If the next match is with duration, and name matches one of the previous, close that previous
If the next match is with duration, but the name is not known, open new nested span from the beginning of current span

I think that should cover it.

open-telemetry / opentelemetry-specification