open-telemetry / opentelemetry-collector

OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0
4.36k stars 1.45k forks source link

Proto mismatch errors not obvious when posting traces #6320

Open Pitta opened 2 years ago

Pitta commented 2 years ago

Describe the bug When sending test traces via otlp-http, if the format of the trace is mismatched the server still returns a 200 with no indication of a problem.

$ curl -i http://localhost:4318/v1/traces -X POST -H "Content-Type: application/json" -d @trace.json

HTTP/1.1 200 OK
Content-Type: application/json
Date: Sat, 15 Oct 2022 00:30:09 GMT
Content-Length: 2

{}

Steps to reproduce config: BELOW

docker-compose.yaml:

version: "3"
services:
  otel-collector:
    image: otel/opentelemetry-collector:0.60.0
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC receiver
      - "4318:4318"   # OTLP http receiver
      - "13133:13133" # Health Check

trace.json (from this example)

{
  "resourceSpans": [
    {
      "resource": {
        "attributes": [
          {
            "key": "service.name",
            "value": {
              "stringValue": "curl-test-otel-pipeline"
            }
          }
        ]
      },
      "instrumentationLibrarySpans": [
        {
          "spans": [
            {
              "traceId": "71699b6fe85982c7c8995ea3d9c95df2",
              "spanId": "3c191d03fa8be065",
              "name": "test-span",
              "kind": 1,
              "droppedAttributesCount": 0,
              "events": [],
              "droppedEventsCount": 0,
              "status": {
                "code": 1
              }
            }
          ],
          "instrumentationLibrary": {
            "name": "local-curl-example"
          }
        }
      ]
    }
  ]
 }
  1. docker compose up
  2. from another terminal curl -i http://localhost:4318/v1/traces -X POST -H "Content-Type: application/json" -d @trace.json

What did you expect to see? A 200 response if the trace data was 👍 Anything else with a message mentioning the proto mistmatch

What did you see instead?

$ curl -i http://localhost:4318/v1/traces -X POST -H "Content-Type: application/json" -d @trace.json

HTTP/1.1 200 OK
Content-Type: application/json
Date: Sat, 15 Oct 2022 00:30:09 GMT
Content-Length: 2

{}

What version did you use? Version: 0.60.0

What config did you use?

exporters:
  logging:

extensions:
  health_check: {}
  memory_ballast: {}

processors:
  batch: {}
  memory_limiter:
    check_interval: 5s
    limit_mib: 409
    spike_limit_mib: 128

receivers:
  otlp:
    protocols:
      grpc:
      http:

service:
  extensions:
  - health_check
  - memory_ballast
  pipelines:
    traces:
      exporters:
      - logging
      - otlp
      processors:
      - memory_limiter
      - batch
      receivers:
      - otlp
  telemetry:
    logs:
      level: "debug"
austinlparker commented 2 years ago

FYI I replicated this on latest (0.62.1) with the same config + json payload.

TylerHelmuth commented 2 years ago

I've seen a customer experience this issue recently as well when sending traces via the Swift SDK, which didn't have the latest proto yet.

bogdandrutu commented 2 years ago

This is a bit unfortunate, I can explain why this happens, not sure what is the right behavior:

The combination of the 2 makes this case to return 200. Not sure how based on the current spec we can return a different code.

austinlparker commented 2 years ago

I think if we returned a useful response body, it'd help. Right now you get 200 OK with an empty response; something like "dropped unknown fields [xxx, yyy, zzz]" would at least give users a clue what was happening.

TylerHelmuth commented 2 years ago

In the PR that implemented the spec change in the collector we discussed a scenario like this: https://github.com/open-telemetry/opentelemetry-collector/pull/5931. Maybe it's time to allow "strict-enforcement" via https://github.com/open-telemetry/opentelemetry-collector/issues/5935.

I also like the idea of having some important fields, like ResourceSpans and InstrumentationScopeSpans, that are vocal when skipped. This would help users know what's happening.

TylerHelmuth commented 2 years ago

As discussed in the SIG today, maybe OTLP's partial success response can help here.

tigrannajaryan commented 2 years ago

As discussed in the SIG today, maybe OTLP's partial success response can help here.

To clarify: for this to work we need to add "accepted" counter to the partial success response.