open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

[zipkin receiver]: invalid spans cause the entire batch to be rejected #11024

Closed: sfishel-splunk closed this issue 1 year ago

sfishel-splunk commented 2 years ago

Describe the bug

When a batch of spans sent to the zipkin receiver includes an invalid span (e.g. one with a negative duration), the entire batch is rejected.

Steps to reproduce

# command
curl --location --request POST 'http://localhost:9411/api/v2/spans' \
--header 'content-type: application/json' \
--data-raw '[
  {
    "id": "352bff9a74ca9ad2",
    "traceId": "5af7183fb1d4cf5f",
    "parentId": "6b221d5bc9e6496c",
    "name": "get /api",
    "timestamp": 1556604172355737,
    "duration": 1431,
    "kind": "SERVER",
    "localEndpoint": {
      "serviceName": "backend",
      "ipv4": "192.168.99.1",
      "port": 3306
    },
    "remoteEndpoint": {
      "ipv4": "172.19.0.2",
      "port": 58648
    },
    "tags": {
      "http.method": "GET",
      "http.path": "/api"
    }
  },
  {
    "id": "352bff9a74ca9ad3",
    "traceId": "5af7183fb1d4cf5f",
    "parentId": "6b221d5bc9e6496c",
    "name": "get /api",
    "timestamp": 1556604172355737,
    "duration": -1431,
    "kind": "SERVER",
    "localEndpoint": {
      "serviceName": "backend",
      "ipv4": "192.168.99.1",
      "port": 3306
    },
    "remoteEndpoint": {
      "ipv4": "172.19.0.2",
      "port": 58648
    },
    "tags": {
      "http.method": "GET",
      "http.path": "/api"
    }
  }
]'

# output
json: cannot unmarshal number -1431 into Go struct field .duration of type uint64
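
For context, this looks like standard encoding/json behavior rather than anything specific to the zipkin receiver: decoding a JSON array into a slice of structs fails as a whole if any element cannot be unmarshalled. A minimal sketch of that behavior (the span struct below is illustrative, not the receiver's actual Zipkin model):

package main

import (
	"encoding/json"
	"fmt"
)

// Illustrative struct; the zipkin receiver's real model type is more complex.
type span struct {
	ID       string `json:"id"`
	Duration uint64 `json:"duration"`
}

func main() {
	// One valid span and one with a negative duration, mirroring the curl payload above.
	payload := []byte(`[{"id":"a","duration":1431},{"id":"b","duration":-1431}]`)

	var spans []span
	// The whole array fails to decode because a single element cannot fit
	// its negative duration into a uint64 field.
	if err := json.Unmarshal(payload, &spans); err != nil {
		fmt.Println(err) // cannot unmarshal number -1431 into ... field of type uint64
	}
}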

What did you expect to see? I expected that the invalid span would be ignored but the rest of the batch would be accepted.

What did you see instead? The entire batch was rejected.

What version did you use? 0.36.0

What config did you use?


exporters:
  sapm/sfmonitoring:
    access_token: ${ACCESS_TOKEN}
    endpoint: https://ingest.lab0.signalfx.com/v2/trace
    sending_queue:
      queue_size: 512
  sapm/sfmonitoring_rc:
    access_token: ${ACCESS_TOKEN_RC}
    endpoint: https://ingest.rc0.signalfx.com/v2/trace
    sending_queue:
      queue_size: 512
  signalfx/sfmonitoring:
    access_token: ${ACCESS_TOKEN}
    api_url: https://api.lab0.signalfx.com
    ingest_url: https://ingest.lab0.signalfx.com
    sending_queue:
      queue_size: 512
    timeout: 5s
  signalfx/sfmonitoring_rc:
    access_token: ${ACCESS_TOKEN_RC}
    access_token_passthrough: false
    api_url: https://api.rc0.signalfx.com
    ingest_url: https://ingest.rc0.signalfx.com
    sending_queue:
      queue_size: 512
    timeout: 5s
  splunk_hec:
    endpoint: https://ingest.lab0.signalfx.com/v1/log
    index: main
    sending_queue:
      queue_size: 512
    token: ${ACCESS_TOKEN}
extensions:
  health_check: {}
processors:
  batch:
    send_batch_size: 1024
    timeout: 1s
  k8sattributes:
    pod_association:
    - from: resource_attribute
      name: k8s.pod.uid
  resource/add_cluster_name:
    attributes:
    - action: upsert
      key: k8s.cluster.name
      value: lab0
  resource/add_collector_attributes:
    attributes:
    - action: insert
      key: k8s.node.name
      value: ${K8S_NODE_NAME}
    - action: insert
      key: k8s.pod.name
      value: ${K8S_POD_NAME}
    - action: insert
      key: k8s.pod.uid
      value: ${K8S_POD_UID}
    - action: insert
      key: k8s.namespace.name
      value: ${K8S_NAMESPACE}
  resource/add_realm:
    attributes:
    - action: insert
      key: sfx_realm
      value: lab0
  resource/environment:
    attributes:
    - action: insert
      key: deployment.environment
      value: lab0
  resource/rename_pod_uid:
    attributes:
    - action: insert
      from_attribute: kubernetes_pod_uid
      key: k8s.pod.uid
receivers:
  collectd:
    endpoint: 0.0.0.0:8081
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_http:
        endpoint: 0.0.0.0:14268
  otlp:
    protocols:
      grpc: null
      http: null
  prometheus/collector:
    config:
      scrape_configs:
      - job_name: otel-collector
        scrape_interval: 10s
        static_configs:
        - targets:
          - ${K8S_POD_IP}:8888
  sapm:
    endpoint: 0.0.0.0:7276
  signalfx:
    access_token_passthrough: true
    endpoint: 0.0.0.0:9943
  zipkin:
    endpoint: 0.0.0.0:9411
service:
  extensions:
  - health_check
  pipelines:
    logs:
      exporters:
      - splunk_hec
      processors:
      - memory_limiter
      - batch
      receivers:
      - otlp
    logs/signalfx-events:
      exporters:
      - signalfx/sfmonitoring
      processors:
      - memory_limiter
      - batch
      receivers:
      - signalfx
    metrics:
      exporters:
      - signalfx/sfmonitoring
      - signalfx/sfmonitoring_rc
      processors:
      - memory_limiter
      - k8sattributes
      - batch
      receivers:
      - signalfx
      - collectd
    metrics/collector:
      exporters:
      - signalfx/sfmonitoring
      - signalfx/sfmonitoring_rc
      processors:
      - memory_limiter
      - batch
      - resource/add_cluster_name
      - resource/add_collector_attributes
      - resource/add_realm
      receivers:
      - prometheus/collector
    traces:
      exporters:
      - sapm/sfmonitoring
      - sapm/sfmonitoring_rc
      processors:
      - memory_limiter
      - batch
      - resource/environment
      - attributes/rum-api
      - attributes/rum-api-dns
      - attributes/rum-api-post
      - attributes/rum-presto
      - resource/rename_pod_uid
      - k8sattributes
      receivers:
      - otlp
      - jaeger
      - zipkin
      - sapm
  telemetry:
    metrics:
      address: 0.0.0.0:8888

Environment: quay.io/signalfx/splunk-otel-collector Docker image

dmitryax commented 2 years ago

Hi @sfishel-splunk, this is by design. The JSON payload is unmarshalled as a whole, so if any part of it is invalid, the entire payload is rejected. Adding an extra step to break the JSON batch payload down into individual JSON documents per span would significantly increase the cost, and I don't think we want to do that.

dmitryax commented 2 years ago

I will keep this open for a while in case anyone else has other opinions. Or, @sfishel-splunk, if you are interested in experimenting and trying to make this happen with low overhead (or as optional functionality), feel free to do so.
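
For anyone who wants to experiment with this, a rough sketch of the optional two-pass approach being discussed, using only the standard library (the span type is illustrative and this is not the receiver's actual code):

package main

import (
	"encoding/json"
	"fmt"
)

// Illustrative span type; the zipkin receiver uses its own model.
type span struct {
	ID       string `json:"id"`
	Duration uint64 `json:"duration"`
}

func main() {
	payload := []byte(`[{"id":"a","duration":1431},{"id":"b","duration":-1431}]`)

	// First pass: split the batch into raw per-span JSON documents.
	var raw []json.RawMessage
	if err := json.Unmarshal(payload, &raw); err != nil {
		fmt.Println("rejecting whole batch:", err)
		return
	}

	// Second pass: decode each span individually and skip invalid ones
	// instead of rejecting the whole batch. This is the extra work that
	// adds cost, since every span is effectively unmarshalled twice.
	var spans []span
	for i, msg := range raw {
		var s span
		if err := json.Unmarshal(msg, &s); err != nil {
			fmt.Printf("skipping span %d: %v\n", i, err)
			continue
		}
		spans = append(spans, s)
	}
	fmt.Printf("accepted %d of %d spans\n", len(spans), len(raw))
}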

sfishel-splunk commented 2 years ago

@dmitryax Makes sense; if it's for performance reasons, that's understandable. I'm not much of a Go programmer, so I'm unlikely to try this myself :) But it's not blocking me or anything; I just filed it when I noticed it.

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

github-actions[bot] commented 1 year ago

This issue has been closed as inactive because it has been stale for 120 days with no activity.