open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector

probabilisticsampler processor stops sampling 'sampling_percentage: 60' #30079

Open Sakib37 opened 9 months ago

Sakib37 commented 9 months ago

Component(s)

processor/probabilisticsampler

What happened?

Description

I am trying to control the percentage of logs that are shipped to the backend, using the probabilisticsamplerprocessor. During this test there was no change in the volume of logs in the cluster (i.e., no new pods were added).

I am using the following config:

      probabilistic_sampler/logs:
        hash_seed: 22
        sampling_percentage: 98
        attribute_source: record
        from_attribute: "cluster" # This attribute is added via 'resource' processor

With this config, I get around 1.8K logs in the Datadog dashboard. I then gradually reduce sampling_percentage from 98 to 90, 80, 70, 65, and finally 60. Down to 65, the sampling has no noticeable effect in Datadog and the total number of logs stays almost the same.

However, when I set sampling_percentage to 60, no logs are available in the backend (Datadog). I tried the following two configs as well:

  probabilistic_sampler/logs:
    hash_seed: 22
    sampling_percentage: 98

  probabilistic_sampler/logs:
    sampling_percentage: 98

In every case, when I set sampling_percentage to 60, there are no logs in the backend. My log pipeline in the OTel Collector is as follows:

 logs/datadog:
      exporters:
      - debug
      - datadog
      processors:
      - resource/common
      - k8sattributes
      - memory_limiter
      - probabilistic_sampler/logs
      - batch/logs
      - transform/filelog_labels
      receivers:
      - filelog

Steps to Reproduce

Sample logs using the probabilisticsamplerprocessor and set sampling_percentage to 60 or below.

Expected Result

I expect sampling to be proportional to the configured percentage. If 65% sampling yields 1k logs, then 60% sampling should yield at least ~900 log lines in the backend.

Actual Result

No logs in the backend after setting sampling_percentage to 60

Collector version

0.91.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler (if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

receivers:
    filelog:
      exclude: []
      include:
      - /var/log/pods/*/*/*.log
      include_file_name: false
      include_file_path: true
      operators:
      - id: get-format
        routes:
        - expr: body matches "^\\{"
          output: parser-docker
        - expr: body matches "^[^ Z]+ "
          output: parser-crio
        - expr: body matches "^[^ Z]+Z"
          output: parser-containerd
        type: router
      - id: parser-crio
        regex: ^(?P<time>[^ Z]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$
        timestamp:
          layout: 2006-01-02T15:04:05.999999999Z07:00
          layout_type: gotime
          parse_from: attributes.time
        type: regex_parser
      - combine_field: attributes.log
        combine_with: ""
        id: crio-recombine
        is_last_entry: attributes.logtag == 'F'
        max_log_size: 102400
        output: extract_metadata_from_filepath
        source_identifier: attributes["log.file.path"]
        type: recombine
      - id: parser-containerd
        regex: ^(?P<time>[^ ^Z]+Z) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$
        timestamp:
          layout: '%Y-%m-%dT%H:%M:%S.%LZ'
          parse_from: attributes.time
        type: regex_parser
      - combine_field: attributes.log
        combine_with: ""
        id: containerd-recombine
        is_last_entry: attributes.logtag == 'F'
        max_log_size: 102400
        output: extract_metadata_from_filepath
        source_identifier: attributes["log.file.path"]
        type: recombine
      - id: parser-docker
        output: extract_metadata_from_filepath
        timestamp:
          layout: '%Y-%m-%dT%H:%M:%S.%LZ'
          parse_from: attributes.time
        type: json_parser
      - id: extract_metadata_from_filepath
        parse_from: attributes["log.file.path"]
        regex: ^.*\/(?P<namespace>[^_]+)_(?P<pod_name>[^_]+)_(?P<uid>[a-f0-9\-]+)\/(?P<container_name>[^\._]+)\/(?P<restart_count>\d+)\.log$
        type: regex_parser
      - from: attributes.stream
        to: attributes["log.iostream"]
        type: move
      - from: attributes.container_name
        to: resource["k8s.container.name"]
        type: move
      - from: attributes.namespace
        to: resource["k8s.namespace.name"]
        type: move
      - from: attributes.pod_name
        to: resource["k8s.pod.name"]
        type: move
      - from: attributes.restart_count
        to: resource["k8s.container.restart_count"]
        type: move
      - from: attributes.uid
        to: resource["k8s.pod.uid"]
        type: move
      - from: attributes.log
        to: body
        type: move

  batch/logs:
      # send_batch_max_size must be greater or equal to send_batch_size
      send_batch_max_size: 11000
      send_batch_size: 10000
      timeout: 10s

  transform/filelog_labels:
          log_statements:
          - context: log
            statements:
            # For the index
            - set(resource.attributes["service.name"], "integrations/kubernetes/logs")
            - set(resource.attributes["cluster"], attributes["cluster"])
            - set(resource.attributes["pod"], resource.attributes["k8s.pod.name"])
            - set(resource.attributes["container"], resource.attributes["k8s.container.name"])
            - set(resource.attributes["namespace"], resource.attributes["k8s.namespace.name"])
            - set(resource.attributes["filename"], attributes["log.file.path"])
            - set(resource.attributes["loki.resource.labels"], "pod, namespace, container, cluster, filename")
            # For the body
            - set(resource.attributes["loki.format"], "raw")
            - >
              set(body, Concat([
                Concat(["name", resource.attributes["k8s.object.name"]], "="),
                Concat(["kind", resource.attributes["k8s.object.kind"]], "="),
                Concat(["action", attributes["k8s.event.action"]], "="),
                Concat(["objectAPIversion", resource.attributes["k8s.object.api_version"]], "="),
                Concat(["objectRV", resource.attributes["k8s.object.resource_version"]], "="),
                Concat(["reason", attributes["k8s.event.reason"]], "="),
                Concat(["type", severity_text], "="),
                Concat(["count", resource.attributes["k8s.event.count"]], "="),
                Concat(["msg", body], "=")
              ], " "))

  exporters:
    debug: {}
    datadog:
      api:
        key: $${env:DATADOG_API_KEY}
        site: datadoghq.com

  service:
    extensions:
      - health_check
      - memory_ballast

  pipelines:
    logs/datadog:
        receivers:
          - filelog
        processors:
          - resource/common
          - k8sattributes
          - memory_limiter
          - probabilistic_sampler/logs
          - batch/logs
          #- transform/filelog_labels
        exporters:
          - debug
          - datadog

Log output

2023-12-19 10:43:45,447 INFO app [trace_id=de6328f2336ce1f7feeee7b512330250 span_id=1dcf616458576690 resource.service.name=ping_pong] waitress-2 : custom log 2
2023-12-19 10:43:45,959 INFO app [trace_id=a5e9e494204dd4feef2d5b1b90a04d7a span_id=dc3a55ba5efb65ab resource.service.name=ping_pong] waitress-3 : custom log 1
2023-12-19 10:43:45,959 INFO app [trace_id=a5e9e494204dd4feef2d5b1b90a04d7a span_id=e25a65fd90261b82 resource.service.name=ping_pong] waitress-3 : custom log 2
2023-12-19 10:43:46,484 INFO app [trace_id=c92a5ae80e33c1bc7f072067d204e2e5 span_id=08a8f332bb7eea3a resource.service.name=ping_pong] waitress-1 : custom log 1
2023-12-19 10:43:46,490 INFO app [trace_id=c92a5ae80e33c1bc7f072067d204e2e5 span_id=88b75c683989c2ef resource.service.name=ping_pong] waitress-1 : custom log 2
2023-12-19 10:43:47,024 INFO app [trace_id=ba552851ef2df3eaf9fddeae77ed40c2 span_id=d5a840450ec02c9e resource.service.name=ping_pong] waitress-0 : custom log 1

Additional context

No response

github-actions[bot] commented 9 months ago

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

atoulme commented 9 months ago

That all depends on the record attribute you use as the source of the sampling decision (here "cluster", via attribute_source: record). It looks like its values are not evenly distributed, and therefore you get uneven results.
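
For illustration, here is a minimal sketch of why hashing a record attribute that only ever has one value gives an all-or-nothing result. This is not the processor's exact algorithm, and "my-cluster" is a made-up value; it only shows the bucketing principle: every record with the same attribute value falls into the same hash bucket, so it is either always kept or always dropped, no matter how finely sampling_percentage is tuned.

// A minimal sketch (not the processor's exact algorithm) of why hashing a
// record attribute that only ever has one value ("cluster" here) produces
// an all-or-nothing outcome: every record maps to the same hash bucket,
// so it is either always kept or always dropped as sampling_percentage
// crosses a single cutoff.
package main

import (
    "fmt"
    "hash/fnv"
)

// keep hashes the seed together with the attribute value, maps the hash
// onto 0..99, and compares it with the configured percentage. The real
// probabilisticsampler uses its own hash and scaling, but the bucketing
// principle is the same.
func keep(seed uint32, attrValue string, samplingPercentage float64) bool {
    h := fnv.New32a()
    fmt.Fprintf(h, "%d:%s", seed, attrValue)
    return float64(h.Sum32()%100) < samplingPercentage
}

func main() {
    // All log records carry the same (made-up) "my-cluster" value, so they
    // all share one bucket: above some cutoff 100% pass, below it 0% pass.
    for _, pct := range []float64{98, 90, 80, 70, 65, 60} {
        fmt.Printf("sampling_percentage=%v -> keep all records: %v\n",
            pct, keep(22, "my-cluster", pct))
    }
}

The decision flips from keeping everything to keeping nothing at a single cutoff percentage, which matches the binary behavior described above.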

pierzapin commented 8 months ago

Just adding a +1 to this report: similar barebones config, same binary outcome. I see that @Sakib37 experienced this without the "attribute_source: record" configuration (as did I), which would indicate that @atoulme's observation here is unlikely to be the only factor.

jpkrohling commented 7 months ago

Would you please provide the state of the count_logs_sampled metric, as well as the receiver's "accepted" and the exporter's "sent" log record counts? This would help us understand where the problem might be.
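
For reference, the Collector exposes these counters on its internal Prometheus endpoint (port 8888 by default; adjust if service.telemetry.metrics.address is set differently). The snippet below is only a convenience sketch for pulling the relevant lines; exact metric names vary slightly between Collector versions, and curl http://localhost:8888/metrics works just as well.

// A small convenience sketch for pulling the sampler/receiver/exporter
// self-metrics from the Collector's Prometheus endpoint. Exact metric
// names differ between Collector versions, so it filters on substrings
// such as "sampl", "accepted_log_records", and "sent_log_records".
package main

import (
    "bufio"
    "fmt"
    "net/http"
    "strings"
)

func main() {
    // Default internal telemetry address; change it if your deployment
    // overrides service.telemetry.metrics.address.
    resp, err := http.Get("http://localhost:8888/metrics")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    scanner := bufio.NewScanner(resp.Body)
    for scanner.Scan() {
        line := scanner.Text()
        if strings.Contains(line, "sampl") ||
            strings.Contains(line, "accepted_log_records") ||
            strings.Contains(line, "sent_log_records") {
            fmt.Println(line)
        }
    }
    if err := scanner.Err(); err != nil {
        panic(err)
    }
}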

github-actions[bot] commented 5 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 2 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

jpkrohling commented 2 months ago

@jmacd , do you have time to look into this one?

github-actions[bot] commented 1 week ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.