open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Can't avoid duplicate metrics in short-lived cloud functions, e.g. GCP Cloud Functions #35522

Open AkselAllas opened 2 hours ago

AkselAllas commented 2 hours ago

Component(s)

exporter/googlecloud

Describe the issue you're reporting

I am calling forceFlush multiple times in quick succession (e.g. twice within 0.5 s) because GCP Cloud Functions run once and then detach the CPU. As a result, a periodic metric exporter will often either fail to export (because CPU/network has detached before the export runs) or produce error spam by trying to export after CPU/network has detached.

Is it possible to call metric forceFlush multiple times (e.g. metricReader.forceFlush() in Node.js) and somehow not end up with duplicate-metrics errors in the OTel collector?

E.g. can I somehow use an OTel collector processor to remove duplicates before export? My main problem is that the duplicate errors create noise in the otelcol_exporter_send_failed_metric_points_total metric, which I use to detect lost metrics.

github-actions[bot] commented 2 hours ago

Pinging code owners:

dashpole commented 1 hour ago

@psx95, I know you looked into this recently. Can you respond?

Also @AkselAllas, can you share more about your setup? Is your application sending to a collector, or directly to Google Cloud?

dashpole commented 1 hour ago

@AkselAllas can you share your collector config? Since Cloud Monitoring can only accept points every 5 seconds, you will need to aggregate over time to avoid errors. Something like https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/intervalprocessor should be what you need, but it is currently listed as under development.
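As a sketch, enabling the interval processor looks like the following (per its README at the time of writing; since the component is under development, field names may change):

```yaml
processors:
  interval:
    # Aggregate incoming points and emit the latest value for each series
    # once per interval, instead of forwarding every received point.
    interval: 60s
```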

AkselAllas commented 45 minutes ago

@dashpole How would the linked processor interact with my 10 s batch timeout? I have e.g.

processors:
  batch:
    # batch metrics before sending to reduce API usage
    send_batch_max_size: 200
    send_batch_size: 200
    timeout: 10s
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 20
  transform:
    metric_statements:
      - context: datapoint
        statements:
          - set(attributes["exported_location"], attributes["location"])
          - delete_key(attributes, "location")
          - set(attributes["exported_cluster"], attributes["cluster"])
          - delete_key(attributes, "cluster")
          - set(attributes["exported_namespace"], attributes["namespace"])
          - delete_key(attributes, "namespace")
          - set(attributes["exported_job"], attributes["job"])
          - delete_key(attributes, "job")
          - set(attributes["exported_instance"], attributes["instance"])
          - delete_key(attributes, "instance")
          - set(attributes["exported_project_id"], attributes["project_id"])
          - delete_key(attributes, "project_id")
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, transform, batch]
      exporters: [googlemanagedprometheus]
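For reference, if the interval processor were used with a pipeline like the one above, it could be slotted in before batching, so duplicates are collapsed before points reach the exporter. A sketch (assumes the interval processor is available in your collector build; the 15s value is illustrative and only needs to exceed Cloud Monitoring's 5 s minimum point spacing):

```yaml
processors:
  interval:
    interval: 15s
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, transform, interval, batch]
      exporters: [googlemanagedprometheus]
```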