open-telemetry / opentelemetry-lambda

Create your own Lambda Layer in each OTel language using this starter code. Add the Lambda Layer to your Lambda function to get tracing with OpenTelemetry.
https://opentelemetry.io
Apache License 2.0

OTEL Python does not always flush metrics to awsemf #851

Open sarwaan001 opened 1 year ago

sarwaan001 commented 1 year ago

Describe the bug: The OTel Python layer does not always flush metrics at the end of a Lambda invocation.

Steps to reproduce

  1. Deploy a lambda with the following Python code (handler.py):

    """Sample Lambda for testing"""
    from opentelemetry.metrics import get_meter
    from opentelemetry import trace

    trace.get_tracer_provider()
    tracer = trace.get_tracer(__name__)

    meter = get_meter(__name__)

    counter = meter.create_counter(
        name="invocation_counter",
        description="A counter metric",
        unit="invocations",
    )

    def lambda_handler(event, _context):
        """Sample Lambda for testing"""
        counter.add(1)
        return {"status_code": 200}

config.yaml
```yaml
# collector.yaml in the root directory
# Set an environment variable 'OPENTELEMETRY_COLLECTOR_CONFIG_FILE' to '/var/task/collector.yaml'

receivers:
  otlp:
    protocols:
      grpc:
      http:
exporters:
  logging:
    verbosity: detailed
  awsxray:
  awsemf:
    namespace: ${env:OTEL_NAMESPACE}
    dimension_rollup_option: 1
    resource_to_telemetry_conversion:
      enabled: false
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [awsxray]
    metrics:
      receivers: [otlp]
      exporters: [logging, awsemf]
```

Ensure that the following configuration for the lambda is set:
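
The exact console configuration from the original report is not reproduced here. As an assumption, it covers at least the environment variables referenced by the collector.yaml above, plus the wrapper variable the OTel Python layer typically requires. A hedged boto3 sketch, with placeholder values:

```python
"""Sketch: set the environment variables that collector.yaml above relies on.
Function name, wrapper path, and namespace value are placeholders/assumptions."""
import boto3

lambda_client = boto3.client("lambda")

# Note: this call replaces the function's entire environment variable map,
# so include any other variables the function already needs.
lambda_client.update_function_configuration(
    FunctionName="<insert lambda name or arn>",
    Environment={
        "Variables": {
            # Wrapper script the OTel Python layer typically requires
            "AWS_LAMBDA_EXEC_WRAPPER": "/opt/otel-instrument",
            # Points the embedded collector at the config shipped with the code
            "OPENTELEMETRY_COLLECTOR_CONFIG_FILE": "/var/task/collector.yaml",
            # Referenced by the awsemf exporter via ${env:OTEL_NAMESPACE}
            "OTEL_NAMESPACE": "SampleNamespace",
        }
    },
)
```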

Ensure the lambda has the following permissions:
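
The specific permissions from the original report are not included here. As an assumption, the awsxray exporter needs X-Ray write access and the awsemf exporter needs CloudWatch Logs write access (it emits metrics as EMF-formatted log events). A hedged boto3 sketch that attaches such an inline policy to the function's execution role (role and policy names are placeholders):

```python
"""Sketch: grant the execution role the permissions the awsxray and awsemf
exporters typically need. Role and policy names are placeholders."""
import json

import boto3

iam_client = boto3.client("iam")

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # awsxray exporter sends trace segments to X-Ray
            "Effect": "Allow",
            "Action": ["xray:PutTraceSegments", "xray:PutTelemetryRecords"],
            "Resource": "*",
        },
        {
            # awsemf exporter writes metrics as EMF events via CloudWatch Logs
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents",
                "logs:DescribeLogStreams",
            ],
            "Resource": "*",
        },
    ],
}

iam_client.put_role_policy(
    RoleName="<lambda execution role name>",
    PolicyName="otel-exporters-access",
    PolicyDocument=json.dumps(policy_document),
)
```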

  1. Obtain the lambda ARN.

  2. Ensure that you are logged in to the AWS CLI.

  3. Create the following pytest file (test.py), replacing the lambda ARN with the ARN of the lambda that was just created:

    """
    Tests the following Lambda by invoking the lambda 100 times and expecting the counter to return 100.
    """
    import boto3
    import json
    from datetime import datetime
    import time
    def test_sample_lambda():
    lambda_arn = "<insert lambda arn>"
    
    lambda_client = boto3.client('lambda')
    event = json.dumps({})
    
    start_time = datetime.now()
    
    for i in range(100):
        response = lambda_client.invoke(
            FunctionName=lambda_arn,
            InvocationType='Event',
            LogType='None',
            Payload=event
        )
        assert response['StatusCode'] == 202
    
    # Wait 2 minutes for metrics to propagate + wait for last lambda
    time.sleep(2*60 + 2)
    
    cloudwatch_client = boto3.client('cloudwatch')
    
    metric_data = cloudwatch_client.get_metric_data(
        MetricDataQueries = [
            {
                'Id': 'integration_test',
                'MetricStat': {
                    'Metric': {
                        'Namespace': "SampleNamespace",
                        'MetricName': "invocation_counter",
                        'Dimensions': [{'Name': 'OTelLib', 'Value': 'handler'}]
                    },
                    'Period': 300,
                    'Stat': "Sum",
                }
            }
        ],
        StartTime=start_time,
        EndTime=datetime.now(),
    )
    
    otel_values = sum(metric_data['MetricDataResults'][0]['Values'])
    
    assert otel_values == 100

    ensure you have boto3 installed

  4. Run pytest.

What did you expect to see? There should be 100 values in CloudWatch, and pytest should pass.

What did you see instead? Fewer than 100 values are sent to CloudWatch; on warm Lambdas the count sometimes reaches 100 and the test passes.

What version of collector/language SDK did you use? arn:aws:lambda:us-east-1:901920570463:layer:aws-otel-python-amd64-ver-1-18-0:1

What language layer did you use? Python

Additional context: I believe the Lambda layer sometimes does not flush EMF metrics before the Lambda execution environment is frozen.
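
One possible workaround (a sketch, not behavior documented by the layer) is to force-flush metrics from the handler itself before returning, assuming the active provider is the SDK MeterProvider that the layer configures:

```python
"""Sketch of a possible workaround: flush metrics before the handler returns.
Assumes the active provider is the SDK MeterProvider configured by the layer."""
from opentelemetry.metrics import get_meter, get_meter_provider

meter = get_meter(__name__)
counter = meter.create_counter(
    name="invocation_counter",
    description="A counter metric",
    unit="invocations",
)


def lambda_handler(event, _context):
    counter.add(1)
    # Block until pending metrics have been pushed to the collector,
    # so they are exported before the execution environment is frozen.
    get_meter_provider().force_flush()
    return {"status_code": 200}
```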

stevemao commented 7 months ago

I do not see anything going to awsemf at all. I am able to see the logs when using the logging exporter with the same code.

serkan-ozal commented 1 week ago

Hi @sarwaan001, I see that you set the flush timeout to 900 ms, and I think this might not be enough (for functions with a small memory limit) on cold start, because the total flush timeout is shared: traces are flushed first, then metrics.