open-telemetry / opentelemetry-lambda

Create your own Lambda Layer in each OTel language using this starter code. Add the Lambda Layer to your Lambda function to get tracing with OpenTelemetry.
https://opentelemetry.io
Apache License 2.0

OTEL Python does not always flush metrics to awsemf #851

Open sarwaan001 opened 1 year ago

sarwaan001 commented 1 year ago

Describe the bug: The OTel Python layer does not always flush metrics at the end of a Lambda invocation.

Steps to reproduce

  1. Deploy a lambda with the following Python code (handler.py):

    """Sample Lambda for testing"""
    from opentelemetry.metrics import get_meter
    from opentelemetry import trace

    trace.get_tracer_provider()
    tracer = trace.get_tracer(__name__)

    meter = get_meter(__name__)

    counter = meter.create_counter(
        name="invocation_counter",
        description="A counter metric",
        unit="invocations",
    )

    def lambda_handler(event, _context):
        """Sample Lambda for testing"""
        counter.add(1)
        return {"status_code": 200}

config.yaml
```yaml
# collector.yaml in the root directory
# Set an environment variable 'OPENTELEMETRY_COLLECTOR_CONFIG_FILE' to '/var/task/collector.yaml'

receivers:
  otlp:
    protocols:
      grpc:
      http:
exporters:
  logging:
    verbosity: detailed
  awsxray:
  awsemf:
    namespace: ${env:OTEL_NAMESPACE}
    dimension_rollup_option: 1
    resource_to_telemetry_conversion:
      enabled: false
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [awsxray]
    metrics:
      receivers: [otlp]
      exporters: [logging, awsemf]
```

Ensure that the following configuration for the lambda is set:
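
The exact console configuration from the original report is not reproduced here. As an assumption, it covers at least the environment variables referenced by the collector.yaml above, plus the wrapper variable the OTel Python layer typically requires. A hedged boto3 sketch, with placeholder values:

```python
"""Sketch: set the environment variables that collector.yaml above relies on.
Function name, wrapper path, and namespace value are placeholders/assumptions."""
import boto3

lambda_client = boto3.client("lambda")

# Note: this call replaces the function's entire environment variable map,
# so include any other variables the function already needs.
lambda_client.update_function_configuration(
    FunctionName="<insert lambda name or arn>",
    Environment={
        "Variables": {
            # Wrapper script the OTel Python layer typically requires
            "AWS_LAMBDA_EXEC_WRAPPER": "/opt/otel-instrument",
            # Points the embedded collector at the config shipped with the code
            "OPENTELEMETRY_COLLECTOR_CONFIG_FILE": "/var/task/collector.yaml",
            # Referenced by the awsemf exporter via ${env:OTEL_NAMESPACE}
            "OTEL_NAMESPACE": "SampleNamespace",
        }
    },
)
```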

Ensure the lambda has the following permissions:
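
The specific permissions from the original report are not included here. As an assumption, the awsxray exporter needs X-Ray write access and the awsemf exporter needs CloudWatch Logs write access (it emits metrics as EMF-formatted log events). A hedged boto3 sketch that attaches such an inline policy to the function's execution role (role and policy names are placeholders):

```python
"""Sketch: grant the execution role the permissions the awsxray and awsemf
exporters typically need. Role and policy names are placeholders."""
import json

import boto3

iam_client = boto3.client("iam")

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # awsxray exporter sends trace segments to X-Ray
            "Effect": "Allow",
            "Action": ["xray:PutTraceSegments", "xray:PutTelemetryRecords"],
            "Resource": "*",
        },
        {
            # awsemf exporter writes metrics as EMF events via CloudWatch Logs
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents",
                "logs:DescribeLogStreams",
            ],
            "Resource": "*",
        },
    ],
}

iam_client.put_role_policy(
    RoleName="<lambda execution role name>",
    PolicyName="otel-exporters-access",
    PolicyDocument=json.dumps(policy_document),
)
```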

  1. Obtain the lambda ARN.

  2. Ensure that you are logged in to the AWS CLI.

  3. Create the following pytest file (test.py), replacing the lambda ARN with the ARN of the lambda that was just created:

    """
    Tests the following Lambda by invoking the lambda 100 times and expecting the counter to return 100.
    """
    import boto3
    import json
    from datetime import datetime
    import time
    def test_sample_lambda():
    lambda_arn = "<insert lambda arn>"
    
    lambda_client = boto3.client('lambda')
    event = json.dumps({})
    
    start_time = datetime.now()
    
    for i in range(100):
        response = lambda_client.invoke(
            FunctionName=lambda_arn,
            InvocationType='Event',
            LogType='None',
            Payload=event
        )
        assert response['StatusCode'] == 202
    
    # Wait 2 minutes for metrics to propagate + wait for last lambda
    time.sleep(2*60 + 2)
    
    cloudwatch_client = boto3.client('cloudwatch')
    
    metric_data = cloudwatch_client.get_metric_data(
        MetricDataQueries = [
            {
                'Id': 'integration_test',
                'MetricStat': {
                    'Metric': {
                        'Namespace': "SampleNamespace",
                        'MetricName': "invocation_counter",
                        'Dimensions': [{'Name': 'OTelLib', 'Value': 'handler'}]
                    },
                    'Period': 300,
                    'Stat': "Sum",
                }
            }
        ],
        StartTime=start_time,
        EndTime=datetime.now(),
    )
    
    otel_values = sum(metric_data['MetricDataResults'][0]['Values'])
    
    assert otel_values == 100

    ensure you have boto3 installed

  4. Run pytest.

What did you expect to see? There should be 100 values in CloudWatch, and pytest should pass.

What did you see instead? Fewer than 100 values are sent to CloudWatch; on warm Lambdas the count sometimes reaches 100 and the test passes.

What version of collector/language SDK did you use? arn:aws:lambda:us-east-1:901920570463:layer:aws-otel-python-amd64-ver-1-18-0:1

What language layer did you use? Python

Additional context: I believe the Lambda layer sometimes does not flush EMF metrics before the Lambda execution environment is frozen.
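
One possible workaround (a sketch, not behavior documented by the layer) is to force-flush metrics from the handler itself before returning, assuming the active provider is the SDK MeterProvider that the layer configures:

```python
"""Sketch of a possible workaround: flush metrics before the handler returns.
Assumes the active provider is the SDK MeterProvider configured by the layer."""
from opentelemetry.metrics import get_meter, get_meter_provider

meter = get_meter(__name__)
counter = meter.create_counter(
    name="invocation_counter",
    description="A counter metric",
    unit="invocations",
)


def lambda_handler(event, _context):
    counter.add(1)
    # Block until pending metrics have been pushed to the collector,
    # so they are exported before the execution environment is frozen.
    get_meter_provider().force_flush()
    return {"status_code": 200}
```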

stevemao commented 7 months ago

I do not see anything going to awsemf at all. I am able to see the logs when using the logging exporter with the same code.

serkan-ozal commented 1 week ago

Hi @sarwaan001, I see that you set the flush timeout to 900 ms, and I think this might not be enough (for functions with a small memory limit) on cold start, because the total flush timeout is shared: traces are flushed first, then metrics.