open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Add "aws.ecs.task.id" detection to "resourcedetection" processor #8274

Open mkielar opened 2 years ago

mkielar commented 2 years ago

Is your feature request related to a problem? Please describe. I'm using OTEL as a sidecar with ECS services. I use it to parse and filter the StatsD metrics that AppMesh/Envoy produces, and then I use the emfexporter to put the metrics into CloudWatch via a CloudWatch log stream. This mostly works. However, when my ECS service scales to multiple instances, I often see the following error in my logs:

2022-03-07T10:05:33.439Z    warn    cwlogs@v0.43.0/cwlog_client.go:84   cwlog_client: Error occurs in PutLogEvents, will search the next token and retry the request    
{
    "kind": "exporter",
    "name": "awsemf/statsd/envoy_metrics",
    "error": "InvalidSequenceTokenException: The given sequenceToken is invalid. The next expected sequenceToken is: 49626189108498028449043455519612405404976381845984773650\n{\n  RespMetadata: {\n    StatusCode: 400,\n    RequestID: \"a24432dd-4d17-44ae-b245-3877cfffabb7\"\n  },\n  ExpectedSequenceToken: \"49626189108498028449043455519612405404976381845984773650\",\n  Message_: \"The given sequenceToken is invalid. The next expected sequenceToken is: 49626189108498028449043455519612405404976381845984773650\"\n}"
}

This is caused by a race condition: two tasks now write to the same log stream in CloudWatch, and they corrupt each other's sequenceToken, which the AWS API requires to put logs to CloudWatch.

Describe the solution you'd like I was hoping to additionally configure the resourcedetection processor:

  "resource":
    "attributes":
    - "action": "insert"
      "from_attribute": "aws.ecs.task.id"
      "key": "TaskId"
    - "action": "insert"
      "from_attribute": "aws.ecs.task.arn"
      "key": "TaskARN"
  "resourcedetection":
    "detectors":
    - "env"
    - "ecs"

so that I would be able to use the {TaskId} dynamic field when configuring emfexporter, like this:

"awsemf/statsd/envoy_metrics":
    "dimension_rollup_option": "NoDimensionRollup"
    "log_group_name": "/aws/ecs/dev/hello-world"
    "log_stream_name": "emf/otel/statsd/envoy_metrics/{TaskId}"
    "namespace": "dev/AppMeshEnvoy"

However, when I run my service, I can see that only the following attributes are detected by the resourcedetection processor:

2022-03-07T13:11:17.808Z    info    internal/resourcedetection.go:139   detected resource information   
{
    "kind": "processor",
    "name": "resourcedetection",
    "resource": {
        "aws.ecs.cluster.arn": "arn:aws:ecs:eu-west-1:506501033716:cluster/dev",
        "aws.ecs.launchtype": "fargate",
        "aws.ecs.task.arn": "arn:aws:ecs:eu-west-1:506501033716:task/dev/1a8d528834e046b183d4913feeaa16bc",
        "aws.ecs.task.family": "dev-hello-world",
        "aws.ecs.task.revision": "43",
        "cloud.account.id": "506501033716",
        "cloud.availability_zone": "eu-west-1a",
        "cloud.platform": "aws_ecs",
        "cloud.provider": "aws",
        "cloud.region": "eu-west-1"
    }
}

Describe alternatives you've considered I tried to use TaskARN, but that just led to no log stream being created at all. Most likely the reason is that task ARNs contain characters that are illegal in a log stream name (log stream names cannot contain ":" or "*", and ARNs contain colons), so the emfexporter fails silently, unable to create one.

Additional context N/A.

mkielar commented 2 years ago

I have found a working workaround: I configure the resource processor with an extract action, like this:

  "resource":
    "attributes":
    - "action": "extract"
      "key": "aws.ecs.task.arn"
      "pattern": "^arn:aws:ecs:(?P<Region>.*):(?P<AccountId>.*):task/(?P<ClusterName>.*)/(?P<TaskId>.*)$"

Then, I configure emfexporter like this:

"awsemf/statsd/envoy_metrics":
    "dimension_rollup_option": "NoDimensionRollup"
    "log_group_name": "/aws/ecs/dev/hello-world"
    "log_stream_name": "emf/otel/statsd/envoy_metrics/{TaskId}"
    "namespace": "dev/AppMeshEnvoy"

This makes the emfexporter create a separate CloudWatch log stream for each task.
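
For reference, here is a minimal standalone Go sketch (not part of the collector; the ARN and account ID are made up) showing what the extract pattern above pulls out of a Fargate task ARN:

package main

import (
    "fmt"
    "regexp"
)

// Same pattern as in the resource processor workaround above.
var taskARNPattern = regexp.MustCompile(`^arn:aws:ecs:(?P<Region>.*):(?P<AccountId>.*):task/(?P<ClusterName>.*)/(?P<TaskId>.*)$`)

func main() {
    arn := "arn:aws:ecs:eu-west-1:123456789012:task/dev/1a8d528834e046b183d4913feeaa16bc"
    match := taskARNPattern.FindStringSubmatch(arn)
    if match == nil {
        fmt.Println("ARN did not match")
        return
    }
    // Print each named capture group and its value.
    for i, name := range taskARNPattern.SubexpNames() {
        if name != "" {
            fmt.Printf("%s = %s\n", name, match[i])
        }
    }
    // Output:
    // Region = eu-west-1
    // AccountId = 123456789012
    // ClusterName = dev
    // TaskId = 1a8d528834e046b183d4913feeaa16bc
}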

Still, not having to do this would be the preferred way, especially since the emfexporter docs explicitly state that both TaskId and aws.ecs.task.id are accepted for {TaskId} references in the log stream name.

jpkrohling commented 2 years ago

cc @jrcamp @pmm-sumo @anuraaga @dashpole

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

mkielar commented 1 year ago

Bad, bad bot :P The issue is still relevant, and I'd still like to see this implemented. I believe explicitly exporting aws.ecs.task.id makes sense and improves the internal consistency of the whole package, because it lets components like the emfexporter cooperate without extra configuration steps.

tnsardesai commented 1 year ago

We also have a similar setup with AppMesh + ECS Fargate and are seeing the same issue. Thanks for also including the workaround, @mkielar. I agree that aws.ecs.task.id should be explicitly exported, especially because it is listed here: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/awsemfexporter#exporter-configuration.

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

mkielar commented 1 year ago

Bad bot. The issue is still relevant and everything I wrote on Nov 16th still holds. Pinging @open-telemetry/collector-contrib-triagers, as suggested (hahah, it doesn't work :P).

github-actions[bot] commented 1 year ago

Pinging code owners for processor/resourcedetection: @Aneurysm9 @dashpole. See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

mkielar commented 1 year ago

I guess I'm just gonna keep it alive then ;)

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

mkielar commented 1 year ago

Bad bot :P Pinging code owners: @Aneurysm9 @dashpole

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

jpkrohling commented 1 year ago

Code owners, are you able to provide feedback on this one?

cc @Aneurysm9 @dashpole

bryan-aguilar commented 1 year ago

I don't see any issues with this. Having aws.ecs.task.id could be useful in more than just this specific use case and would prevent users from having to do the extraction themselves as mentioned earlier.

@mkielar, would you be willing to contribute a PR for this? I would expect the implementation to extract the task ID out of the task ARN, similar to your regex.
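
To illustrate, here is a hedged sketch (not the actual resourcedetection code; the helper name is made up) of how the detector could derive the task ID from the task ARN it already collects, since the ID is the final path segment of the ARN:

package main

import (
    "fmt"
    "strings"
)

// taskIDFromARN returns the last "/"-separated segment of an ECS task ARN.
// This covers both the old ARN format (.../task/<task-id>) and the new one
// (.../task/<cluster-name>/<task-id>). It returns "" if no segment is found.
func taskIDFromARN(taskARN string) string {
    i := strings.LastIndex(taskARN, "/")
    if i < 0 || i == len(taskARN)-1 {
        return ""
    }
    return taskARN[i+1:]
}

func main() {
    arn := "arn:aws:ecs:eu-west-1:123456789012:task/dev/1a8d528834e046b183d4913feeaa16bc"
    fmt.Println(taskIDFromARN(arn)) // prints 1a8d528834e046b183d4913feeaa16bc
}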

github-actions[bot] commented 10 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

mkielar commented 9 months ago

Hi @bryan-aguilar, I finally found time to learn Go and made an attempt at implementing this feature. The PR is here: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/29602. Please have a look.

Also, pinging code owners: @Aneurysm9 @dashpole

github-actions[bot] commented 7 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 5 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.