open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

New component: AWS S3 Receiver #30750

Open adcharre opened 5 months ago

adcharre commented 5 months ago

The purpose and use-cases of the new component

The S3 receiver will allow the retrieval and processing of telemetry data previously stored in S3 by the AWS S3 Exporter. This makes it possible to restore cold-stored data from S3 and investigate issues that fall outside the retention window of our observability service provider.

Example configuration for the component

receivers:
  awss3:
    s3downloader:
      s3_bucket: abucket
      s3_prefix: tenant_a
      s3_partition: minute
    starttime: "2024-01-13 15:00"
    endtime: "2024-01-21 15:00"

Telemetry data types supported

Is this a vendor-specific component?

Code Owner(s)

adcharre

Sponsor (optional)

@atoulme

Additional context

No response

atoulme commented 5 months ago

I'm interested to learn more. Would this be something you'd be able to checkpoint on?

adcharre commented 5 months ago

Would this be something you'd be able to checkpoint on?

@atoulme certainly, it's something I'm actively looking into at the moment, so it makes sense to get a second opinion on the best way to implement this and hopefully get it accepted. How best to organise?

atoulme commented 4 months ago

For all components, we tend to work with folks through CONTRIBUTING.md. The question I asked you earlier is in earnest - one of the thorny issues around a component reading from a remote source is to have a checkpoint mechanism that allows you to know where you stopped. We can use the storage extension for that purpose.
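
A minimal sketch of what that checkpointing could look like in Go, against the collector's experimental storage API (storage.Client); the key name and the timestamp encoding here are illustrative assumptions, not the receiver's actual design:

    // A minimal checkpointing sketch against the collector's experimental
    // storage API. The key name and the timestamp encoding are illustrative.
    package awss3receiver

    import (
        "context"
        "time"

        "go.opentelemetry.io/collector/extension/experimental/storage"
    )

    const checkpointKey = "awss3.last_processed" // hypothetical key name

    // loadCheckpoint returns the time up to which data has already been
    // ingested, or the zero time if no checkpoint has been written yet.
    func loadCheckpoint(ctx context.Context, client storage.Client) (time.Time, error) {
        raw, err := client.Get(ctx, checkpointKey)
        if err != nil || raw == nil {
            return time.Time{}, err
        }
        var t time.Time
        err = t.UnmarshalText(raw)
        return t, err
    }

    // saveCheckpoint records progress so a restarted collector resumes
    // where it stopped instead of re-reading the whole time range.
    func saveCheckpoint(ctx context.Context, client storage.Client, t time.Time) error {
        raw, err := t.MarshalText()
        if err != nil {
            return err
        }
        return client.Set(ctx, checkpointKey, raw)
    }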

I am happy to sponsor this component if you'd like to work on it.

adcharre commented 4 months ago

Ahh, I understand now! Thank you for the clarification, and yes, that is an issue I have been thinking about: how best to signal that ingest is finished. I'll look into the storage extension and get a PR up with the skeleton of the receiver.
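
As a concrete sketch of that wiring, the existing file_storage extension could hold the checkpoint. The `storage` setting on the awss3 receiver is hypothetical here, since the receiver does not exist yet; file_storage and its `directory` option are the real, existing extension:

    extensions:
      file_storage:
        directory: /var/lib/otelcol/storage

    receivers:
      awss3:
        storage: file_storage   # hypothetical: names the extension holding the checkpoint
        s3downloader:
          s3_bucket: abucket
          s3_prefix: tenant_a
          s3_partition: minute
        starttime: "2024-01-13 15:00"
        endtime: "2024-01-21 15:00"

    exporters:
      debug:

    service:
      extensions: [file_storage]
      pipelines:
        traces:
          receivers: [awss3]
          exporters: [debug]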

rhysxevans commented 3 months ago

Hi, apologies I may be hijacking this thread.

Has there been any thought around integrating the S3 receiver with SQS and S3 event notifications?

Our use case is that we cannot directly write to an OTEL receiver in all cases; however, we can write to an S3 bucket. We would then like the object event notification to notify SQS, where an OTEL collector (or a set of them) would be listening, and on notification fetch the uploaded file and output it to the OTLP backend store. We could then also retain the source data in S3 and leverage the current features of this receiver to replay data if required.

An example sender may look something like:

    receivers:
      otlp:
        protocols:
          http:
            endpoint: 0.0.0.0:4318
            cors:
              allowed_origins:
                - "http://*"
                - "https://*"

    exporters:
      awss3:
        s3uploader:
          region: us-west-2
          s3_bucket: "tempo-traces-bucket"
          s3_prefix: 'metric'
          s3_partition: 'minute'

    processors:
      batch:
        send_batch_size: 10000
        timeout: 30s
      resource:
        attributes:
        - key: service.instance.id
          from_attribute: k8s.pod.uid
          action: insert
      memory_limiter:
        check_interval: 5s
        limit_mib: 200

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, resource, batch]
          exporters: [awss3]

The receiver could possibly look something like:

    receivers:
      awss3:
        sqs:
          queue_url: "https://sqs.us-west-1.amazonaws.com/<account_id>/queue"

    exporters:
      otlp:
        endpoint: 'http://otlp-endpoint:4317'

    processors:
      batch:
        send_batch_size: 10000
        timeout: 30s
      memory_limiter:
        check_interval: 5s
        limit_mib: 200

    service:
      pipelines:
        traces:
          receivers: [awss3]
          processors: [memory_limiter, batch]
          exporters: [otlp]

Thoughts?

S3 Event Notifications: https://docs.aws.amazon.com/AmazonS3/latest/userguide/EventNotifications.html
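
To sketch what that notification path could look like (this is not part of the proposed receiver; the event-schema subset is an assumption based on the linked docs), an SQS long-poll loop with aws-sdk-go-v2 might be:

    // Long-polls SQS for S3 event notifications and prints the bucket/key
    // of each newly uploaded object. Error handling is kept minimal.
    package main

    import (
        "context"
        "encoding/json"
        "fmt"
        "log"

        "github.com/aws/aws-sdk-go-v2/aws"
        "github.com/aws/aws-sdk-go-v2/config"
        "github.com/aws/aws-sdk-go-v2/service/sqs"
    )

    // s3Event mirrors only the notification fields a receiver would need.
    type s3Event struct {
        Records []struct {
            S3 struct {
                Bucket struct {
                    Name string `json:"name"`
                } `json:"bucket"`
                Object struct {
                    Key string `json:"key"`
                } `json:"object"`
            } `json:"s3"`
        } `json:"Records"`
    }

    func main() {
        ctx := context.Background()
        cfg, err := config.LoadDefaultConfig(ctx)
        if err != nil {
            log.Fatal(err)
        }
        client := sqs.NewFromConfig(cfg)
        queueURL := "https://sqs.us-west-1.amazonaws.com/<account_id>/queue"

        for {
            out, err := client.ReceiveMessage(ctx, &sqs.ReceiveMessageInput{
                QueueUrl:            aws.String(queueURL),
                MaxNumberOfMessages: 10, // batch up to 10 notifications
                WaitTimeSeconds:     20, // long-poll to reduce empty responses
            })
            if err != nil {
                log.Fatal(err)
            }
            for _, msg := range out.Messages {
                var ev s3Event
                if err := json.Unmarshal([]byte(*msg.Body), &ev); err != nil {
                    continue // not an S3 event notification
                }
                for _, rec := range ev.Records {
                    // A receiver would fetch and decode this object here.
                    fmt.Println(rec.S3.Bucket.Name, rec.S3.Object.Key)
                }
                // Delete after processing so the message is not redelivered.
                _, _ = client.DeleteMessage(ctx, &sqs.DeleteMessageInput{
                    QueueUrl:      aws.String(queueURL),
                    ReceiptHandle: msg.ReceiptHandle,
                })
            }
        }
    }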

szechyjs commented 3 months ago

It would be nice to be able to have this run continuously instead of specifying start/end times. This would help with shipping traces across clusters/accounts.

flowchart LR
  subgraph env1
  app1 --> env1-collector
  app2 --> env1-collector
  end
  env1-collector --> S3[(S3)]
  subgraph env2
  app3 --> env2-collector
  app4 --> env2-collector
  end
  env2-collector --> S3
  subgraph shared-env
  S3 --> shared-collector
  end

awesomeinsight commented 3 months ago

It would be nice to be able to have this run continuously instead of specifying start/end times. This would help with shipping traces across clusters/accounts.

Fully agree.

We also have scenarios where a receiver should constantly process new uploads (from the S3 exporter) to an S3 bucket, i.e. running without starttime and endtime and instead keeping a checkpoint of where it last stopped reading, as in the sketch below.
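
A minimal sketch of that continuous mode, assuming the checkpoint is just the last processed key: ListObjectsV2 returns keys in lexicographic order, and the exporter's time-based partitioning makes newer data sort after older data, so passing the checkpoint as StartAfter skips everything already read. The bucket, prefix, and poll interval are illustrative, and pagination is omitted:

    // Continuously polls an S3 prefix for new objects, resuming from the
    // last processed key. Pagination via ContinuationToken is omitted.
    package main

    import (
        "context"
        "fmt"
        "log"
        "time"

        "github.com/aws/aws-sdk-go-v2/aws"
        "github.com/aws/aws-sdk-go-v2/config"
        "github.com/aws/aws-sdk-go-v2/service/s3"
    )

    func main() {
        ctx := context.Background()
        cfg, err := config.LoadDefaultConfig(ctx)
        if err != nil {
            log.Fatal(err)
        }
        client := s3.NewFromConfig(cfg)

        lastKey := "" // would be loaded from the storage extension on startup
        for {
            in := &s3.ListObjectsV2Input{
                Bucket: aws.String("abucket"),
                Prefix: aws.String("tenant_a/"),
            }
            if lastKey != "" {
                in.StartAfter = aws.String(lastKey) // skip keys already processed
            }
            out, err := client.ListObjectsV2(ctx, in)
            if err != nil {
                log.Fatal(err)
            }
            for _, obj := range out.Contents {
                // A receiver would download and decode the object here.
                fmt.Println(*obj.Key)
                lastKey = *obj.Key // persist as the new checkpoint
            }
            time.Sleep(30 * time.Second) // poll interval; tune as needed
        }
    }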

adcharre commented 3 months ago

@awesomeinsight / @rhysxevans - I see no reason why the receiver could not be expanded to cover the scenario you suggest. At the moment I'm focusing on getting the initial implementation merged, which covers my main use case of restoring data between a set of dates.

worksForM3 commented 2 months ago

@awesomeinsight / @rhysxevans - I see no reason why the receiver could not be expanded to cover the scenario you suggest. At the moment I'm focusing on getting the initial implementation merged, which covers my main use case of restoring data between a set of dates.

If the receiver were expanded at some point to constantly process new uploads made by the S3 exporter, could it be used to buffer data independently of a file system? The idea would be to have an alternative to https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter/exporterhelper#persistent-queue.

The idea is to have a resilient setup of exporters and receivers (with S3 in between as a buffer) that runs stateless, since it would not require any filesystem to buffer data to disk.

Do you think a setup like this would make sense?
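
For reference, this is roughly the file-based setup such an S3 buffer would replace: the existing persistent queue ties the exporter to a local directory via the file_storage extension.

    extensions:
      file_storage:
        directory: /var/lib/otelcol/queue   # the disk dependency S3 would remove

    exporters:
      otlp:
        endpoint: http://otlp-endpoint:4317
        sending_queue:
          enabled: true
          storage: file_storage             # queued batches survive restarts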

github-actions[bot] commented 2 days ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.