open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Add an S3 exporter #2835

Closed rakyll closed 1 year ago

rakyll commented 3 years ago

Various OTel users want to export raw telemetry data for long-term storage and analysis. We should add an S3 exporter that exports the incoming OTLP data points to sharded S3 objects in a bucket. Sharding can be done per telemetry type (metrics, traces, logs) and by time (with the window configured by the user). Optionally, we could also consider serializing into Ion at the collector.

(You can assign this issue to me.)

anuraaga commented 3 years ago

@rakyll What format are you thinking for storage? There's some discussion on https://github.com/open-telemetry/opentelemetry-specification/issues/1443 but I don't think we have a good format defined for stored traces yet.

bogdandrutu commented 3 years ago

This could also be a plugin for the fileexporter, correct? You would just write into a remote "file".

rakyll commented 3 years ago

Once https://github.com/open-telemetry/opentelemetry-specification/issues/1443 is addressed, we should follow the specification there. Otherwise, the format needs discussion/proposal.

@bogdandrutu It's a good idea. Given it's vendor-specific, my initial suggestion was a standalone exporter. But, we can ask other vendors to implement support for their services and provide the functionality as a part of the fileexporter.

sirianni commented 3 years ago

> What format are you thinking for storage?

Should it be a design goal to pick a format (Parquet, ORC, etc.) (or have a pluggable format) that allows for the data to be queried "in-place" by a system like AWS Athena?

emeraldbay commented 3 years ago

Given the various data formats out there, can we first start by building a few popular formats like Parquet/ORC/JSON?

emeraldbay commented 3 years ago

Initial proposal for the S3 exporter design requirements:

Exporter features:

1. Support a metrics exporter to start; add logs and traces support later.

File output format features:

1. Use Snappy as the default compression codec.
2. Support Parquet and compressed JSON formats to start; add ORC support later.
3. Make the output file format configurable through the exporter config.
4. Make the output schema configurable through a JSON config file.
5. Support specifying the mapping from pdata.Metrics fields to the output schema through the config file.

S3 uploader features:

1. The user should specify the S3 bucket name, the S3 key prefix, the output file name prefix, and the time partition granularity (hourly or per minute). The final S3 key will be in the format: s3://s3_bucket_name/s3_key_prefix/year=XXXX/month=XX/day=XX/hour=XX/{min=XX/}file_name_prefix{random_id}.file_format
2. Support the basic S3 uploader configuration options (PartSize/Concurrency) as described in https://aws.github.io/aws-sdk-go-v2/docs/sdk-utilities/s3/

jrcamp commented 3 years ago

Could this be a generic object storage exporter, using a library like https://github.com/chartmuseum/storage (not advocating this particular one, just as an example)? Authentication methods may vary across cloud providers, so config options may still need to be cloud-specific, but using a library that already abstracts across providers would make it easier to extend to other providers. It even has a local backend, so it would make sense to extend the file exporter as @bogdandrutu suggested: https://github.com/chartmuseum/storage/blob/main/local.go

emeraldbay commented 3 years ago

> Could this be a generic object storage using a library like https://github.com/chartmuseum/storage (not advocating this particular one, just as an example). Authentication methods may vary across cloud providers so maybe config options may still need to be cloud specific or something like that but using a library that already abstracts across the providers may still make it easier to extend to other providers. It even has one for local usage so makes sense to extend file exporter like @bogdandrutu suggested https://github.com/chartmuseum/storage/blob/main/local.go.

But I think this exporter is specific to S3

emeraldbay commented 3 years ago

Initial code to kick off the discussion

https://github.com/emeraldbay/opentelemetry-collector-contrib/commit/faf3268e9bbad8973e776e75685d0c261ae02f5e

jrcamp commented 3 years ago

@emeraldbay why must it be s3 specific?

knvpk commented 3 years ago

This feature would be very nice; it would let us archive large amounts of instrumentation data while keeping it queryable.

knvpk commented 2 years ago

what is the progress on this feature?

atoulme commented 2 years ago

Hey folks, I added a file exporter here: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/6712 This is just outputting pipeline data as JSON.
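For context, a minimal collector configuration using that file exporter might look like the following sketch (the path is illustrative, and a pipeline is assumed to be wired up elsewhere):

```yaml
exporters:
  file:
    # Writes pipeline data as JSON to this local file (example path).
    path: /var/log/otel/telemetry.json
```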

atoulme commented 2 years ago

Because this format is not well suited for s3, I have started working on Parquet support for OpenTelemetry. Here is a draft PR: https://github.com/open-telemetry/opentelemetry-proto/pull/346

Once OpenTelemetry data can be serialized to Parquet, we can create a receiver and an exporter for Parquet files.

jpkrohling commented 2 years ago

I'm open to being a sponsor for the Parquet components on contrib. I think it would be a great addition.

atoulme commented 2 years ago

That is great to hear! I'll try and stick to the approach of creating the component structure first. @jpkrohling please see here: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/6903

atoulme commented 2 years ago

Folks, https://github.com/open-telemetry/opentelemetry-proto/pull/346 is ready for review. It is coming together. It is not a final Parquet mapping by any means, but it gets us to a point where we can experiment.

jmcarp commented 2 years ago

Is the idea that the stub parquet exporter in https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/6903 would eventually learn more formats and destinations, as in https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/6903#discussion_r805486291? Or would it be simpler to have one exporter to export text files to disk, another to export parquet files to disk, another to export parquet/json files to s3, etc.? I would be interested in exporting to both s3 and gcs, and happy to write a patch if it would be helpful.

pdelewski commented 2 years ago

Recently, I started working on an S3 exporter using this work as a baseline. Is anyone actively working on this?

jpkrohling commented 2 years ago

I don't think anyone is working on this. If you decide to work on this, make sure you have a sponsor first, in order for this component to be accepted as part of the contrib repository.

jmacd commented 2 years ago

:wave: I have a personal interest in seeing OTC be able to write telemetry to S3. I feel that not many vendors have this interest, but for a user who wants to store years worth of telemetry and cares about costs, S3 looks like a good path forward.

atoulme commented 2 years ago

I've been working on a Parquet exporter, but I haven't had time recently.

pmm-sumo commented 2 years ago

I volunteer to be a sponsor of the initiative. I think a Parquet exporter would be very useful, and it's great to see the work started, though it's probably a somewhat separate item (essentially a serialization mechanism). So I think it might live in, e.g., some common package and then be referenced by the fileexporter and the S3 exporter.

GrahamLea commented 2 years ago

Hi @pdelewski.

I'm interested in the work you're doing here on the S3 exporter. I'm wondering if you could help me understand a small detail. (I checked your README and couldn't find the answer.)

Considering that an OTel trace exporter is usually receiving batches of ResourceSpans (I think?), not correlated traces, does your S3 exporter do anything in the way it stores telemetry to attempt to correlate spans into traces? Or is it more like just a timestamped dump of what the exporter received?

Thanks.

GrahamLea commented 2 years ago

Re: correlation question ☝️, I think I found the answer in the code. It looks like the S3 key is partly based on the current time when the Object is written to S3. I guess that would mean related spans may share a folder, if saved at around the same time, but may not, and there's no attempt at traceId-based correlation that I could see.

https://github.com/pdelewski/s3-exporter/blob/df7ffe6ecc734a4a92cb78ca608d9c8ae457fee2/exporter/awss3exporter/s3_writer.go#L61-L64

pdelewski commented 2 years ago

> Re: correlation question ☝️, I think I found the answer in the code. It looks like the S3 key is partly based on the current time when the Object is written to S3. I guess that would mean related spans may share a folder, if saved at around the same time, but may not, and there's no attempt at traceId-based correlation that I could see.
>
> https://github.com/pdelewski/s3-exporter/blob/df7ffe6ecc734a4a92cb78ca608d9c8ae457fee2/exporter/awss3exporter/s3_writer.go#L61-L64

Yes, you are correct. I'm not doing any correlation right now, other than writing spans to folders by time. The idea was to start with something small and gather feedback on user needs and improvements.

ghost commented 2 years ago

Would be nice if this would also cover logs export to object storage.

Update: I notice the S3 exporter is at least designed to export logs as well, though I am not a fan of the path scheme; it doesn't feel right to have logs under a metrics path segment.

pdelewski commented 2 years ago

@CatalinAdlerDF object storage is out of scope for now. Regarding the path scheme, currently logs as well as traces will land in the same location (metrics are out of scope for now). We can consider splitting the content into different locations.

codeboten commented 2 years ago

pmm-sumo is no longer involved in the project; a new sponsor is needed before the S3 exporter component can be merged into the contrib repo.

anarwal commented 2 years ago

@codeboten what is the process to become a sponsor?

perk-sumo commented 2 years ago

@anarwal afaik only maintainers can sponsor new exporters.

Information on how to become one can be found here: https://github.com/open-telemetry/community/blob/main/community-membership.md#becoming-a-maintainer

codeboten commented 2 years ago

@anarwal @perk-sumo approvers and maintainers can sponsor new components: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/CONTRIBUTING.md#adding-new-components

nerochiaro commented 1 year ago

I know this component is still marked in-development and not officially in contrib.

But to what extent can it already be used? Would using it risk crashing the entire collector, or would it at worst simply fail to do its job while leaving everything else functioning?

pdelewski commented 1 year ago

The component is in alpha status, but it should be pretty stable. To continue work on it, we need a sponsor, as mentioned above.

MovieStoreGuy commented 1 year ago

Hi all,

First of all, thank you to everyone interested in seeing this implemented. I want to ask: is there a use case for a dedicated AWS S3 exporter, or would users be okay with this being an extension to the file exporter?

The latter has the advantage that it can be implemented sooner and can be rather opinionated about how to handle S3 issues and formats. The former means that other components could benefit from it, by making it an extension.

I am happy to sponsor either; I would prefer the extension approach, but it depends on what everyone involved prefers.

MovieStoreGuy commented 1 year ago

It looks like an earlier discussion leaned towards an abstraction for the fileexporter.

https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/1963

pdelewski commented 1 year ago

@MovieStoreGuy Just to clarify, in the case of fileexporter, s3 storage would be treated as a kind of remote file system, right? Does fileexporter have similar extensions already?

guyfrid commented 1 year ago

> First of all, thank you for everyone being interested in seeing this implemented. I want to ask if there was a use case for a dedicated aws s3 exporter? Or would users be okay if this was an extension to the file exporter?

Hi @MovieStoreGuy! We also have an interest in this exporter. Our use case is to export and store spans in an S3 bucket from a collector running locally inside the Lambda handler. I think (please correct me if I'm wrong) both approaches can work for this use case.

Is there any active work ongoing on this?

pdelewski commented 1 year ago

@guyfrid Work stopped due to the lack of a sponsor.

erez111 commented 1 year ago

I'm also looking for an S3 exporter.

I'm having difficulties using OTLP with the http+json protocol as a metrics exporter.

I thought using the S3 RESTful API directly with http+json could be a quick alternative.

All I can send is http+protobuf or gRPC (which isn't supported by S3).

Has someone implemented it and can send a config.yaml file, or is investigating it as well? Thanks, everyone, for this useful thread.

philipcwhite commented 1 year ago

I'd like to see this as well. We have a lot of data we ship to S3. I think it would be useful. Thanks

atoulme commented 1 year ago

OK, I can sponsor this.

rakyll commented 1 year ago

If anyone wants to work on this, they should feel free to take it. Unfortunately I'm working on something else nowadays.

atoulme commented 1 year ago

Thank you for your help! We will get it done.

sudopras commented 1 year ago

Hi @atoulme, I'd like to know if you need any help with this. I'm interested in contributing and sponsoring (if required).

atoulme commented 1 year ago

No need for a sponsor, this PR is the latest: https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/9979

Once it's in, we have to get the implementation right.

akshadha22 commented 1 year ago

I was unable to use the awss3 exporter with the latest release because of this error:

error decoding 'exporters': unknown type: "awss3" for id: "awss3" (valid values: [alibabacloud_logservice clickhouse googlemanagedprometheus instana jaeger pulsar sapm f5cloud loadbalancing sumologic otlp azuremonitor datadog elasticsearch influxdb mezmo skywalking awscloudwatchlogs awsemf dynatrace googlecloud tanzuobservability tencentcloud_logservice logicmonitor opencensus signalfx splunk_hec googlecloudpubsub loki prometheusremotewrite otlphttp azuredataexplorer carbon file jaeger_thrift logzio kafka prometheus sentry zipkin logging awskinesis awsxray coralogix])

atoulme commented 1 year ago

The s3 exporter is not code complete yet from what I understand. We have just merged the skeleton of the exporter.

pdelewski commented 1 year ago

> The s3 exporter is not code complete yet from what I understand. We have just merged the skeleton of the exporter.

That's true. There is a second PR with the implementation, https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/10000; however, it requires some work, as interfaces have changed over the last few months.
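For anyone waiting on the implementation PR, here is a sketch of how the awss3 exporter might eventually be configured. The field names are an assumption based on this discussion and the in-progress PRs and may not match what is finally merged:

```yaml
exporters:
  awss3:
    s3uploader:
      region: us-east-1              # assumed field names; check the merged README
      s3_bucket: my-telemetry-bucket
      s3_prefix: otel
      s3_partition: minute           # hour or minute granularity, per the proposal above

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [awss3]
```

Note that, as discussed above, the component also needs to be compiled into your collector distribution before this configuration can resolve.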

miguelb-gk commented 1 year ago

Any idea when this will make its way in? We are also seeing the same "unknown type" error.