influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

AWS S3 Output Plugin #15547

Open · IvanoCar opened 1 week ago

IvanoCar commented 1 week ago

Use Case

I would like to open a pull request, so I am looking for input from the community. The use case here is to take metrics from inputs and write them to an S3 bucket under a specific path.

Having data ingested via Telegraf available on S3, used for example as a data lake, is useful because it can serve various analytics purposes and can be considered an enrichment of data already available from other sources.

Expected behavior

I expect files to be written to an S3 bucket and the specified subfolders. Auth can be handled via an AWS IAM user.

Actual behavior

This is currently not supported in Telegraf.

Additional info

The config could look like this:

[[outputs.s3]]
  bucket = "bucketname"
  ## Auth
  access_key = "your-access-key"
  secret_key = "your-secret-key"
  ## Optional: always the same subfolder within the bucket (blank or subfolder path)
  subfolder = ""

  ## Optional: another subfolder based on time (YYYY-MM-DD_HH-MM-SS)
  ## Can be used together with subfolder or alone
  ## Can be: day, hour, minute, second or blank
  time_granularity_subfolder = ""

  ## Data format to output.
  ## Each data format has its own unique set of configuration options, read
  ## more about them here:
  ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_OUTPUT.md
  data_format = "influx"
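
To illustrate the intent (the exact layout would of course be up to the implementation): with subfolder = "metrics" and time_granularity_subfolder = "day", files written on 2024-06-28 would end up under a key prefix like metrics/2024-06-28/ in the bucket.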
srebhan commented 4 days ago

@IvanoCar please test the binary in PR #15569, available once CI has finished the tests, and let me know if that works for you. You should be able to start from this config:

# Send telegraf metrics to file(s) in a remote filesystem
[[outputs.remotefile]]
  ## Remote location according to https://rclone.org/#providers
  ## Check the backend configuration options and specify them in
  ##   <backend type>[,<param1>=<value1>[,...,<paramN>=<valueN>]]:[root]
  ## for example:
  remote = 's3,provider=AWS,access_key_id=your-access-key,secret_access_key=your-secret-key,session_token=your-token,region=eu-north-1:mybucket'

  ## Files to write in the remote location
  ## Each file can be a Golang template for generating the filename from metrics.
  ## See https://pkg.go.dev/text/template for a reference and use the metric
  ## name (`{{.Name}}`), tag values (`{{.Tag "name"}}`), field values
  ## (`{{.Field "name"}}`) or the metric time (`{{.Time}}`) to derive the
  ## filename.
  files = ['{{.Name}}-{{.Time.Format "2006-01-02"}}']

  ## Use batch serialization format instead of line based delimiting.
  ## The batch format allows for the production of non-line-based output formats
  ## and may more efficiently encode metrics.
  # use_batch_format = false

  ## Cache settings
  ## Time to wait for all writes to complete on shutdown of the plugin.
  # final_write_timeout = "10s"

  ## Time to wait between writing to a file and uploading to the remote location
  # cache_write_back = "5s"

  ## Maximum size of the cache on disk (infinite by default)
  # cache_max_size = -1

  ## Data format to output.
  ## Each data format has its own unique set of configuration options, read
  ## more about them here:
  ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_OUTPUT.md
  data_format = "influx"
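
For reference, the 2006-01-02 in the template is Go's reference time, so the example produces one file per metric name and day (e.g. session_data-2024-06-28). Tags can be used the same way, e.g. files = ['{{.Tag "host"}}-{{.Name}}-{{.Time.Format "2006-01-02"}}'] to additionally split files by a host tag, assuming your metrics carry one.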
IvanoCar commented 3 days ago

Hi @srebhan , thanks for your response!

I have tested it out for AWS and noticed a few things:

  1. Adding a folder name in front of the file naming you suggested is not possible (it fails), but I guess this could be resolved by creating the folder in the Write method if it does not exist on the remote. Something like this:

    files = ['sessions/{{.Name}}-{{.Time.Format "2006-01-02"}}']
  2. Auth is not verified in the Connect method; the logs state that the connection was a success, but it later fails during Write.

  3. It is very useful to have the time of the actual metric in the filename, but it would also be useful to have an option (flag) to use the current time (the time of arrival of the metric) in the filename instead.

  4. Rclone does not seem to be as stable as using the aws-sdk directly (which I started implementing); under the same load and the same agent configuration it produces some errors (below).

     On the other hand, it is faster than the aws-sdk (around 80 ms for 2000 metrics vs. 150 ms). The only difference is that I am using outputs.s3 as in my original suggestion.

    Errors:

    2024-06-28T08:31:33Z D! [outputs.remotefile]  Buffer fullness: 1000 / 80000 metrics
    2024-06-28T08:31:35Z I! ERROR : session_data-2023-09-08: Failed to copy: RequestCanceled: request context canceled
    caused by: context canceled
    2024-06-28T08:31:35Z I! ERROR : session_data-2023-09-08: vfs cache: failed to upload try #1, will retry in 10s: vfs cache: failed to transfer file from cache to remote: RequestCanceled: request context canceled
    caused by: context canceled
    2024-06-28T08:31:35Z I! ERROR : session_data-2023-06-07: Failed to copy: RequestCanceled: request context canceled
    caused by: context canceled
    2024-06-28T08:31:35Z I! ERROR : session_data-2023-06-07: vfs cache: failed to upload try #1, will retry in 10s: vfs cache: failed to transfer file from cache to remote: RequestCanceled: request context canceled
    caused by: context canceled
    2024-06-28T08:31:35Z I! ERROR : session_data-2024-02-02: Failed to copy: RequestCanceled: request context canceled
    caused by: context canceled
    2024-06-28T08:31:35Z I! ERROR : session_data-2024-02-02: vfs cache: failed to upload try #1, will retry in 10s: vfs cache: failed to transfer file from cache to remote: RequestCanceled: request context canceled
    caused by: context canceled
    2024-06-28T08:31:35Z I! ERROR : session_data-2023-06-09: Failed to copy: RequestCanceled: request context canceled
    caused by: context canceled
    2024-06-28T08:31:35Z I! ERROR : session_data-2023-06-09: vfs cache: failed to upload try #1, will retry in 10s: vfs cache: failed to transfer file from cache to remote: RequestCanceled: request context canceled
    caused by: context canceled
    2024-06-28T08:31:35Z I! ERROR : session_data-2023-10-03: Failed to copy: RequestCanceled: request context canceled
    caused by: context canceled
    2024-06-28T08:31:35Z I! ERROR : session_data-2023-10-03: vfs cache: failed to upload try #1, will retry in 10s: vfs cache: failed to transfer file from cache to remote: RequestCanceled: request context canceled
    caused by: context canceled
    2024-06-28T08:31:35Z I! ERROR : session_data-2023-06-01: Failed to copy: RequestCanceled: request context canceled
    caused by: context canceled
    2024-06-28T08:31:35Z I! ERROR : session_data-2023-06-01: vfs cache: failed to upload try #1, will retry in 10s: vfs cache: failed to transfer file from cache to remote: RequestCanceled: request context canceled
    caused by: context canceled
    2024-06-28T08:31:35Z I! ERROR : session_data-2024-01-05: Failed to copy: RequestCanceled: request context canceled
    caused by: context canceled
    2024-06-28T08:31:35Z I! ERROR : session_data-2024-01-05: vfs cache: failed to upload try #1, will retry in 10s: vfs cache: failed to transfer file from cache to remote: RequestCanceled: request context canceled
    caused by: context canceled
    2024-06-28T08:31:35Z D! [outputs.remotefile]  Wrote batch of 2000 metrics in 117.708198ms
    2024-06-28T08:31:35Z D! [outputs.remotefile]  Buffer fullness: 1500 / 80000 metrics
    2024-06-28T08:31:37Z D! [outputs.remotefile]  Wrote batch of 2000 metrics in 102.100369ms
    2024-06-28T08:31:37Z D! [outputs.remotefile]  Buffer fullness: 2000 / 80000 metrics
    2024-06-28T08:31:37Z D! [outputs.remotefile]  Wrote batch of 2000 metrics in 87.278875ms
    2024-06-28T08:31:37Z D! [outputs.remotefile]  Buffer fullness: 0 / 80000 metrics
    2024-06-28T08:31:39Z D! [outputs.remotefile]  Wrote batch of 2000 metrics in 105.098917ms

Something like this is also visible in the logs:

    "session_data-2022-06-27": &{c:0xc00022f680 mu:{state:0 sema:0} cond:{noCopy:{} L:0xc0030a3108 notify:{wait:0 notify:0 lock:0 head:<nil> tail:<nil>} checker:824684720448} name:session_data-2022-06-27 opens:0 downloaders:<nil> o:0xc004179a70 fd:<nil> info:{ModTime:{wall:13949824168614839499 ext:387020864947 loc:0xf311c00} ATime:{wall:13949824168614842398 ext:387020867846 loc:0xf311c00} Size:29200 Rs:[{Pos:0 Size:29200}] Fingerprint:18241,2024-06-28 08:32:08.375413308 +0000 UTC,75f1fa334bf15a8e09f82f99c7d7f95d Dirty:true} writeBackID:234 pendingAccesses:0 modified:false beingReset:false},
    "session_data-2023-03-04": &{c:0xc00022f680 mu:{state:0 sema:0} cond:{noCopy:{} L:0xc00319f008 notify:{wait:0 notify:0 lock:0 head:<nil> tail:<nil>} checker:824685752384} name:session_data-2023-03-04 opens:0 downloaders:<nil> o:<nil> fd:<nil> info:{ModTime:{wall:13949824167155078702 ext:385708587798 loc:0xf311c00} ATime:{wall:13949824167155083386 ext:385708592481 loc:0xf311c00} Size:46425 Rs:[{Pos:0 Size:46425}] Fingerprint: Dirty:true} writeBackID:326 pendingAccesses:0 modified:false beingReset:false},
    "session_data-2023-06-15": &{c:0xc00022f680 mu:{state:0 sema:0} cond:{noCopy:{} L:0xc002cc1b08 notify:{wait:0 notify:0 lock:0 head:<nil> tail:<nil>} checker:824680651584} name:session_data-2023-06-15 opens:0 downloaders:<nil> o:0xc004384000 fd:<nil> info:{ModTime:{wall:13949824164969924144 ext:383670916885 loc:0xf311c00} ATime:{wall:13949824164969929009 ext:383670921751 loc:0xf311c00} Size:26130 Rs:[{Pos:0 Size:26130}] Fingerprint:26130,2024-06-28 08:32:44.8085356 +0000 UTC,8b130d7cc8e963678db7ffd89c6218b7 Dirty:false} writeBackID:58 pendingAccesses:0 modified:false beingReset:false},

The config I have been using for the test:

[agent]
  interval = "20s"
  round_interval = true
  metric_batch_size = 2000
  metric_buffer_limit = 80000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  hostname = ""
  omit_hostname = true
  debug = true

[[influx]]

[[outputs.remotefile]]
  remote = 's3,provider=AWS,access_key_id=<>,secret_access_key=<>,region=eu-west-1:bucket'
  files = ['{{.Name}}-{{.Time.Format "2006-01-02"}}']
  data_format = "influx"
[[inputs.http_listener_v2]]
  service_address = ":8186"
  paths = ["/write"]
  methods = ["POST"]
  basic_username = "test"
  basic_password = "test"
  data_format = "influx"
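
For anyone reproducing this: the listener accepts basic-auth POSTs of influx line protocol, e.g. curl -u test:test -X POST http://localhost:8186/write --data-binary 'session_data value=1'.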

Unfortunately, I do not have the capacity at the moment to refine this and propose an official PR; probably sometime in the future :smiley:

srebhan commented 3 days ago

@IvanoCar first of all thanks for your valuable feedback! Let me address your points one-by-one...

I've chosen the rclone library as it supports different providers and allows adding other remote filesystems as well. IMO there is no point in reimplementing all of this ourselves... The "errors" you are seeing are internal logs of the underlying library denoting that fast multi-part uploads failed; however, those errors are handled internally with retries, so there is nothing to worry about.

Regarding your items:

  1. Fixed in the updated PR. I simply forgot to create the directories... :-)
  2. Fixed in the updated PR.
  3. Added a now function that can be used in the template. With the updated PR you can do {{now.Format "2006-01-02"}} to use the current time instead of the metric time (see the example below)...
  4. As I outlined above, the "errors" are not actually errors but internal logs that do not lead to data loss. I have (hopefully) silenced those with the updated PR.
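
For example, starting from the sample config above, the files entry could then be

    files = ['{{.Name}}-{{now.Format "2006-01-02"}}']

which names files after the wall-clock day at write time instead of the metric timestamp.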
IvanoCar commented 21 hours ago

Hi @srebhan, happy to contribute!

I have tested with all the changes and it works great! Points 1-3 all work as expected, and the underlying retry logs are gone; hopefully errors will still be visible in the log if they actually happen once the retry policy is depleted (I didn't dig deeply into how rclone works in that regard).

I would maybe add info to the sample config about the now example in the template, but since it is added in the README I guess it's fine; not sure what the convention about that is :smiley:. Tnx and nice work!

srebhan commented 1 hour ago

@IvanoCar errors should be logged if writing fails. Will add an example for the now function to the README...