Google Cloud Storage output plugin

stijn-vanbael-enprove commented 2 months ago

Use Case

I want to store incoming metrics directly into the cheapest form of storage we have, which on Google Cloud is Google Cloud Storage.

Example usage:

[[outputs.google_cloud_storage]]
  bucket = "my-bucket"
  data_format = "influx"
  credentials_file = "path/to/my/creds.json"
  metrics_per_object = 1
  group_by = "day"
  object_suffix = ".line"

Expected behavior

The configuration above will result in one object per metric in the bucket "my-bucket", with the following object name: <measurement>/<date>/<timestamp>.line

Actual behavior

There is no support for outputting to Google Cloud Storage yet.

Additional info

No response

powersj commented 2 months ago

Hi,

Some questions around the proposal:

Have you looked into how to manage credentials?

bucket = "my-bucket"

Would telegraf create the bucket or would we assume the user has created it?

metrics_per_object = 1

If you have 20 metrics, would you then write 20 files at every interval? Likewise, if you have 10,000 metrics, 10,000 files? Rather than dividing, shouldn't a plugin respect the serializer's batch format setting instead?

<measurement>/<date>/<timestamp>.line

group_by = "day"

What are you assuming date would look like? 2005-01-02? Are you assuming telegraf would create and manage different folders and auto-create new ones? How does that relate to the group by?

Are you planning to submit a PR?

stijn-vanbael-enprove commented 2 months ago

Have you looked into how to manage credentials?

I assumed credentials would work in the same way as they do for the google_cloud_storage input plugin.

Would telegraf create the bucket or would we assume the user has created it?

Creating the bucket is not a hard requirement for me, but it would be nice if Telegraf could take care of it.

If you have 20 metrics, would you then write 20 files at every interval? Likewise, if you have 10,000 metrics, 10,000 files? Rather than dividing, shouldn't a plugin respect the serializer's batch format setting instead?

Right, this is better handled by the serializer indeed.
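
To make the difference between one file per metric and one file per batch concrete, here is a rough Go sketch. It assumes the metrics have already been serialized to lines and assigned a target object name; the `serialized` type and the `groupByObject` function are hypothetical and not part of Telegraf.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// serialized pairs one already-serialized metric line with the object it
// should land in. The type and its field names are hypothetical.
type serialized struct {
	Object string
	Line   string
}

// groupByObject concatenates the lines of a batch per target object, so each
// object is written once per flush instead of once per metric.
func groupByObject(batch []serialized) map[string]string {
	grouped := make(map[string]*strings.Builder)
	for _, s := range batch {
		b, ok := grouped[s.Object]
		if !ok {
			b = &strings.Builder{}
			grouped[s.Object] = b
		}
		b.WriteString(s.Line)
		b.WriteString("\n")
	}
	out := make(map[string]string, len(grouped))
	for name, b := range grouped {
		out[name] = b.String()
	}
	return out
}

func main() {
	batch := []serialized{
		{Object: "cpu/2005-01-02.line", Line: "cpu usage=1 1"},
		{Object: "mem/2005-01-02.line", Line: "mem used=2 1"},
		{Object: "cpu/2005-01-02.line", Line: "cpu usage=3 2"},
	}
	bodies := groupByObject(batch)

	// Sort the names only to get a stable print order.
	names := make([]string, 0, len(bodies))
	for n := range bodies {
		names = append(names, n)
	}
	sort.Strings(names)
	for _, n := range names {
		fmt.Printf("%s: %d bytes\n", n, len(bodies[n]))
	}
}
```

With this shape, three metrics produce two uploads rather than three, and the number of uploads scales with the number of distinct object names, not with the number of metrics.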

What are you assuming date would look like? 2005-01-02? Are you assuming telegraf would create and manage different folders and auto-create new ones? How does that relate to the group by?

2005-01-02 would be a good format, but maybe it's better to have it configurable. Google Cloud Storage doesn't actually have folders. It just groups files for you in a folder-like structure when you use slashes in the object name.

Are you planning to submit a PR?

I'm afraid not

powersj commented 2 months ago

2005-01-02 would be a good format, but maybe it's better to have it configurable. Google Cloud Storage doesn't actually have folders. It just groups files for you in a folder-like structure when you use slashes in the object name.

Right, however, even in your original request you started giving the objects a path, so I assume others would ask the same. We could do something similar to what we do in opensearch, where the index name takes a Golang template.

What I am thinking then is a config like this:

[[outputs.google_cloud_storage]]
  ## Bucket
  ## Name of Cloud Storage bucket to send metrics to.
  bucket = ""

  ## Object name
  ## Target object name for metrics. This is a Golang template (see
  ## https://pkg.go.dev/text/template). You can also specify metric name
  ## (`{{.Name}}`), tag value (`{{.Tag "tag_name"}}`), field value
  ## (`{{.Field "field_name"}}`), or timestamp (`{{.Time.Format "xxxxxxxxx"}}`).
  ## If the tag does not exist, the default tag value will be empty string "".
  ##
  ## For example: "telegraf-{{.Time.Format \"2006-01-02\"}}-{{.Tag \"host\"}}" 
  ## would set it to `telegraf-2023-07-27-HostName`
  object_name = ""

  ## Data format to output
  ## Each data format has its own unique set of configuration options, read
  ## more about them here:
  ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_OUTPUT.md
  # data_format = "influx"

  ## Credentials file
  ## Optional. File path for GCP credentials JSON file to authorize calls to
  ## Google Cloud Storage APIs. If not set explicitly, Telegraf will attempt to use
  ## Application Default Credentials, which is preferred.
  # credentials_file = "path/to/my/creds.json"
