Open IvanoCar opened 1 week ago
@IvanoCar please test the binary in PR #15569, available once CI has finished the tests, and let me know if it works for you. You should be able to start from this config:
```toml
# Send telegraf metrics to file(s) in a remote filesystem
[[outputs.remotefile]]
  ## Remote location according to https://rclone.org/#providers
  ## Check the backend configuration options and specify them in
  ##   <backend type>[,<param1>=<value1>[,...,<paramN>=<valueN>]]:[root]
  ## for example:
  remote = 's3,provider=AWS,access_key_id=your-access-key,secret_access_key=your-secret-key,session_token=your-token,region=eu-north-1:mybucket'

  ## Files to write in the remote location.
  ## Each file can be a Golang template for generating the filename from metrics.
  ## See https://pkg.go.dev/text/template for a reference and use the metric
  ## name (`{{.Name}}`), tag values (`{{.Tag "name"}}`), field values
  ## (`{{.Field "name"}}`) or the metric time (`{{.Time}}`) to derive the
  ## filename.
  files = ['{{.Name}}-{{.Time.Format "2006-01-02"}}']

  ## Use batch serialization format instead of line-based delimiting.
  ## The batch format allows for the production of non-line-based output formats
  ## and may more efficiently encode metrics.
  # use_batch_format = false

  ## Cache settings
  ## Time to wait for all writes to complete on shutdown of the plugin.
  # final_write_timeout = "10s"
  ## Time to wait between writing to a file and uploading to the remote location
  # cache_write_back = "5s"
  ## Maximum size of the cache on disk (infinite by default)
  # cache_max_size = -1

  ## Data format to output.
  ## Each data format has its own unique set of configuration options, read
  ## more about them here:
  ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_OUTPUT.md
  data_format = "influx"
```
Hi @srebhan, thanks for your response!
I have tested it out for AWS and noticed a few things:

- Adding a folder name in front of the filename pattern you suggested is not possible (it fails), but I guess this could be resolved by creating the folder in the `Write` method if it does not exist on the remote. Something like this:
  ```toml
  files = ['sessions/{{.Name}}-{{.Time.Format "2006-01-02"}}']
  ```
- Auth is not verified in the `Connect` method: the logs state that the connection succeeded, but it later fails during `Write`.
- It is very useful to have the time of the actual metric in the filename, but it would also be useful to have an option (flag) to use the current time (the time of arrival of the metric) in the filename instead.
- Rclone does not seem to be as stable as using the `aws-sdk` directly (which I started implementing): under the same load and the same agent configuration it produces some errors (below). On the other hand, it is faster than the `aws-sdk` (around 80 ms vs. 150 ms for 2000 metrics). The only difference is that I am using `output.s3` as in my original suggestion.
Errors:
```
2024-06-28T08:31:33Z D! [outputs.remotefile] Buffer fullness: 1000 / 80000 metrics
2024-06-28T08:31:35Z I! ERROR : session_data-2023-09-08: Failed to copy: RequestCanceled: request context canceled
caused by: context canceled
2024-06-28T08:31:35Z I! ERROR : session_data-2023-09-08: vfs cache: failed to upload try #1, will retry in 10s: vfs cache: failed to transfer file from cache to remote: RequestCanceled: request context canceled
caused by: context canceled
2024-06-28T08:31:35Z I! ERROR : session_data-2023-06-07: Failed to copy: RequestCanceled: request context canceled
caused by: context canceled
2024-06-28T08:31:35Z I! ERROR : session_data-2023-06-07: vfs cache: failed to upload try #1, will retry in 10s: vfs cache: failed to transfer file from cache to remote: RequestCanceled: request context canceled
caused by: context canceled
2024-06-28T08:31:35Z I! ERROR : session_data-2024-02-02: Failed to copy: RequestCanceled: request context canceled
caused by: context canceled
2024-06-28T08:31:35Z I! ERROR : session_data-2024-02-02: vfs cache: failed to upload try #1, will retry in 10s: vfs cache: failed to transfer file from cache to remote: RequestCanceled: request context canceled
caused by: context canceled
2024-06-28T08:31:35Z I! ERROR : session_data-2023-06-09: Failed to copy: RequestCanceled: request context canceled
caused by: context canceled
2024-06-28T08:31:35Z I! ERROR : session_data-2023-06-09: vfs cache: failed to upload try #1, will retry in 10s: vfs cache: failed to transfer file from cache to remote: RequestCanceled: request context canceled
caused by: context canceled
2024-06-28T08:31:35Z I! ERROR : session_data-2023-10-03: Failed to copy: RequestCanceled: request context canceled
caused by: context canceled
2024-06-28T08:31:35Z I! ERROR : session_data-2023-10-03: vfs cache: failed to upload try #1, will retry in 10s: vfs cache: failed to transfer file from cache to remote: RequestCanceled: request context canceled
caused by: context canceled
2024-06-28T08:31:35Z I! ERROR : session_data-2023-06-01: Failed to copy: RequestCanceled: request context canceled
caused by: context canceled
2024-06-28T08:31:35Z I! ERROR : session_data-2023-06-01: vfs cache: failed to upload try #1, will retry in 10s: vfs cache: failed to transfer file from cache to remote: RequestCanceled: request context canceled
caused by: context canceled
2024-06-28T08:31:35Z I! ERROR : session_data-2024-01-05: Failed to copy: RequestCanceled: request context canceled
caused by: context canceled
2024-06-28T08:31:35Z I! ERROR : session_data-2024-01-05: vfs cache: failed to upload try #1, will retry in 10s: vfs cache: failed to transfer file from cache to remote: RequestCanceled: request context canceled
caused by: context canceled
2024-06-28T08:31:35Z D! [outputs.remotefile] Wrote batch of 2000 metrics in 117.708198ms
2024-06-28T08:31:35Z D! [outputs.remotefile] Buffer fullness: 1500 / 80000 metrics
2024-06-28T08:31:37Z D! [outputs.remotefile] Wrote batch of 2000 metrics in 102.100369ms
2024-06-28T08:31:37Z D! [outputs.remotefile] Buffer fullness: 2000 / 80000 metrics
2024-06-28T08:31:37Z D! [outputs.remotefile] Wrote batch of 2000 metrics in 87.278875ms
2024-06-28T08:31:37Z D! [outputs.remotefile] Buffer fullness: 0 / 80000 metrics
2024-06-28T08:31:39Z D! [outputs.remotefile] Wrote batch of 2000 metrics in 105.098917ms
```
Something like this is also visible in the logs:
```
"session_data-2022-06-27": &{c:0xc00022f680 mu:{state:0 sema:0} cond:{noCopy:{} L:0xc0030a3108 notify:{wait:0 notify:0 lock:0 head:<nil> tail:<nil>} checker:824684720448} name:session_data-2022-06-27 opens:0 downloaders:<nil> o:0xc004179a70 fd:<nil> info:{ModTime:{wall:13949824168614839499 ext:387020864947 loc:0xf311c00} ATime:{wall:13949824168614842398 ext:387020867846 loc:0xf311c00} Size:29200 Rs:[{Pos:0 Size:29200}] Fingerprint:18241,2024-06-28 08:32:08.375413308 +0000 UTC,75f1fa334bf15a8e09f82f99c7d7f95d Dirty:true} writeBackID:234 pendingAccesses:0 modified:false beingReset:false},
"session_data-2023-03-04": &{c:0xc00022f680 mu:{state:0 sema:0} cond:{noCopy:{} L:0xc00319f008 notify:{wait:0 notify:0 lock:0 head:<nil> tail:<nil>} checker:824685752384} name:session_data-2023-03-04 opens:0 downloaders:<nil> o:<nil> fd:<nil> info:{ModTime:{wall:13949824167155078702 ext:385708587798 loc:0xf311c00} ATime:{wall:13949824167155083386 ext:385708592481 loc:0xf311c00} Size:46425 Rs:[{Pos:0 Size:46425}] Fingerprint: Dirty:true} writeBackID:326 pendingAccesses:0 modified:false beingReset:false},
"session_data-2023-06-15": &{c:0xc00022f680 mu:{state:0 sema:0} cond:{noCopy:{} L:0xc002cc1b08 notify:{wait:0 notify:0 lock:0 head:<nil> tail:<nil>} checker:824680651584} name:session_data-2023-06-15 opens:0 downloaders:<nil> o:0xc004384000 fd:<nil> info:{ModTime:{wall:13949824164969924144 ext:383670916885 loc:0xf311c00} ATime:{wall:13949824164969929009 ext:383670921751 loc:0xf311c00} Size:26130 Rs:[{Pos:0 Size:26130}] Fingerprint:26130,2024-06-28 08:32:44.8085356 +0000 UTC,8b130d7cc8e963678db7ffd89c6218b7 Dirty:false} writeBackID:58 pendingAccesses:0 modified:false beingReset:false},
```
The config I have been using for the test:

```toml
[agent]
  interval = "20s"
  round_interval = true
  metric_batch_size = 2000
  metric_buffer_limit = 80000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  hostname = ""
  omit_hostname = true
  debug = true

[[influx]]

[[outputs.remotefile]]
  remote = 's3,provider=AWS,access_key_id=<>,secret_access_key=<>,region=eu-west-1:bucket'
  files = ['{{.Name}}-{{.Time.Format "2006-01-02"}}']
  data_format = "influx"

[[inputs.http_listener_v2]]
  service_address = ":8186"
  paths = ["/write"]
  methods = ["POST"]
  basic_username = "test"
  basic_password = "test"
  data_format = "influx"
```
Unfortunately, I do not have the capacity at the moment to refine this and open an official PR; probably sometime in the future :smiley:
@IvanoCar first of all, thanks for your valuable feedback! Let me address your points one by one...
I've chosen the rclone library because it supports different providers and allows adding other remote filesystems as well. IMO there is no point in reimplementing all of this ourselves... The "errors" you are seeing are internal logs of the underlying library denoting that fast multi-part uploads failed; however, those errors are handled internally with retries, so there is nothing to worry about.
Regarding your items: I added a `now` function that can be used in the template, so with the updated PR you can do `{{now.Format "2006-01-02"}}` to use the current time instead of the metric time...

Hi @srebhan, happy to contribute!
I have tested with all the changes and it works great! Points 1-3 all work as expected, and the underlying logs about retries are gone; hopefully errors will still be visible in the log if they actually happen once the retry policy is exhausted (I didn't dig deeply into how rclone works in that regard).
I would maybe add the `now` template example to the sample config, but since it is added in the README I guess that's fine; I am not sure what the convention is about that :smiley:. Thanks and nice work!
@IvanoCar errors should be logged if writing fails. I will add an example for `now` in the README...
Use Case
I would like to open a pull request to get input from the community. The use case is to take metrics from inputs and write them to an S3 bucket at a specific path.
Having data ingested via Telegraf on S3, used for example as a data lake, is useful because the data can serve various analytics purposes and can be considered an enrichment of data already available from other sources.
Expected behavior
I expect files to be written to an S3 bucket and the specified subfolders. Auth can be handled via an IAM user on AWS.
Actual behavior
This is currently not supported in Telegraf.
Additional info
Config could look like this:
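(The concrete config from the original issue is not preserved in this excerpt. As a purely hypothetical sketch of the shape such a dedicated S3 output could take, with all option names assumed rather than taken from any existing plugin:)

```toml
# Hypothetical sketch only -- "outputs.s3" and its options are assumptions,
# not an existing Telegraf plugin.
[[outputs.s3]]
  region = "eu-west-1"
  bucket = "mybucket"
  ## Prefix (subfolder) under which files are written
  key_prefix = "sessions/"
  data_format = "influx"
```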