Closed Darth-Bobo closed 2 years ago
Hi,
Keep in mind that this repo contains container images for more than telegraf. I am going to assume you are using telegraf given that is the most recent version, but it would help to specify :)
We haven't made any changes to the container configuration or setup in the last few minor releases either.
To rule out other processors, outputs, or anything else that might be interacting with the internal plugin, can you confirm you see nothing with the following config:
[[inputs.internal]]
[[outputs.file]]
Apologies, I didn't notice this was a more general repo when I followed the link from the telegraf docker pages.
Yes, Telegraf.
Yes, I see nothing with that config on 1.24.2 but I see the internal stats about every 10s - as expected - using the same config with 1.24.1.
The really odd thing is that if you use "docker run" on the host instance everything is OK; it is only a problem when the container is launched using ECS. I assume it must be something linked to the ECS framework, and as far as I can see other plugins like inputs.docker and outputs.timestream are working as normal.
ok
Can you get any logs from the container, both ECS-specific logs and logs from telegraf? Is anything present in either of those? Does it work for any amount of time? Were you able to try specifying only the internal + file plugins I mentioned earlier?
That very minimal config produced the correct output, but our usual config still does not. Today I've spent several hours adding and removing sections from the configuration file, and I now seem to have a way of reproducing the problem. You will be glad to know that it can be reproduced without ECS; in fact I get the behaviour with Docker on a Mac.
This config will provide the expected output in 1.24.1, but not in 1.24.2:
[agent]
interval = "60s"
[[aggregators.merge]]
drop_original = true
[[processors.dedup]]
[[inputs.internal]]
collect_memstats = true
[[outputs.file]]
If I put it in /tmp/telegraf.conf then run Telegraf like this:
docker run -v /tmp/telegraf.conf:/etc/telegraf/telegraf.conf telegraf:1.24.1
I see the expected internal_* metrics. But if I run it like this:
docker run -v /tmp/telegraf.conf:/etc/telegraf/telegraf.conf telegraf:1.24.2
Then I see no output after the [agent] Config line even after 5-6 minutes.
If the [agent] interval is put back to the default 10s, or drop_original is commented out of [[aggregators.merge]], or the [[processors.dedup]] is removed then 1.24.2 behaves correctly.
As far as I can see the issue only happens when [agent] interval is not at the default and [[processors.dedup]] and [[aggregators.merge]] drop_original = true are all used together, and it only seems to be [[inputs.internal]] that is affected.
At least I now know how to tweak my configuration to get everything going again.
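For illustration, one of the three workarounds described above (commenting out drop_original) might look like the sketch below. It is based only on the reproduction config in this thread and is not a tested recommendation. Note that with drop_original left at its default of false, the original metrics are passed through alongside the merged ones, so the output is not identical to the original setup.

```toml
[agent]
  # Non-default interval, kept as in the reproduction config
  interval = "60s"

[[aggregators.merge]]
  # Workaround sketch: leave drop_original at its default (false) on 1.24.2
  # drop_original = true

[[processors.dedup]]

[[inputs.internal]]
  collect_memstats = true

[[outputs.file]]
```

The other two options mentioned above, restoring the default 10s interval or removing [[processors.dedup]], would be equivalent single-line changes to the same file.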
Thanks for that! I can reproduce that locally now, so we know it is not something ECS specific:
❯ ./telegraf --config config.toml
2022-10-12T18:32:49Z I! Starting Telegraf 1.25.0-fae64e2a
2022-10-12T18:32:49Z I! Available plugins: 223 inputs, 9 aggregators, 26 processors, 20 parsers, 57 outputs
2022-10-12T18:32:49Z I! Loaded inputs: internal
2022-10-12T18:32:49Z I! Loaded aggregators: merge
2022-10-12T18:32:49Z I! Loaded processors: dedup
2022-10-12T18:32:49Z I! Loaded outputs: file
2022-10-12T18:32:49Z I! Tags enabled: host=ryzen
2022-10-12T18:32:49Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"ryzen", Flush Interval:10s
^C2022-10-12T18:36:17Z I! [agent] Hang on, flushing any cached metrics before shutdown
2022-10-12T18:36:17Z I! [agent] Stopping running outputs
❯ ../telegraf-builds/telegraf-v1.24.1 --config config.toml
2022-10-12T18:36:32Z I! Starting Telegraf 1.24.1
2022-10-12T18:36:32Z I! Available plugins: 222 inputs, 9 aggregators, 26 processors, 20 parsers, 57 outputs
2022-10-12T18:36:32Z I! Loaded inputs: internal
2022-10-12T18:36:32Z I! Loaded aggregators: merge
2022-10-12T18:36:32Z I! Loaded processors: dedup
2022-10-12T18:36:32Z I! Loaded outputs: file
2022-10-12T18:36:32Z I! Tags enabled: host=ryzen
2022-10-12T18:36:32Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"ryzen", Flush Interval:10s
internal_memstats,host=ryzen sys_bytes=40482056i,mallocs=168109i,frees=87154i,heap_in_use_bytes=16252928i,heap_objects=80955i,alloc_bytes=13591248i,pointer_lookups=0i,heap_alloc_bytes=13591248i,heap_sys_bytes=23887872i,heap_idle_bytes=7634944i,heap_released_bytes=7118848i,num_gc=6i,total_alloc_bytes=22235944i 1665599820000000000
internal_aggregate,aggregator=merge,host=ryzen,version=1.24.1 push_time_ns=880i,errors=0i,metrics_pushed=0i,metrics_filtered=0i,metrics_dropped=0i 1665599820000000000
internal_agent,go_version=1.19.1,host=ryzen,version=1.24.1 metrics_dropped=0i,metrics_gathered=1i,gather_errors=0i,metrics_written=0i 1665599820000000000
<snip>
The good news is there are not that many changes between v1.24.1 and v1.24.2. The one that stands out the most is https://github.com/influxdata/telegraf/commit/57fded8c9babf1c1b08af0e8a202f19630f6f29b and in fact reverting this commit restores the output.
@srebhan can you look into this one?
@Darth-Bobo can you please test whether PR influxdata/telegraf#12081 fixes your issue? You might want to use the artifact built in the mentioned PR and comment there or in influxdata/telegraf#12080 on your findings...
@srebhan I can confirm that the PR does indeed fix the problem for me. Thank you for the quick turnaround!
Bit of an odd one this: 1.24.2 does not collect any data from the inputs.internal plugin, but only when the container is launched by ECS on AWS.
If the same configuration is launched on the host from the command line the collection works fine, and so does 1.24.1.
Any ideas gratefully received, it's driving me up the wall because I can't find any obvious problems.