Open elangovanseshan opened 10 months ago
@powersj New issue created
As an aside, so far this issue appears to be absent from 1.24.2.
Here are the dbus details we are seeing on the server:
root 336 1 0 07:03 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 353 1 0 06:16 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 372 1 0 03:43 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 385 1 0 08:38 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 426 1 0 07:03 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 427 1 0 05:21 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 458 1 0 08:39 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 495 1 0 09:25 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 569 1 0 04:30 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 602 1 0 07:04 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 643 1 0 08:39 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 701 1 0 09:26 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 823 1 0 09:26 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 840 1 0 06:17 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 884 1 0 04:31 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 896 1 0 02:47 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 936 1 0 05:21 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 940 1 0 07:04 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 951 1 0 08:40 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 982 1 0 09:27 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 1020 1 0 03:44 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 1068 1 0 05:22 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 1111 1 0 07:55 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 1116 1 0 04:31 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 1210 1 0 07:55 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 1230 1 0 03:44 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 1305 1 0 03:45 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 1429 1 0 02:47 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 1518 1 0 03:45 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 1736 1 0 03:46 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 1756 1 0 07:56 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 1815 1 0 09:27 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 1874 1 0 07:56 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 1887 1 0 09:28 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 1937 1 0 04:32 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 1963 1 0 03:46 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 2022 1 0 08:41 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 2029 1 0 04:32 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 2044 1 0 09:28 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 2112 1 0 07:57 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 2118 1 0 09:29 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 2154 1 0 05:22 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 2156 1 0 04:33 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 2184 1 0 09:29 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 2211 1 0 02:48 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 2242 1 0 07:57 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 2253 1 0 08:41 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 2272 1 0 04:33 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 2298 1 0 05:23 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 2330 1 0 09:30 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 2353 1 0 07:58 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 2365 1 0 04:34 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 2408 1 0 07:58 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 2418 1 0 07:05 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 2461 1 0 05:23 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 2480 1 0 07:59 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 2602 1 0 07:05 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 2623 1 0 04:34 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 2643 1 0 07:59 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 2678 1 0 03:47 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 2681 1 0 08:42 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 2690 1 0 05:24 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 2704 1 0 04:35 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 2748 1 0 04:35 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 2788 1 0 07:06 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
telegraf 5850 1 0 06:54 ? 00:00:02 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf telegraf.d
telegraf 5866 1 0 06:54 ? 00:00:00 /usr/bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 11702 24282 0 07:34 pts/0 00:00:00 grep --color=auto -i telegraf
Is there a way to disable the secret store completely? We don't use it and some component related to it seems to be causing the issues.
Thanks for the issue and logs. Are you seeing this across RHEL 6, 7, and 8 this time, or only RHEL 6? I have a RHEL 7 VM up looping over telegraf with --once
to see if I can catch multiple dbus-daemons starting. I am over 10k loops and nothing has shown up yet.
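For anyone who wants to run the same check, a minimal sketch of such a loop (the config path and loop count are illustrative):
for i in $(seq 1 10000); do
  /usr/bin/telegraf --config /etc/telegraf/telegraf.conf --once >/dev/null 2>&1
done
ps -ef | grep '[d]bus-daemon' | wc -l   # count leftover dbus-daemon processes; the [d] keeps grep from matching itself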
Is there a way to disable the secret store completely?
Only with a custom build of Telegraf.
Assuming the issue is in the same secret store code as last time, that dbus command runs in the init
function of that library, which means it executes as soon as the library is imported, before we have a chance to do anything else.
Thanks Joshua for your reply. I see this issue only on RHEL 6; I have the latest version deployed on RHEL 7/8 and there I don't see any issue with dbus.
ps -ef|grep -i telegraf
telegraf 14007 1 10 09:49 ? 00:00:01 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 14048 1 0 09:49 ? 00:00:00 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root 14366 7614 0 09:49 pts/0 00:00:00 grep -i telegraf
@crflanigan, @elangovanseshan,
If you must have the newer version of Telegraf on RHEL 6, my suggestion is to build telegraf with the custom builder. The result is a ~23 MB binary containing only the plugins you need; the secret store plugins would not be present.
git clone https://github.com/influxdata/telegraf
cd telegraf
go build -o ./tools/custom_builder/custom_builder ./tools/custom_builder
./tools/custom_builder/custom_builder --config <conf_file> --config-dir <conf_dir>
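As a sketch, the <conf_file> only needs the plugin sections you actually use. For example (the plugin choices below are placeholders, not a recommendation):
cat > minimal.conf <<'EOF'
[[inputs.cpu]]

[[outputs.file]]
  files = ["stdout"]
EOF
./tools/custom_builder/custom_builder --config minimal.conf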
Would this be an option for you?
Hi @powersj,
We can look at that. I had thought that, as of release 1.25, the secret store was core to Telegraf irrespective of the configuration you use. Is that right?
Is Telegraf no longer supported on RHEL 6? If so, what was the last release that supported it?
Thanks!
Thank you @powersj, let me try the custom builder without the secret store plugins.
I had thought that, as of release 1.25, the secret store was core to Telegraf irrespective of the configuration you use. Is that right?
Internally to Telegraf, the secret stores are treated like the other plugins, so you can build telegraf without them.
Is Telegraf no longer supported on RHEL 6? If so, what was the last release that supported it?
We have a published doc for supported platforms, which essentially says we support OSes that are under standard support. In line with that, RHEL 6 left standard support at the end of 2020, and RHEL 7 will follow in June 2024.
While we will not go out of our way to break older platforms, if we do make a change that breaks them we are less inclined to revert it, nor will we continue to test on them.
Ok @powersj,
It sounds like patching this issue is unlikely since it's occurring on an unsupported OS, is that right?
Thanks!
It sounds like patching this issue is unlikely since it's occurring on an unsupported OS, is that right?
If you proposed a PR or an idea to get around this we would certainly consider it. We are not going to completely close the door on a fix.
@powersj,
Fair enough, thanks buddy!
Thanks @powersj! The custom_builder is working fine for me. I passed a sample conf file to build the binary; it contains the cpu, disk, diskio, exec, mem, net, swap, and system input plugins, and it works fine. However, multiple internal teams use input plugins beyond those, so a binary with only this limited set would affect them. We would therefore like to build a custom binary with all input plugins except the secret store plugins. Is there a way to build it without passing a conf file for each input plugin, or can we build with dummy conf files that omit the secret store plugins?
We would therefore like to build a custom binary with all input plugins except the secret store plugins.
You can get a list of all the input plugins by generating the default config and grep'ing out all the input headers:
make
./telegraf config > default.toml
grep "^# \[\[inputs.*\]\]" default.toml | cut -d' ' -f2 | sort | uniq
You could then add that to your example config or pass that as a second file to the custom builder.
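As a sketch of that second-file approach (this assumes the builder accepts the --config flag more than once, per the suggestion above):
# each matched line is already a valid [[inputs.*]] header, so the file doubles as a dummy config
grep "^# \[\[inputs.*\]\]" default.toml | cut -d' ' -f2 | sort | uniq > all_inputs.conf
./tools/custom_builder/custom_builder --config all_inputs.conf --config telegraf.conf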
You could also use the various build tags to build telegraf, as the customization docs show, using BUILDTAGS:
BUILDTAGS="custom,aggregators,inputs,outputs,parsers,processors,serializers" make
If you do go this route, please ensure you include everything you actually need ;) It is easy to forget, or not realize, that you are using a serializer, for example. This is why I like the custom builder plus an actual config better.
@powersj our initial testing with the custom Telegraf, built with a limited set of input and output plugins, is working fine, and there is no evidence of the dbus processes.
I would also like to know how to add serializers to the custom build. I added the required input, output, aggregator, and processor plugins through the example conf, but I am not sure about serializers.
Do we need to pass them through the conf file, or is there another option?
Do we need to pass them through the conf file, or is there another option?
You can reference any of the serializers the same way. For example, if you want only the JSON serializer you can add serializers.json to the build tags.
The way to determine these build tags is to look in each plugin's all folder and check the build tags at the top of the file. For example, in the JSON all file you can see that the JSON serializer is imported if this is not a custom build, if a user specifies serializers (which pulls in all serializers), or if they specify serializers.json.
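For instance, a sketch of a tag-based build that pulls in only the JSON serializer alongside a few plugins (the specific plugin tags are illustrative; include whatever you actually use):
BUILDTAGS="custom,inputs.cpu,inputs.mem,outputs.file,serializers.json" make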
Does that help?
Thank you @powersj, let me try this out.
One more thing for your information: initially I reported that the dbus issue was happening only on RHEL 6 servers, but we have seen it on RHEL 7/8 as well.
So we are planning to go with a custom telegraf with limited plugins.
@elangovanseshan, @crflanigan,
but we have seen it on RHEL 7/8 as well.
Sorry I never responded to this. Looking at the mentioned gosnowflake issue, it looks like a workaround is setting DBUS_SESSION_BUS_ADDRESS=$XDG_RUNTIME_DIR/bus in the environment as well.
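For a systemd-managed Telegraf (RHEL 7/8), a minimal sketch of applying that workaround via a drop-in; /dev/null is the fallback that the gosnowflake warning quoted later in this thread suggests when $XDG_RUNTIME_DIR/bus does not exist:
mkdir -p /etc/systemd/system/telegraf.service.d
cat > /etc/systemd/system/telegraf.service.d/dbus.conf <<'EOF'
[Service]
Environment=DBUS_SESSION_BUS_ADDRESS=/dev/null
EOF
systemctl daemon-reload && systemctl restart telegraf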
For Telegraf, I am inclined to document this and link to the still open upstream issue. Thoughts?
Hi @powersj,
Sorry for the delayed response.
I actually commented on one of these issues for keyring and got a notification this morning that they may have resolved it? Seems like a lot of people use this library.
https://github.com/99designs/keyring/issues/103
What do you think?
Hey @crflanigan,
Did someone delete their comment? Latest I see is from Apr 12, 2023.
@powersj I think @crflanigan was referring to https://github.com/snowflakedb/gosnowflake/issues/773#issuecomment-2024775431
BTW, I now have that message even when not using outputs.sql at all.
WARN[0000]log.go:244 gosnowflake.(*defaultLogger).Warn DBUS_SESSION_BUS_ADDRESS envvar looks to be not set, this can lead to runaway dbus-daemon processes. To avoid this, set envvar DBUS_SESSION_BUS_ADDRESS=$XDG_RUNTIME_DIR/bus (if it exists) or DBUS_SESSION_BUS_ADDRESS=/dev/null.
2024-05-31T14:00:14Z I! Loading config: test.toml
2024-05-31T14:00:14Z I! Starting Telegraf 1.31.0-35bff98f brought to you by InfluxData the makers of InfluxDB
2024-05-31T14:00:14Z I! Available plugins: 234 inputs, 9 aggregators, 32 processors, 26 parsers, 60 outputs, 6 secret-stores
2024-05-31T14:00:14Z I! Loaded inputs: snmp
2024-05-31T14:00:14Z I! Loaded aggregators:
2024-05-31T14:00:14Z I! Loaded processors:
2024-05-31T14:00:14Z I! Loaded secretstores:
2024-05-31T14:00:14Z W! Outputs are not used in testing mode!
2024-05-31T14:00:14Z I! Tags enabled:
This does not happen with telegraf 1.30.3
Relevant telegraf.conf
Logs from Telegraf
System info
telegraf-1.27.2, running on Linux 2.6.32-754.50.1.el6.x86_64
Docker
No response
Steps to reproduce
Reproducing has been tricky, as it does not always occur, but on the systems that were impacted (hundreds of them), reverting Telegraf to an earlier version, stopping the Telegraf service and removing the orphaned processes, or performing the actions below resolved the issue.
What we have seen: upgrading Telegraf from version 1.14 to 1.25.2 on RHEL servers seems to create an issue where dbus generates many orphaned processes. This eventually causes the system to hit the ceiling of available PIDs. Rolling back to 1.14 seems to clear the problem.
Example from one of our systems:
ps -ef | grep dbus | grep -v grep | wc -l
1459
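To gauge how close that is to the ceiling, compare against the kernel's PID limit:
cat /proc/sys/kernel/pid_max   # commonly 32768 by default, so leaked daemons exhaust it quickly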
Based on issue https://github.com/influxdata/telegraf/issues/13481, this was resolved in the recent release telegraf-1.27.2, but we are experiencing the same issue with that release as well.
Expected behavior
Telegraf works as expected.
Actual behavior
Telegraf inadvertently creates thousands of orphaned dbus processes, which eventually causes the number of PIDs in use to hit the maximum ceiling and degrades the system.
Additional info
No response