Closed ktsaou closed 3 years ago
I had a quick look at the chrony code; it seems that we would need to implement their protocol-version handling. The tracking structure has been modified in a number of protocol versions, if I understood that correctly.
On Tue, Sep 12, 2017 at 9:20 AM, Costa Tsaousis notifications@github.com wrote:
We have added chrony support to netdata.
It seems that most newer systems run chrony, so this is good.
However, we have implemented it with ExecutableService, which results in 1 fork per second on most systems.
I see in the code of chrony, the tracking request is relatively simple:
    static int
    process_cmd_tracking(char *line)
    {
      CMD_Request request;
      CMD_Reply reply;
      IPAddr ip_addr;
      uint32_t ref_id;
      char name[50];
      struct timespec ref_time;

      request.command = htons(REQ_TRACKING);
      if (!request_reply(&request, &reply, RPY_TRACKING, 0))
        return 0;

      ref_id = ntohl(reply.data.tracking.ref_id);

      UTI_IPNetworkToHost(&reply.data.tracking.ip_addr, &ip_addr);
      format_name(name, sizeof (name), sizeof (name),
                  ip_addr.family == IPADDR_UNSPEC, ref_id, &ip_addr);

      UTI_TimespecNetworkToHost(&reply.data.tracking.ref_time, &ref_time);

      print_report("Reference ID : %R (%s)\n"
                   "Stratum : %u\n"
                   "Ref time (UTC) : %T\n"
                   "System time : %.9O of NTP time\n"
                   "Last offset : %+.9f seconds\n"
                   "RMS offset : %.9f seconds\n"
                   "Frequency : %.3F\n"
                   "Residual freq : %+.3f ppm\n"
                   "Skew : %.3f ppm\n"
                   "Root delay : %.9f seconds\n"
                   "Root dispersion : %.9f seconds\n"
                   "Update interval : %.1f seconds\n"
                   "Leap status : %L\n",
                   (unsigned long)ref_id, name,
                   ntohs(reply.data.tracking.stratum),
                   &ref_time,
                   UTI_FloatNetworkToHost(reply.data.tracking.current_correction),
                   UTI_FloatNetworkToHost(reply.data.tracking.last_offset),
                   UTI_FloatNetworkToHost(reply.data.tracking.rms_offset),
                   UTI_FloatNetworkToHost(reply.data.tracking.freq_ppm),
                   UTI_FloatNetworkToHost(reply.data.tracking.resid_freq_ppm),
                   UTI_FloatNetworkToHost(reply.data.tracking.skew_ppm),
                   UTI_FloatNetworkToHost(reply.data.tracking.root_delay),
                   UTI_FloatNetworkToHost(reply.data.tracking.root_dispersion),
                   UTI_FloatNetworkToHost(reply.data.tracking.last_update_interval),
                   ntohs(reply.data.tracking.leap_status),
                   REPORT_END);

      return 1;
    }
So, it would be best if we can convert the chrony plugin to SocketService. This would also allow a single netdata to monitor many (local and remote) chrony servers.
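For context, the ExecutableService approach amounts to running chronyc once per collection interval and parsing its text report. Below is a minimal sketch of that parsing step, assuming the report follows the "Label : value" layout of the print_report format string quoted above. The names parse_tracking and NUMERIC_FIELDS and the sample values are hypothetical; none of this is taken from the netdata plugin.

```python
# Labels whose format in the quoted print_report call is a plain number
# followed by an optional unit. Other fields are skipped on purpose.
NUMERIC_FIELDS = {
    "Stratum", "Last offset", "RMS offset", "Residual freq", "Skew",
    "Root delay", "Root dispersion", "Update interval",
}


def parse_tracking(report):
    """Return {label: float} for the numeric fields found in a tracking report."""
    values = {}
    for line in report.splitlines():
        # Split on the first colon only; some values contain colons themselves.
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep or label not in NUMERIC_FIELDS:
            continue
        tokens = rest.split()
        if not tokens:
            continue
        try:
            values[label] = float(tokens[0])
        except ValueError:
            continue  # unexpected content, ignore the line
    return values


# Illustrative sample only; these are made-up values, not real chronyc output.
SAMPLE = """Stratum : 3
Last offset : +0.000012345 seconds
Skew : 0.042 ppm
Leap status : Normal
"""
parsed = parse_tracking(SAMPLE)
```

A socket-based collector would skip this text step entirely and decode the reply structure instead, which is where the protocol-version handling mentioned in this thread comes in.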
cc: @l2isbad https://github.com/l2isbad @domschl https://github.com/domschl @fooltux https://github.com/fooltux
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/firehol/netdata/issues/2724, or mute the thread https://github.com/notifications/unsubscribe-auth/AFY1A8VTwxZX-MUenXRg9u2HMdUHMSogks5shjDbgaJpZM4PUNW9 .
You are right. It is probably too complex. As a workaround I have lowered data collection frequency to 5 seconds, by default, with #2730
Good. From my latest tests, I don't think we lose information by this, and that will keep load down even on embedded systems.
One way towards a socket-based implementation could be PyPI's ntplib, which is also compatible with chrony but seems to implement only protocol version 2 (the protocol is currently at v6, and v4 and v5 added useful statistics). With an unmodified ntplib, only precision, root delay and root dispersion would be available. I will run some tests with this.
I updated my 1.6 instance to 1.8, then to master, and now from time to time I see chronyc using 100% CPU, doing a ton of select() calls:
select(4, [3], [], [], {-82087, 385105}) = -1 EINVAL (Invalid argument)
select(4, [3], [], [], {-82087, 385087}) = -1 EINVAL (Invalid argument)
select(4, [3], [], [], {-82087, 385069}) = -1 EINVAL (Invalid argument)
and it goes on.
Top says: 27976 netdata 20 0 18252 904 732 R 99.7 0.0 458:45.00 /usr/bin/chronyc -n tracking
chrony-3.1-2.el7.centos.x86_64
Not sure if this is the proper issue to report this to, but since the support is new, I might as well :)
@nocturo this seems like a bug in chronyc. I suggest disabling chrony monitoring in netdata in the meantime.
@l2isbad I am thinking of disabling this plugin by default. I really don't like this happening. What do you think?
@nocturo make sure your centos is updated. If it is, we should really disable chrony in netdata by default.
Sounds reasonable. What about changing update_every?
@nocturo please change the update_every to 10 and see if the problem remains.
Hm... I think EINVAL will not be fixed by that. There is a faulty socket setup in chrony. From man select:
EINVAL nfds is negative or exceeds the RLIMIT_NOFILE resource limit (see getrlimit(2)).
man getrlimit says:
RLIMIT_NOFILE
Specifies a value one greater than the maximum file descriptor number that can be opened by this process.
Attempts (open(2), pipe(2), dup(2), etc.) to exceed this limit yield the error EMFILE. (Historically,
this limit was named RLIMIT_OFILE on BSD.)
So, either chronyc passes a negative nfds, or it opens so many files that it exceeds its resource limit (1024 files per process on most systems).
The idea is that chronyc should break the loop if that happens. But it does not.
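The two EINVAL conditions quoted from the man pages can be checked against the strace output above, where nfds is 4. Here is a rough sketch, assuming a Unix-like system where Python's resource module is available. The helper nfds_is_valid is hypothetical and is not part of chronyc or netdata.

```python
import resource


def nfds_is_valid(nfds, soft_limit):
    """True when nfds would not trigger EINVAL per the quoted man page text."""
    if nfds < 0:
        return False
    if soft_limit == resource.RLIM_INFINITY:
        return True  # no limit configured for this process
    return nfds <= soft_limit


# Query the limit of the current process (commonly 1024, as noted above).
soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
print("RLIMIT_NOFILE soft limit:", soft_limit)
print("nfds=4 valid:", nfds_is_valid(4, soft_limit))
```

With a limit of 1024, nfds=4 passes this check, which suggests the other select() arguments are worth a look as well.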
@ktsaou Everything is up2date according to distro. There is a newer 3.2 chrony package that was released not long ago but it's not been packaged yet.
@l2isbad I've changed it from 5 to 10 and I'll let you know if it happens again. It's not easily reproducible, so I just have to wait.
Looking at chronyc: EINVAL can also occur if select's timeout value is negative. That can happen in chronyc:
timeout = initial_timeout / 1000.0 * (1U << (n_attempts - 1)) -
UTI_DiffTimespecsToDouble(&ts_now, &ts_start);
Chronyc's timeout value increases geometrically with each attempt, but if the send timeouts are on the order of netdata's chrony poll rate, or n_attempts exceeds the bit width of an int, we have 100% CPU, permanently.
Yet, I have no idea how such a scenario could happen.
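The arithmetic in the quoted expression can be sketched as follows. This is a simplified Python model, not chronyc's code: attempt_timeout mirrors the expression, elapsed_s stands in for the difference between ts_now and ts_start, and clamped_timeout is a hypothetical guard rather than chronyc's actual fix. Python integers do not overflow, so the n_attempts overflow case is not modelled here.

```python
def attempt_timeout(initial_timeout_ms, n_attempts, elapsed_s):
    """Budget for the current attempt minus the time already elapsed, in seconds."""
    budget_s = initial_timeout_ms / 1000.0 * (1 << (n_attempts - 1))
    return budget_s - elapsed_s


def clamped_timeout(initial_timeout_ms, n_attempts, elapsed_s):
    """Hypothetical guard: never hand a negative timeout to select()."""
    return max(0.0, attempt_timeout(initial_timeout_ms, n_attempts, elapsed_s))


# The budget doubles with every attempt: 1 s, 2 s, 4 s, ...
print(attempt_timeout(1000, 1, 0.25))
print(attempt_timeout(1000, 3, 1.0))

# Once the elapsed time exceeds the budget, the result turns negative.
print(attempt_timeout(1000, 2, 82087.0))
print(clamped_timeout(1000, 2, 82087.0))
```

The strace lines earlier in the thread show a timeout argument of {-82087, ...}, which is consistent with a negative value reaching select() on every iteration.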
Also, if this happens again, is there more information in log-files (/var/log/chrony)?
How do I disable netdata's /etc/netdata/python.d/chrony.conf? It's not very clear to me.
@domschl There is nothing logged in /var/log/chrony, and chronyd journal once a day says:
Oct 02 01:47:18 terran chronyd[1073]: Can't synchronise: no majority
and it's around the same time every day. (+/- 5 mins) It proceeds later on but maybe chronyc hits this and somehow stalls with 100% cpu. I don't have any other debugging information as with 10 seconds interval I haven't seen it hit 100% cpu yet. I'm running a parallel check like:
export count=0; while true; do let count+=1; chronyc -n tracking > /dev/null; echo $count; sleep 5; done
to see if interval has anything to do with the lockup. If the counter stops I know it's blocking. Only other thing I checked was the stack which was empty.
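The counter loop above only reveals a stall after the fact. A collector that has to keep forking chronyc could instead put a deadline on each call, so a spinning process gets killed rather than left running. Here is a minimal sketch using Python's subprocess module. The helper run_with_deadline and the 2-second value are assumptions for illustration, not what the netdata plugin does.

```python
import subprocess


def run_with_deadline(cmd, timeout_s):
    """Run cmd and return its stdout, or None if it fails or exceeds timeout_s."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            universal_newlines=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None  # subprocess.run() has already killed the child
    except OSError:
        return None  # for example, the binary is not installed
    if result.returncode != 0:
        return None
    return result.stdout


# Hypothetical usage; the command line is the one seen in the top output.
# report = run_with_deadline(["/usr/bin/chronyc", "-n", "tracking"], timeout_s=2.0)
```

This does not fix the underlying loop in chronyc; it only limits how long a stuck process can burn CPU.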
@kmai007 edit conf.d/python.d.conf like in https://github.com/firehol/netdata/pull/2834/commits/6d8bbdc901e129508b489deb3c018d811e29c5c8
I too am seeing the same 100% CPU usage from chrony.
I'm using netdata version 1.7.0-211-gaebbd496_rolling, commit aebbd49.
Edit /etc/netdata/python.d.conf and set chrony: no.
If you update your netdata, it is disabled by default.
Which distro do you use?
rhel
On Oct 6, 2017 11:41 AM, "Costa Tsaousis" notifications@github.com wrote:
edit /etc/netdata/python.d.conf and set chrony: no. If you update your netdata, it is disabled by default https://github.com/firehol/netdata/blob/223f504a6d7277bcfc5dcae244ce70991b8179c4/conf.d/python.d.conf#L33 .
Which distro do you use?
Currently netdata team doesn't have enough capacity to work on this issue. We will be more than glad to accept a pull request with a solution to problem described here. This issue will be closed after another 60 days of inactivity.
Closing this feature request. We will re-evaluate the request internally.