chrony to SocketService

ktsaou commented 7 years ago

We have added chrony support to netdata.

It seems that most newer systems run chrony, so this is good.

However, we have implemented it with ExecutableService, which results in one fork per second on most systems.
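For context, a fork-based collector of this kind boils down to running `chronyc -n tracking` and parsing its text report. Below is a minimal sketch of the parsing half only. The helper name `parse_tracking`, the `NUMERIC_LABELS` set and the sample values are made up for illustration; this is not netdata's actual plugin code.

```python
# Illustrative sample of a `chronyc -n tracking` report (values are made up).
SAMPLE_REPORT = """\
Reference ID    : CB00710F (foo.example.net)
Stratum         : 3
Ref time (UTC)  : Fri Jan 27 09:49:17 2017
System time     : 0.000006523 seconds slow of NTP time
Last offset     : -0.000006747 seconds
RMS offset      : 0.000035822 seconds
Frequency       : 3.225 ppm slow
Residual freq   : -0.000 ppm
Skew            : 0.129 ppm
Root delay      : 0.013639022 seconds
Root dispersion : 0.001100737 seconds
Update interval : 64.2 seconds
Leap status     : Normal
"""

# Labels whose value starts with a number (assumption based on the report format).
NUMERIC_LABELS = {
    "Stratum", "System time", "Last offset", "RMS offset", "Frequency",
    "Residual freq", "Skew", "Root delay", "Root dispersion", "Update interval",
}


def parse_tracking(report):
    """Return {label: value}; numeric fields become floats, the rest stay text.

    The "slow"/"fast" suffixes are ignored here, so signs are not adjusted.
    """
    parsed = {}
    for line in report.splitlines():
        label, sep, value = line.partition(":")
        if not sep:
            continue  # not a "label : value" line
        label, value = label.strip(), value.strip()
        if label in NUMERIC_LABELS:
            try:
                parsed[label] = float(value.split()[0])
                continue
            except (IndexError, ValueError):
                pass  # malformed number: fall through and keep the raw text
        parsed[label] = value
    return parsed
```

The cost being discussed is not this parsing but the process spawn that produces the text on every collection cycle.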

I see in the code of chrony, the tracking request is relatively simple:

static int
process_cmd_tracking(char *line)
{
  CMD_Request request;
  CMD_Reply reply;
  IPAddr ip_addr;
  uint32_t ref_id;
  char name[50];
  struct timespec ref_time;

  request.command = htons(REQ_TRACKING);
  if (!request_reply(&request, &reply, RPY_TRACKING, 0))
    return 0;

  ref_id = ntohl(reply.data.tracking.ref_id);

  UTI_IPNetworkToHost(&reply.data.tracking.ip_addr, &ip_addr);
  format_name(name, sizeof (name), sizeof (name),
              ip_addr.family == IPADDR_UNSPEC, ref_id, &ip_addr);

  UTI_TimespecNetworkToHost(&reply.data.tracking.ref_time, &ref_time);

  print_report("Reference ID    : %R (%s)\n"
               "Stratum         : %u\n"
               "Ref time (UTC)  : %T\n"
               "System time     : %.9O of NTP time\n"
               "Last offset     : %+.9f seconds\n"
               "RMS offset      : %.9f seconds\n"
               "Frequency       : %.3F\n"
               "Residual freq   : %+.3f ppm\n"
               "Skew            : %.3f ppm\n"
               "Root delay      : %.9f seconds\n"
               "Root dispersion : %.9f seconds\n"
               "Update interval : %.1f seconds\n"
               "Leap status     : %L\n",
               (unsigned long)ref_id, name,
               ntohs(reply.data.tracking.stratum),
               &ref_time,
               UTI_FloatNetworkToHost(reply.data.tracking.current_correction),
               UTI_FloatNetworkToHost(reply.data.tracking.last_offset),
               UTI_FloatNetworkToHost(reply.data.tracking.rms_offset),
               UTI_FloatNetworkToHost(reply.data.tracking.freq_ppm),
               UTI_FloatNetworkToHost(reply.data.tracking.resid_freq_ppm),
               UTI_FloatNetworkToHost(reply.data.tracking.skew_ppm),
               UTI_FloatNetworkToHost(reply.data.tracking.root_delay),
               UTI_FloatNetworkToHost(reply.data.tracking.root_dispersion),
               UTI_FloatNetworkToHost(reply.data.tracking.last_update_interval),
               ntohs(reply.data.tracking.leap_status), REPORT_END);

  return 1;
}

So, it would be best to convert the chrony plugin to SocketService. This would also allow a single netdata instance to monitor many (local and remote) chrony servers.

cc: @l2isbad @domschl @fooltux

domschl commented 7 years ago

I had a quick look at the chrony code, and it seems that we would need to implement their protocol version handling. The tracking structure has been modified in a number of protocol versions, if I understood that correctly.

ktsaou commented 7 years ago

You are right. It is probably too complex. As a workaround, I have lowered the data collection frequency to once every 5 seconds by default, with #2730.

domschl commented 7 years ago

Good. From my latest tests, I don't think we lose any information by this, and it will keep the load down even on embedded systems.

domschl commented 7 years ago

One way towards a socket-based implementation could be PyPI's ntplib, which is also compatible with chrony but seems to implement only protocol version 2 (we are currently at v6, and v4 and v5 added useful statistics). With an unmodified ntplib, only precision, root delay and root dispersion would be available. I will run some tests with this.

nocturo commented 7 years ago

I updated my 1.6 instance to 1.8, then to master, and now from time to time I see chronyc using 100% CPU doing a ton of select() calls:

select(4, [3], [], [], {-82087, 385105}) = -1 EINVAL (Invalid argument)
select(4, [3], [], [], {-82087, 385087}) = -1 EINVAL (Invalid argument)
select(4, [3], [], [], {-82087, 385069}) = -1 EINVAL (Invalid argument)

and it goes on.

Top says: 27976 netdata 20 0 18252 904 732 R 99.7 0.0 458:45.00 /usr/bin/chronyc -n tracking

chrony-3.1-2.el7.centos.x86_64

Not sure if this is the proper issue to report this to, but since the support is new, I might as well :)

ktsaou commented 7 years ago

@nocturo this seems like a bug in chronyc. I suggest disabling chrony monitoring in netdata in the meantime.

@l2isbad I am thinking of disabling this plugin by default. I really don't like this happening. What do you think?

@nocturo make sure your centos is updated. If it is, we should really disable chrony in netdata by default.

ilyam8 commented 7 years ago

Sounds reasonable. What about changing update_every? @nocturo please change update_every to 10 and see if the problem remains.

ktsaou commented 7 years ago

> What about changing update_every? @nocturo please change update_every to 10 and see if the problem remains.

Hm... I don't think that will fix the EINVAL. There is a faulty socket setup in chrony. From man select:

EINVAL nfds is negative or exceeds the RLIMIT_NOFILE resource limit (see getrlimit(2)).

man getrlimit says:

       RLIMIT_NOFILE
              Specifies a value one greater than the maximum file descriptor number that can be opened by this process.
              Attempts (open(2), pipe(2), dup(2), etc.)  to exceed this limit yield the error  EMFILE.   (Historically,
              this limit was named RLIMIT_OFILE on BSD.)

So, either chronyc passes a negative nfds, or it opens so many files that it exceeds its resource limit (1024 files per process on most systems).

Ideally chronyc would break out of the loop when that happens, but it does not.

nocturo commented 7 years ago

@ktsaou Everything is up2date according to distro. There is a newer 3.2 chrony package that was released not long ago but it's not been packaged yet.

@l2isbad I've changed it from 5 to 10 and I'll let you know if it happens again. It's not easily reproducible, so I just have to wait.

domschl commented 7 years ago

Looking at chronyc: EINVAL can also occur if select's timeout value is negative. That can happen in chronyc:

- if the number of repeated send attempts to chrony exceeds the number of bits in an unsigned int, or
- if send() for some reason takes more than one second (first attempt) or a couple of seconds (on retry) [while using 100% cpu].

In both cases, the chronyc timeout code is not safe:
```
timeout = initial_timeout / 1000.0 * (1U << (n_attempts - 1)) -
          UTI_DiffTimespecsToDouble(&ts_now, &ts_start);
```
chronyc's timeout value grows geometrically, but if the send() delays are on the order of netdata's chrony poll rate, or n_attempts exceeds the bit width of an unsigned int, the computed timeout goes negative, select() fails immediately with EINVAL, and we have 100% CPU, permanently.
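The arithmetic is easy to check in isolation. The sketch below mirrors the quoted expression in Python and adds a clamped variant. The function names, the 1000 ms initial timeout and the `max_shift` cap are assumptions for illustration, not values taken from chronyc.

```python
def chronyc_style_timeout(initial_timeout_ms, n_attempts, elapsed_s):
    """Mirror of the quoted expression.

    Python ints do not overflow, so this only reproduces the
    "elapsed time exceeds the budget" case. In C, shifting 1U by the
    bit width of unsigned int or more is undefined behaviour.
    """
    return initial_timeout_ms / 1000.0 * (1 << (n_attempts - 1)) - elapsed_s


def safe_timeout(initial_timeout_ms, n_attempts, elapsed_s, max_shift=16):
    """Same formula, with the shift capped and the result clamped at zero."""
    shift = min(max(n_attempts - 1, 0), max_shift)
    return max(0.0, initial_timeout_ms / 1000.0 * (1 << shift) - elapsed_s)


# First attempt, 1 s budget, 0.25 s already spent: 0.75 s left.
print(chronyc_style_timeout(1000, 1, 0.25))
# First attempt, but 1.5 s already spent: negative, which select() rejects.
print(chronyc_style_timeout(1000, 1, 1.5))
# The clamped variant never hands select() a negative value.
print(safe_timeout(1000, 1, 1.5))
```

A clamp alone only turns the busy loop into immediate timeouts; the caller still needs to stop after a bounded number of attempts.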

Yet, I have no idea how such a scenario could happen.

Also, if this happens again, is there more information in log-files (/var/log/chrony)?

kmai007 commented 7 years ago

How do I disable netdata/etc/netdata/python.d/chrony.conf? It's not very clear to me.

nocturo commented 7 years ago

@domschl There is nothing logged in /var/log/chrony, and once a day the chronyd journal says: Oct 02 01:47:18 terran chronyd[1073]: Can't synchronise: no majority

It happens around the same time every day (+/- 5 minutes). It carries on afterwards, but maybe chronyc hits this and somehow stalls at 100% CPU. I don't have any other debugging information, because with the 10-second interval I haven't seen it hit 100% CPU yet. I'm running a parallel check like: export count=0; while true; do let count+=1;chronyc -n tracking > /dev/null; echo $count;sleep 5;done

to see if the interval has anything to do with the lockup. If the counter stops, I know it's blocking. The only other thing I checked was the stack, which was empty.

@kmai007 edit conf.d/python.d.conf like in https://github.com/firehol/netdata/pull/2834/commits/6d8bbdc901e129508b489deb3c018d811e29c5c8

kmai007 commented 7 years ago

I too am seeing the same 100% CPU usage from chrony.

I'm using netdata version 1.7.0-211-gaebbd496_rolling (commit aebbd49).

ktsaou commented 7 years ago

Edit /etc/netdata/python.d.conf and set chrony: no. If you update netdata, chrony is now disabled by default.

Which distro do you use?

kmai007 commented 7 years ago

rhel

stale[bot] commented 5 years ago

Currently netdata team doesn't have enough capacity to work on this issue. We will be more than glad to accept a pull request with a solution to problem described here. This issue will be closed after another 60 days of inactivity.

ilyam8 commented 3 years ago

Closing this feature request. We will re-evaluate the request internally.

chrony to SocketService #2724