pennsignals / legacy-system-services


Node uphsvlndc145 stopped shipping logs to Azure #3

Closed hessam61 closed 4 years ago

hessam61 commented 4 years ago

At 1:30 AM, the following error messages were captured in the consuld Slack channel:


gentryAPP  1:30 AM
gentry
Azure Log Analytics TCP check
dial tcp 170.166.23.33:25225: connect: connection refused
Service Tag
logging
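For context, the gentry alert above is just a failed TCP dial to the OMS listener. A minimal sketch of an equivalent check (Python, hypothetical; not the actual gentry/consul implementation):

```python
import socket

def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers ConnectionRefusedError, timeouts, etc.
        return False

# Against the dead listener this would fail the same way the alert did:
# tcp_check("170.166.23.33", 25225)
```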

On Feb 3rd, 2020, only system-monitoring containers and vent-monitoring apps were running on this node. The last log message for the cAdvisor container in Azure is:

Query:

ContainerLog | where Computer contains "uphsvlndc145" | where Image contains "cadvisor" | sort by TimeGenerated desc

Last log:

2020-02-03T04:31:13.100837864Z E0203 04:31:13.100680 1 fsHandler.go:121] failed to collect filesystem stats - rootDiskErr: du command failed on /rootfs/var/lib/docker/overlay2/4a5000dca88d46ad76f831a5172d55b1a70301f91f90d96d1945d764cdeba29a/diff with output stdout: , stderr: du: /rootfs/var/lib/docker/overlay2/4a5000dca88d46ad76f831a5172d55b1a70301f91f90d96d1945d764cdeba29a/diff: No such file or directory\n

But the last log from the container itself, obtained by running nomad logs -stderr df77a71e cadvisor, is:

E0203 21:18:04.351036       1 fsHandler.go:121] failed to collect filesystem stats - rootDiskErr: du command failed on /rootfs/var/lib/docker/overlay2/6213e3594a49034e1978fcf48c4245758256c4ab73af27e2d1bc31ccac6c52fc/diff with output stdout: , stderr: du: /rootfs/var/lib/docker/overlay2/6213e3594a49034e1978fcf48c4245758256c4ab73af27e2d1bc31ccac6c52fc/diff: No such file or directory

The same discrepancy can be reproduced by fetching logs for monitor-mar:

From Azure, the query:

ContainerLog | where Computer contains "uphsvlndc145" | where Name contains "monitor-mar" | sort by TimeGenerated desc

Latest log:

2020-02-03T06:28:26.346894763Z 2020-02-03 06:28:26,346 - monitor - INFO - Processed signal_mar_events, 12\n 

Note that 06:28 here is UTC.

From docker container itself:

sudo docker logs 3d2cbd98dd82
2020-02-03 23:06:12,638 - monitor - INFO - Processed signal_mar_events, 0

The microsoft/oms container on that node has been running since Jan 13th, and there's no indication of the container restarting due to any problem:

[screenshot: Screen Shot 2020-02-03 at 6 09 23 PM]

and there are no stderr logs for that container:

[screenshot: Screen Shot 2020-02-03 at 6 10 38 PM]

This could be due to a network error.

hessam61 commented 4 years ago

Consul reported the OMS container as unreachable, but the TCP port was open:

[hessam@uphsvlndc145 ~]$ netstat -tulpn | grep 2522
(No info could be read for "-p": geteuid()=3003 but you should be root.)
tcp        0      0 170.166.23.33:25225     0.0.0.0:*               LISTEN      -                   
udp        0      0 170.166.23.33:25225     0.0.0.0:*                           -   
hessam61 commented 4 years ago

I couldn’t figure out what was going on with that container running on the uphsvlndc145 host, so I decided to allocate more RAM to the job and restart it. We didn’t lose any logs: everything from the past 24 hrs was still available, and the OMS container shipped it all to Azure Log Analytics at once. The TimeGenerated timestamp is 02/04, since that’s when Azure received the logs, but the timestamp in LogEntry is when each log was generated in docker. Another reason to manufacture the timestamp from the LogEntry.

ContainerLog | where Computer == "uphsvlndc145" | sort by TimeGenerated desc
[screenshot: Screen Shot 2020-02-04 at 1 04 56 PM]
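Manufacturing the event time from the LogEntry prefix could look like this (a hypothetical Python sketch, not part of the repo; docker prepends an RFC3339 nanosecond timestamp to each line, as in the cadvisor entry above, while Python's %f only accepts microseconds):

```python
import re
from datetime import datetime, timezone

# docker's line prefix: RFC3339 with nanosecond precision, e.g.
# "2020-02-03T04:31:13.100837864Z ..."
TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.(\d+)Z")

def log_entry_time(entry: str) -> datetime:
    """Recover the docker-side timestamp from a ContainerLog LogEntry."""
    m = TS.match(entry)
    if not m:
        raise ValueError("no RFC3339 prefix: %r" % entry[:40])
    micros = m.group(2)[:6].ljust(6, "0")  # truncate nanoseconds to microseconds
    return datetime.strptime(
        "%s.%s" % (m.group(1), micros), "%Y-%m-%dT%H:%M:%S.%f"
    ).replace(tzinfo=timezone.utc)

# e.g. on the cadvisor line above:
# log_entry_time("2020-02-03T04:31:13.100837864Z E0203 ...")
# -> 2020-02-03 04:31:13.100837+00:00
```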
hessam61 commented 4 years ago

Closing this for now.