Closed hessam61 closed 4 years ago
Consul reported the OMS container as unreachable, but the TCP port was open:
[hessam@uphsvlndc145 ~]$ netstat -tulpn | grep 2522
(No info could be read for "-p": geteuid()=3003 but you should be root.)
tcp 0 0 170.166.23.33:25225 0.0.0.0:* LISTEN -
udp 0 0 170.166.23.33:25225 0.0.0.0:* -
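A listening socket in netstat only shows that something is bound to the port on the host; it does not show that the port is reachable from where Consul runs. A connect test along the following lines could be run from another node to check that. This is a minimal sketch: the helper name tcp_port_open is hypothetical, and it assumes the Consul check for this service is a plain TCP connect, which the output above does not confirm.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it only attempts a TCP handshake
    and does not speak any application protocol.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example call, using the address and port from the netstat output above:
#   tcp_port_open("170.166.23.33", 25225)
```

If this returns False from the Consul agent's host while netstat still shows LISTEN locally, the problem is more likely on the network path than in the container.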
I couldn't figure out what was wrong with the container running on host uphsvlndc145, so I allocated more RAM to the job and restarted it. We didn't lose any logs: everything from the past 24 hours was still available, and the OMS container shipped it all to ALO at once. TimeGenerated shows 02/04 for these entries because that is when Azure received them, whereas the timestamp inside LogEntry is when the log was generated in Docker. This is another reason to derive the timestamp from LogEntry.
ContainerLog | sort by TimeGenerated desc | where Computer == "uphsvlndc145"
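As a rough illustration of deriving the event time from LogEntry instead of TimeGenerated, the sketch below parses a leading timestamp out of the log line and falls back to the ingestion time when none is found. The function name event_time is hypothetical. The sketch also assumes that each LogEntry starts with a timestamp of the form YYYY-MM-DD HH:MM:SS (or with a T separator) and that it is in UTC; the real log format is not shown in this issue, so the pattern would need adjusting.

```python
import re
from datetime import datetime, timezone

# ASSUMPTION: LogEntry begins with "YYYY-MM-DD HH:MM:SS" or "YYYY-MM-DDTHH:MM:SS".
# The actual container log format is not shown in this issue.
_LEADING_TS = re.compile(r"^(\d{4}-\d{2}-\d{2})[T ](\d{2}:\d{2}:\d{2})")


def event_time(log_entry: str, time_generated: datetime) -> datetime:
    """Return the timestamp embedded in log_entry, else time_generated.

    Hypothetical helper: time_generated stands for the Azure ingestion time
    (the TimeGenerated column), used only as a fallback.
    """
    match = _LEADING_TS.match(log_entry)
    if match is None:
        return time_generated
    parsed = datetime.strptime(
        f"{match.group(1)} {match.group(2)}", "%Y-%m-%d %H:%M:%S"
    )
    # ASSUMPTION: the embedded timestamp is in UTC.
    return parsed.replace(tzinfo=timezone.utc)
```

With something like this applied to exported rows, entries that were shipped in one batch after the restart would sort by when they were produced rather than by when Azure received them.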
Closing this for now.
At 1:30am the following error messages were captured in the consuld Slack channel:
On Feb 3rd, 2020 only system monitoring containers and vent monitoring apps were running on this node. The last log message for the cAdvisor container on Azure is:
Query:
Last log:
But the last log from the container itself, obtained by running nomad logs -stderr df77a71e cadvisor, is:
The same thing can be reproduced by fetching logs for monitor-mar:
From Azure, query:
Latest log:
Note that 6:28 is UTC time here.
From the Docker container itself:
The microsoft/oms container on that node has been running since Jan 13th, and there is no indication that the container restarted due to any problem:
And there are no stderr logs for that container:
This could be due to a network error.