Closed lgayatri closed 6 years ago
@lgayatri We could do with the container-logs bundle to actually check cVM state and what happened inside them.
The following container IDs are reported as up, but we have die events for them in the personality log (@cgtexmex another instance of concurrent modification not handled in busy VC environment):
189f6c9a8e06
ea7b738d181b
f7fc61ed213e - Failed to commit handle after network unbind for container
932265469db1
0dfa599bafa7
b5adf49b2d08 - Failed to commit handle after network unbind for container
870e1cfcd026
We do not have die events for these two containers - if these are still running then we need to figure out why sleep didn't exit. If they are not we need to find out why we missed the events (too many events at once overwhelmed our collector?):
16e435f7ca00
2a09218d0378
Interesting snippet - I think this is during docker run
- it's also suggesting it took almost 1min to start and the change version was modified. Probably got vMotioned almost immediately which is odd - why isn't DRS power on placement making better decisions?
Sep 11 2017 08:20:47.921Z INFO set log file name to: [CPBU_2TB_01] e44b48a760aa794b3fee0d4dd004ee7f2f8d5054611aea3b3ac2ce8f9cc9983b/tether.debug
Sep 11 2017 08:20:47.922Z INFO set log file name to: [CPBU_2TB_01] e44b48a760aa794b3fee0d4dd004ee7f2f8d5054611aea3b3ac2ce8f9cc9983b/output.log
Sep 11 2017 08:20:53.996Z INFO Reconfigure: attempting update to e44b48a760aa794b3fee0d4dd004ee7f2f8d5054611aea3b3ac2ce8f9cc9983b with change version "2017-09-11T08:20:50.617682Z" (poweredOff)
Sep 11 2017 08:20:54.979Z INFO Reconfigure: committed update to e44b48a760aa794b3fee0d4dd004ee7f2f8d5054611aea3b3ac2ce8f9cc9983b with change version: 2017-09-11T08:20:50.617682Z
Sep 11 2017 08:21:53.254Z WARN container and handle ChangeVersions do not match: 2017-09-11T08:20:55.136673Z != 2017-09-11T08:20:50.617682Z
@sflxn this might be related to #6850 #6851. please take a look and let us know.
I've done some testing on our event handling and I'm thinking that this could be related to missed events.
The 50 containers would all be exiting at roughly the same time which would cause a mini-event storm. That storm would over run our current event handling implementation causing us to miss the poweredOff event from vSphere.
The long term fix is to rework our event handling (which we discussed earlier this year), but an interim fix is increase the pageSize
of the event collector from 25 to the vSphere default 1000.
If #6851 was executed in the order of this comment then the event storm could be the reason for those failures. That's because the powerOff event would be missed -- when that event is handled we release the IPs, so if that's missed IPs are orphaned. The subsequent container starts would attempt to use the IP that according to the portlayer is still in use.
@cgtexmex That makes a lot of sense. Unfortunately, we have so little logging around this part of the networking code that there is no way to verify whether we're missing the events in #6851. I've also reproduced a case of #6851 where I do not believe missing events is the cause. Two containers come in at the same time, one gets the IP, the other gets rejected. Both have the same IP in their tether logs.
@sflxn yes..the logging in network is not the greatest -- but there should be log output for all the containers that were stopped by an Event. If you know a containerVM should've been stopped by an event and there's no portLayer log indicating that then that's a miss. The log line would be something like this:
Dec 8 2017 22:01:06.630Z DEBUG op=298.30: Container(ecab92cdc050433b1c74bb8f9603341dbe2f20a7cc5b99454f4af53aa577ed7c) state set to Stopped via event activity
So for each of the containers that were started with sleep
-- there should be a portLayer log entry like the one above.
Unless there are more than one example of this, the page size isn't the cause. I've been looking at the logs around events. Here's snippets from the portlayer and persona:
portlayer
PoweredOff Event
----------------
Dec 11 2017 00:13:37.091Z DEBUG op=293.7184: [NewOperation] op=293.7184 [vic/lib/portlayer/exec.(*Container).OnEvent:715]
Dec 11 2017 00:13:37.091Z DEBUG [BEGIN] op=293.7184 [vic/lib/portlayer/exec.(*Container).OnEvent:716] eventID(2252866) received for event: PoweredOff
Dec 11 2017 00:13:37.091Z DEBUG [BEGIN] [vic/lib/portlayer/exec.eventedState:689] event PoweredOff received for id: 2252866
Dec 11 2017 00:13:37.091Z DEBUG [ END ] [vic/lib/portlayer/exec.eventedState:689] [9.823µs] event PoweredOff received for id: 2252866
Dec 11 2017 00:13:37.091Z DEBUG [BEGIN] op=293.7184 [vic/lib/portlayer/exec.(*containerBase).updates:119] 4ed574d44bb6c034bd82ad8f97d05b36de8204697d658f5d060eba5126ba8a1a
Dec 11 2017 00:13:37.094Z DEBUG op=293.7184: Update: for 4ed574d44bb6c034bd82ad8f97d05b36de8204697d658f5d060eba5126ba8a1a, refreshing from change version 2017-12-10T23:55:28.144836Z
Dec 11 2017 00:13:37.094Z DEBUG [BEGIN] op=293.7184 [vic/pkg/vsphere/vm.(*VirtualMachine).Properties:519] VM(VirtualMachine:vm-26341) Properties([config runtime])
Dec 11 2017 00:13:37.094Z DEBUG op=293.7184: properties: [config runtime summary.runtime.connectionState]
Dec 11 2017 00:13:37.138Z DEBUG [ END ] op=293.7184 [vic/pkg/vsphere/vm.(*VirtualMachine).Properties:519] [43.528436ms] VM(VirtualMachine:vm-26341) Properties([config runtime])
Dec 11 2017 00:13:37.143Z DEBUG op=293.7184: Update: for 4ed574d44bb6c034bd82ad8f97d05b36de8204697d658f5d060eba5126ba8a1a, change version 2017-12-11T00:11:55.163014Z, extraconfig id: 4ed574d44bb6c034bd82ad8f97d05b36de8204697d658f5d060eba5126ba8a1a
Dec 11 2017 00:13:37.144Z DEBUG [ END ] op=293.7184 [vic/lib/portlayer/exec.(*containerBase).updates:119] [52.581331ms] 4ed574d44bb6c034bd82ad8f97d05b36de8204697d658f5d060eba5126ba8a1a
Dec 11 2017 00:13:37.144Z DEBUG Setting container 4ed574d44bb6c034bd82ad8f97d05b36de8204697d658f5d060eba5126ba8a1a state: Stopped
OnStop() called from OnEvent()
------------------------------
Dec 11 2017 00:13:37.144Z DEBUG Container(4ed574d44bb6c034bd82ad8f97d05b36de8204697d658f5d060eba5126ba8a1a) closing 0 log followers
Dec 11 2017 00:13:37.144Z DEBUG op=293.7184: Container(4ed574d44bb6c034bd82ad8f97d05b36de8204697d658f5d060eba5126ba8a1a) state set to Stopped via event activity
Sending Event to Persona
------------------------
Dec 11 2017 00:13:37.144Z DEBUG Container (4ed574d44bb6c034bd82ad8f97d05b36de8204697d658f5d060eba5126ba8a1a) publishing event Stopped from event PoweredOff
Dec 11 2017 00:13:37.144Z DEBUG [ END ] op=293.7184 [vic/lib/portlayer/exec.(*Container).OnEvent:716] [52.807255ms] eventID(2252866) received for event: PoweredOff
persona
Dec 11 2017 00:13:37.144Z DEBUG [BEGIN] [vic/lib/apiservers/engine/backends.DockerEventPublisher.PublishEvent:220] Event Monitor received eventID(293.7184) for container(4ed574d44bb6) - PoweredOff
Dec 11 2017 00:13:37.144Z DEBUG [ END ] [vic/lib/apiservers/engine/backends.DockerEventPublisher.PublishEvent:220] [49.495µs] Event Monitor received eventID(293.7184) for container(4ed574d44bb6) - PoweredOff
btw, this is really easy to reproduce.
The one thing that I discovered that may lead to the root cause is that if a container shows it's still running after docker ps, I can "correct" it by performing a docker inspect on it.
I've verified the above steps are reproducible on several "still running" containers.
@sflxn inspect
will force a refresh from vSphere on each call, while ps
trusts the vic cache to be correct...this is typically a sign that we've missed events...
@sflxn do we have all the logs for these containers? The portLayer log in the bundle seems pretty small and one of the containers listed as powered on (870e1cfcd026) only has a couple entries...
@cgtexmex We have logs. I'll update a few more today. I found yet another case of this issue. Again, nothing to do with page size. This second scenario is a little easier to track, and I might have a fix for it.
@mdubya66
root@Ubuntu-VIc:/home/vmlib/VIC/VIC-1.3/vic# docker -H 10.197.37.142:2376 --tls ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e0e6fe4a1928 photon "sleep 45m" About an hour ago Up About an hour focused_mestorf
fe1bd8437e25 photon "sleep 45m" About an hour ago Up About an hour kickass_wescoff
004d2f2d54dc photon "sleep 45m" About an hour ago Up About an hour hardcore_mcclintock
0436b7127e1e photon "sleep 45m" About an hour ago Up About an hour eager_visvesvaraya
6226fd88afdc photon "sleep 45m" About an hour ago Up About an hour sad_cray
c4c9bfa01f24 photon "sleep 45m" About an hour ago Up About an hour heuristic_curie
1b981a2e74c3 photon "sleep 45m" About an hour ago Up About an hour determined_cray
ac463e8f4508 photon "sleep 45m" About an hour ago Up About an hour peaceful_feynman
61905273edce photon "sleep 45m" About an hour ago Up About an hour angry_albattani
cb8a11113363 photon "sleep 45m" About an hour ago Up About an hour adoring_swanson
Seen in RC3 as well. About 108 containers did not exit and are running for more than 45min.
@cgtexmex I looked at the second scenario I saw again with a fresh mind, and I think it may actually be due to missing events as the one in question never got a powered off event.
However, the issue around the first scenario that I saw where we get the event and sent it to the persona is still a scenario we should look at. The logs for the portlayer and persona are attached. The container I looked at was box1.
log.tar.gz box1-4ed574d44bb6.tar.gz hostd.tar.gz
This was also one of the containers that I performed the docker ps, ps, inspect, ps sequence on.
@sflxn what was done w/box1 and what did you observe w/box1? I've done a quick search through 3 portLayer logs and see starts for that VM, but no powerOff event.
per @mlh78750 marking this as a potential ship stopper
@sflxn @cgtexmex may want to look through hostd/vpxd logs as well to see if an event was generated, and did not make it through our eventing layers. The event code in govmomi may also have a bug that could lead to dropped events.
@cgtexmex Do you not see the following in portlayer logs?
PoweredOff Event
----------------
I marked up the portlayer logs. It's the same as what I outlined above, where I posted snippets from portlayer and persona. What I see is events getting handled but docker ps returning stale info. I started looking at why docker ps was getting what seemed to be older data. After I added more logging around that area, I wasn't able to reproduce this same scenario again. That was only 1 or 2 attempts after I added more logging. I'm sure with more time, I can reproduce it.
How to reproduce:
Container:
docker -H $1 --tls create --name box$2 --cpuset-cpus 1 photon sleep 5m
If you want, I can give you my create script. It's very simple, creates 400 containers in photon and sleep for 5 mins.
@cgtexmex I pulled in your page size PR and I no longer see this issue on Lakshmi's infrastructure. This issue may be resolved. I've deployed at least 3 VCH with it and have not seen it again.
can we give @lgayatri a dev build that has the fix so she can re-run her tests?
@mdubya66 @sflxn On the given dev build all 50 containers exited after completing sleep 30m with some difference in timing.
root@Ubuntu-VIc:/home/vmlib/VIC/Loc/bin# docker -H 10.197.37.131:2376 --tls ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
01419828cd0b photon "sleep 30m" 55 minutes ago Exited (0) 25 minutes ago modest_booth
751de1ac8f4a photon "sleep 30m" 55 minutes ago Exited (0) 21 minutes ago kickass_ardinghelli
bc4b0b91b435 photon "sleep 30m" 55 minutes ago Exited (0) 25 minutes ago elegant_perlman
32a00d77cd95 photon "sleep 30m" 55 minutes ago Exited (0) 25 minutes ago amazing_leakey
78bdb0b5b70d photon "sleep 30m" 55 minutes ago Exited (0) 21 minutes ago peaceful_davinci
b9261f524af8 photon "sleep 30m" 55 minutes ago Exited (0) 25 minutes ago epic_mccarthy
537f3b65919d photon "sleep 30m" 55 minutes ago Exited (0) 25 minutes ago flamboyant_nightingale
f7b4e7068ed0 photon "sleep 30m" 55 minutes ago Exited (0) 21 minutes ago quizzical_nightingale
b71a9ae6d7d7 photon "sleep 30m" 55 minutes ago Exited (0) 25 minutes ago elegant_kowalevski
16ebbe247fc4 photon "sleep 30m" 55 minutes ago Exited (0) 26 minutes ago reverent_yonath
f26ad690328a photon "sleep 30m" 55 minutes ago Exited (0) 25 minutes ago compassionate_sammet
e6aeec07d4dc photon "sleep 30m" 55 minutes ago Exited (0) 25 minutes ago eager_kilby
18c04f83b510 photon "sleep 30m" 55 minutes ago Exited (0) 21 minutes ago sharp_varahamihira
b11757710a7e photon "sleep 30m" 55 minutes ago Exited (0) 26 minutes ago xenodochial_murdock
49fc888e3075 photon "sleep 30m" 55 minutes ago Exited (0) 25 minutes ago mystifying_khorana
f967fa5dafcb photon "sleep 30m" 55 minutes ago Exited (0) 25 minutes ago inspiring_joliot
580cb0ac91aa photon "sleep 30m" 55 minutes ago Exited (0) 21 minutes ago dazzling_kilby
9d4962ad6c2f photon "sleep 30m" 55 minutes ago Exited (0) 25 minutes ago objective_saha
8d4b65391b19 photon "sleep 30m" 55 minutes ago Exited (0) 21 minutes ago thirsty_wiles
4a6bcdcabbf2 photon "sleep 30m" 55 minutes ago Exited (0) 25 minutes ago boring_brahmagupta
9bb37ce2ea91 photon "sleep 30m" 55 minutes ago Exited (0) 21 minutes ago ecstatic_keller
fcaa21b862a7 photon "sleep 30m" 55 minutes ago Exited (0) 21 minutes ago stoic_montalcini
3064ddb8b4fb photon "sleep 30m" 55 minutes ago Exited (0) 25 minutes ago amazing_ardinghelli
e45a4be818a3 photon "sleep 30m" 55 minutes ago Exited (0) 21 minutes ago mystifying_lumiere
8fe33477974e photon "sleep 30m" 55 minutes ago Exited (0) 26 minutes ago gracious_babbage
f5ad666a3c3d photon "sleep 30m" 55 minutes ago Exited (0) 25 minutes ago determined_gates
844d81e477f6 photon "sleep 30m" 55 minutes ago Exited (0) 25 minutes ago peaceful_jang
35e0dcb1b75e photon "sleep 30m" 55 minutes ago Exited (0) 25 minutes ago agitated_heisenberg
28b714d5611c photon "sleep 30m" 55 minutes ago Exited (0) 25 minutes ago cocky_shockley
668e8345f0e9 photon "sleep 30m" 55 minutes ago Exited (0) 25 minutes ago wonderful_franklin
8ed029ecabae photon "sleep 30m" 55 minutes ago Exited (0) 21 minutes ago eager_montalcini
5c79ca8f0fb5 photon "sleep 30m" 55 minutes ago Exited (0) 25 minutes ago xenodochial_shockley
0214f35906b8 photon "sleep 30m" 55 minutes ago Exited (0) 26 minutes ago reverent_albattani
024ad57efb78 photon "sleep 30m" 55 minutes ago Exited (0) 25 minutes ago sleepy_nobel
cc0672b63503 photon "sleep 30m" 55 minutes ago Exited (0) 26 minutes ago objective_shirley
12f54b9ebc08 photon "sleep 30m" 55 minutes ago Exited (0) 25 minutes ago cranky_shirley
ab56f640d7a7 photon "sleep 30m" 55 minutes ago Exited (0) 25 minutes ago nifty_varahamihira
edec7f3046f8 photon "sleep 30m" 55 minutes ago Exited (0) 21 minutes ago hopeful_lamport
7791e4f2767a photon "sleep 30m" 55 minutes ago Exited (0) 25 minutes ago nervous_pare
e41e3285ec10 photon "sleep 30m" 55 minutes ago Exited (0) 25 minutes ago thirsty_davinci
d9cad9f81f87 photon "sleep 30m" 55 minutes ago Exited (0) 25 minutes ago affectionate_wright
6e551ad503ea photon "sleep 30m" 55 minutes ago Exited (0) 25 minutes ago sad_haibt
3a24d9405d64 photon "sleep 30m" 55 minutes ago Exited (0) 25 minutes ago practical_rosalind
bd8fc43c42ac photon "sleep 30m" 55 minutes ago Exited (0) 26 minutes ago loving_knuth
e917404575f5 photon "sleep 30m" 55 minutes ago Exited (0) 21 minutes ago tender_khorana
8bb3ea5aa324 photon "sleep 30m" 55 minutes ago Exited (0) 26 minutes ago objective_lovelace
240e86cf76da photon "sleep 30m" 55 minutes ago Exited (0) 25 minutes ago angry_perlman
c524a9a24aa5 photon "sleep 30m" 55 minutes ago Exited (0) 25 minutes ago youthful_bassi
1dcf053cfdd2 photon "sleep 30m" 55 minutes ago Exited (0) 25 minutes ago practical_bhabha
b9a3efeb898f photon "sleep 30m" 55 minutes ago Exited (0) 21 minutes ago zen_ride
429af4f6ad2f photon "sleep 30m" 58 minutes ago Exited (0) 29 minutes ago adoring_hypatia
based on the positive testing I'm closing as being fixed by #6937
vic-version : v1.2.0-rc5 I started command sleep 5d on photon container. The 5d period completed 43hours back but still I am seeing few containers running and up since 6 days.
Pasting the output: