vmware / vic

vSphere Integrated Containers Engine is a container runtime for vSphere.
http://vmware.github.io/vic
Other
640 stars 174 forks source link

3-02-Docker-Compose-Voting-App worker dies when we expect it to be up #7728

Closed mhagen-vmware closed 6 years ago

mhagen-vmware commented 6 years ago

Normally the output should be:

9W0BXV7XJBXwAA==Waiting for db
Z6mgmV7XJBXwAA==Waiting for db
NbiI1V7XJBXwAA==Waiting for db
XCx+EV/XJBXwAA==Waiting for db
uvm/Xl/XJBUAAQ==Connected to db
/mmYY1/XJBWgAQ==Found redis at 172.17.0.5
VbDyY1/XJBUwAQ==Connected to redis

When the test fails it ends up with internal worker errors like this:

jif5KZfZJBUQCg==System.AggregateException: One or more errors occurred. (No such device or address) ---> System.Net.Internals.InternalSocketException: No such device or address
5ZX6KZfZJBVwBA==   at System.Net.Dns.HostResolutionEndHelper(IAsyncResult asyncResult)
+lAkKpfZJBUwBA==   at System.Net.Dns.EndGetHostAddresses(IAsyncResult asyncResult)
1FUnKpfZJBWQCg==   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
Sl0nKpfZJBXgAg==   --- End of inner exception stack trace ---
oGInKpfZJBXABQ==   at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
ImgnKpfZJBVgBQ==   at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
xWwnKpfZJBXAAw==   at Npgsql.NpgsqlConnector.Connect(NpgsqlTimeout timeout)
R3EnKpfZJBXAAw==   at Npgsql.NpgsqlConnector.RawOpen(NpgsqlTimeout timeout)
SXYnKpfZJBWQAw==   at Npgsql.NpgsqlConnector.Open(NpgsqlTimeout timeout)
j3onKpfZJBUgBQ==   at Npgsql.ConnectorPool.Allocate(NpgsqlConnection conn, NpgsqlTimeout timeout)
O34nKpfZJBXQAg==   at Npgsql.NpgsqlConnection.OpenInternal()
VYInKpfZJBXwAw==   at Worker.Program.OpenDbConnection(String connectionString)
cYYnKpfZJBWQAg==   at Worker.Program.Main(String[] args)
F4snKpfZJBUgBg==---> (Inner Exception #0) System.Net.Internals.InternalSocketException: No such device or address
3ZAnKpfZJBVwBA==   at System.Net.Dns.HostResolutionEndHelper(IAsyncResult asyncResult)
kpUnKpfZJBUwBA==   at System.Net.Dns.EndGetHostAddresses(IAsyncResult asyncResult)
X5knKpfZJBXQCg==   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)<---
lp4nKpfZJBUQAA==

Which causes the container to exit immediately, so that when we check if it is running later then the test fails because the container has exited.

This has been working in HaaS but fails in Nimbus now. Here is the last failure, but there have been several of these failures before: https://ci-vic.vmware.com/vmware/vic/18318/7

Test-Cases.Group3-Docker-Compose.3-02-Docker-Compose-Voting-App-VCH-18318-7787-container-logs.zip

mhagen-vmware commented 6 years ago

In the working scenario:

Apr 13 2018 00:28:58.319Z DEBUG Found 2 subscribers to 121990: vsphere.VMEvent - worker-993658db8dff on  10.161.149.181 in dc1 is powered on
Apr 13 2018 00:29:03.104Z DEBUG op=347.349: parsing alias :db
Apr 13 2018 00:29:03.104Z DEBUG adding alias votingapp_default:db -> db
Apr 13 2018 00:29:03.104Z DEBUG adding alias votingapp_default:e569ad8957a6 -> db
Apr 13 2018 00:29:05.557Z DEBUG Found 2 subscribers to 122020: vsphere.VMEvent - redis-1bddb2c62ad9 on  10.161.149.181 in dc1 is powered on
Apr 13 2018 00:29:06.311Z DEBUG Found 2 subscribers to 122022: vsphere.VMEvent - db-e569ad8957a6 on  10.161.149.181 in dc1 is powered on
NtvYBl7XJBWAAw==349:M 13 Apr 00:29:32.797 * Ready to accept connections

Apr 13 2018 00:29:32.927Z DEBUG [ END ]  [vic/lib/tether.(*tether).launch:748] [257.358335ms] launching session e569ad8957a68e5a320717cc11be674fc48b67048dbc3c70e8ee9da0f4ab7230
Apr 13 2018 00:29:33.353Z DEBUG [0xc4202d06c0] write "The files belonging to this database system will be owned by user \"postgres\".\nThis user must also own the server process.\n\n" to 2 writers (err: <nil>)
Apr 13 2018 00:29:33.375Z DEBUG [0xc4202d06c0] write "The database cluster will be initialized with locale \"en_US.utf8\".\nThe default database encoding has accordingly been set to \"UTF8\".\nThe default text search configuration will be set to \"english\".\n\nData page checksums are disabled.\n\nfixing permissions on existing directory /var/lib/postgresql/data/data ... " to 2 writers (err: <nil>)

Here I interleaved portlayer, container output and tether. The key event being that the db alias was setup before tether attempted to resolve the DNS for it.

In the failing scenario, tether throws an exception because the query happens before alias is setup:

Apr 13 2018 01:09:50.460Z DEBUG Found 2 subscribers to 156418: vsphere.VMEvent - worker-127d5827b5ce on  10.160.227.24 in dc1 is powered on

Apr 13 2018 01:10:15.091Z DEBUG [ END ]  [vic/lib/tether.(*tether).launch:748] [560.275515ms] launching session 127d5827b5ceed157d8976cddb1ebb4f273a37d28ec7565404e08f69a4f71070
System.AggregateException: One or more errors occurred. (No such device or address) ---> System.Net.Internals.InternalSocketException: No such device or address
   at System.Net.Dns.HostResolutionEndHelper(IAsyncResult asyncResult)
   at System.Net.Dns.EndApr 13 2018 01:10:17.225Z DEBUG [0xc42027f9b0] write "System.AggregateException: One or more errors occurred. (No such device or address) ---> System.Net.Internals.InternalSocketException: No such device or address\n   at System.Net.Dns.HostResolutionEndHelper(IAsyncResult asyncResult)\n   at System.Net.Dns.End" to 2 writers (err: <nil>)
GetHostAddresses(IAsyncResult asyncResult)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
   at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
   at Npgsql.NpgsqlConnector.Connect(NpgsqlTimeout timeout)
   at Npgsql.NpgsqlConnector.RawOpen(NpgsqlTimeout timeout)
   at Npgsql.NpgsqlConnector.Open(NpgsqlTimeout timeout)
   at Npgsql.ConnectorPool.Allocate(NpgsqlConnection conn, NpgsqlTimeout timeout)
   at Npgsql.NpgsqlConnection.OpenInternal()
   at Worker.Program.OpenDbConnection(String connectionString)
   at Worker.Program.Main(String[] args)
---> (Inner Exception #0) System.Net.Internals.InternalSocketException: No such device or address
   at System.Net.Dns.HostResolutionEndHelper(IAsyncResult asyncResult)
   at System.Net.Dns.EndGetHostAddresses(IAsyncResult asyncResult)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)<---

Apr 13 2018 01:10:21.685Z DEBUG op=352.354: parsing alias :db
Apr 13 2018 01:10:21.685Z DEBUG adding alias votingapp_default:db -> db
Apr 13 2018 01:10:21.685Z DEBUG adding alias votingapp_default:c983d92a0bf8 -> db
Apr 13 2018 01:10:24.924Z DEBUG Found 2 subscribers to 156472: vsphere.VMEvent - redis-c683b9dab4db on  10.160.227.24 in dc1 is powered on
Apr 13 2018 01:10:24.924Z DEBUG Found 2 subscribers to 156471: vsphere.VMEvent - db-c983d92a0bf8 on  10.160.227.24 in dc1 is powered on
J6tmNJ3ZJBWAAw==358:M 13 Apr 01:10:43.170 * Ready to accept connections

In the passing scenario the alias was setup within 4 seconds of the container powering up, in the failing scenario the alias took over 31 seconds to be setup after powering on the container.