vmware / vic

vSphere Integrated Containers Engine is a container runtime for vSphere.
http://vmware.github.io/vic
Other
640 stars 173 forks source link

bridge vswitch doesn't exist but cannot be created because it already exists #7010

Open hickeng opened 6 years ago

hickeng commented 6 years ago

We are seeing a failure unique to builds-14.vmware.run where we fail to create a vswitch called bridge with the error that it already exists. However just prior to this we check for it's existence and get the debug message seen below no such network "bridge".

Hypothesis: We have aborted during VCH creation after creating a vswitch but before creating the associated port group and we're only searching for a portgroup of the specified name. Search: https://github.com/vmware/vic/blob/e0c8ff5b6a02a4d5c8a04817616fb4f49353962c/lib/install/validate/network.go#L276 Create: https://github.com/vmware/vic/blob/e0c8ff5b6a02a4d5c8a04817616fb4f49353962c/lib/install/management/network.go#L56

Acceptance:

Details: vic-machine.log

INFO[0000] ### Installing VCH ####
INFO[0000] vSphere password for root:
WARN[0007] Using administrative user for VCH operation - use --ops-user to improve security (see -x for advanced help)
DEBU[0007] client network: IP {<nil> <nil>} gateway <nil> dest: []
DEBU[0007] public network: IP {<nil> <nil>} gateway <nil> dest: []
DEBU[0007] management network: IP {<nil> <nil>} gateway <nil> dest: []
DEBU[0007] VCH DNS servers: []
DEBU[0007] Unable to locate existing server certificate in cert path
INFO[0007] Generating self-signed certificate/key pair - private key in ghicken/server-key.pem
WARN[0007] Configuring without TLS verify - certificate-based authentication disabled
DEBU[0007] --target=https://builds-14.vmware.run
DEBU[0007] --thumbprint=E7:FE:47:0F:56:4B:2D:DC:76:6C:66:DF:40:24:9E:E9:4C:64:0A:C5
DEBU[0007] --name=ghicken
DEBU[0007] --compute-resource=(*flag.stringValue)(0xc4201d0010)
DEBU[0007] --image-store=datastore1/ghicken
DEBU[0007] --volume-store=datastore1/ghicken-vols:default
DEBU[0007] --bridge-network=bridge
DEBU[0007] --public-network=vm-network
DEBU[0007] --no-tlsverify=true
DEBU[0007] --timeout=1m30s
DEBU[0007] --debug=1
DEBU[0007] Creating VMOMI session with thumbprint E7:FE:47:0F:56:4B:2D:DC:76:6C:66:DF:40:24:9E:E9:4C:64:0A:C5
DEBU[0008] Session Environment Info:                     API Type=HostAgent API Version=6.0 Build=5572656 Name="VMware ESXi" OS Type=vmnix-x86 Product ID=embeddedEsx UUID= Vendor="VMware,
DEBU[0008] vSphere resource cache populating...
DEBU[0009] Cached dc:
DEBU[0010] Cached cluster:
DEBU[0010] Cached ds:
DEBU[0011] Cached host:
DEBU[0012] Cached pool:
DEBU[0012] Cached folders:
DEBU[0012] vSphere resource cache populated...
INFO[0012] Validating supplied configuration
DEBU[0012] Setting scratch image size to 8000000 KB in VCHConfig
DEBU[0014] Using default resource pool for compute resource: "/ha-datacenter/host/ns539600.ip-144-217-72.net/Resources"
DEBU[0017] no such network "bridge"
DEBU[0017] Network configuration:
DEBU[0017]      Network: public NetworkEndpoint: &{{  public } false <nil> {<nil> <nil>}  [] false}
DEBU[0017]      Network: client NetworkEndpoint: &{{  client } false <nil> {<nil> <nil>}  [] false}
DEBU[0017]      Network: management NetworkEndpoint: &{{   } false <nil> {<nil> <nil>}  [] false}
DEBU[0017]      Network: bridge NetworkEndpoint: &{{  bridge } true <nil> {<nil> <nil>}  [] false}
DEBU[0017] vCenter settings check SKIPPED
DEBU[0017] Checking firewall with management network IP {<nil> <nil>}
INFO[0020] Firewall status: DISABLED on "/ha-datacenter/host/ns539600.ip-144-217-72.net/ns539600.ip-144-217-72.net"
DEBU[0020] fw misconfigured with fw disabled "/ha-datacenter/host/ns539600.ip-144-217-72.net/ns539600.ip-144-217-72.net"
WARN[0020] Firewall configuration will be incorrect if firewall is reenabled on hosts:
WARN[0020]      "/ha-datacenter/host/ns539600.ip-144-217-72.net/ns539600.ip-144-217-72.net"
INFO[0020] Firewall must permit dst 2377/tcp outbound to VCH management interface if firewall is reenabled
DEBU[0020] vCenter settings check SKIPPED
INFO[0020] License check OK
INFO[0020] DRS check SKIPPED - target is standalone host
DEBU[0020] Configuring without client verification due to empty certificate authorities
DEBU[0024] Datacenter: "ha-datacenter", Cluster: "ComputeResource:ha-compute-res", Resource Pool: ""
INFO[0024]
INFO[0029] Creating Resource Pool "ghicken"
INFO[0030] Creating VirtualSwitch
INFO[0030] Collecting ha-host hostd.log
ERRO[0030] Failed to add virtual switch ("bridge"): ServerFaultCode: The specified key, name, or identifier 'bridge' already exists.
ERRO[0031] --------------------
ERRO[0031] vic-machine-linux create failed: Failed to add virtual switch ("bridge"): ServerFaultCode: The specified key, name, or identifier 'bridge' already exists.

hostd.log

Accepted password for user root from 198.27.220.171
2017-12-29T19:32:16.079Z info hostd[22EC1B70] [Originator@6876 sub=Vimsvc.ha-eventmgr opID=2760.1-d223] Event 46693 : User root@198.27.220.171 logged in as vic-machine/1.3.0-rc1
[LikewiseGetDomainJoinInfo:355] QueryInformation(): ERROR_FILE_NOT_FOUND (2/0):
Accepted password for user root from 198.27.220.171
2017-12-29T19:32:21.828Z info hostd[259C1B70] [Originator@6876 sub=Vimsvc.ha-eventmgr opID=2760.1-d237] Event 46694 : User root@198.27.220.171 logged in as vice-validator
2017-12-29T19:32:22.103Z info hostd[21413B70] [Originator@6876 sub=Vimsvc.ha-eventmgr opID=2760.1-d238 user=root] Event 46695 : User root@198.27.220.171 logged out (login time: Friday, 29 December, 2017 19:32:21, number of API invocatio
2017-12-29T19:32:27.435Z error hostd[259C1B70] [Originator@6876 sub=Solo.VmwareCLI opID=8a3dd24e user=root] GetPrimitiveParam: Cannot find (help)
2017-12-29T19:32:27.435Z info hostd[259C1B70] [Originator@6876 sub=Solo.VmwareCLI opID=8a3dd24e user=root] Dispatch get
2017-12-29T19:32:27.436Z info hostd[259C1B70] [Originator@6876 sub=Solo.VmwareCLI opID=8a3dd24e user=root] Dispatch get done
2017-12-29T19:32:37.501Z info hostd[259C1B70] [Originator@6876 sub=Vimsvc.TaskManager opID=2760.1-d276 user=root] Task Created : haTask-ha-root-pool-vim.ResourcePool.createResourcePool-833080
2017-12-29T19:32:37.501Z info hostd[259C1B70] [Originator@6876 sub=Hostsvc.ResourcePool pool419 opID=2760.1-d276 user=root] Resource pool instantiated
2017-12-29T19:32:37.503Z info hostd[259C1B70] [Originator@6876 sub=Vimsvc.TaskManager opID=2760.1-d276 user=root] Task Completed : haTask-ha-root-pool-vim.ResourcePool.createResourcePool-833080 Status success
2017-12-29T19:32:38.029Z info hostd[259C1B70] [Originator@6876 sub=Vimsvc.TaskManager opID=2760.1-d278 user=root] Task Created : haTask-ha-host-vim.host.NetworkSystem.addVirtualSwitch-833081
2017-12-29T19:32:38.030Z info hostd[259C1B70] [Originator@6876 sub=Default opID=2760.1-d278 user=root] AdapterServer caught exception: vim.fault.AlreadyExists
2017-12-29T19:32:38.030Z info hostd[259C1B70] [Originator@6876 sub=Vimsvc.TaskManager opID=2760.1-d278 user=root] Task Completed : haTask-ha-host-vim.host.NetworkSystem.addVirtualSwitch-833081 Status error
2017-12-29T19:32:38.030Z info hostd[259C1B70] [Originator@6876 sub=Solo.Vmomi opID=2760.1-d278 user=root] Activation [N5Vmomi10ActivationE:0x2616fa78] : Invoke done [addVirtualSwitch] on [vim.host.NetworkSystem:networkSystem]
2017-12-29T19:32:38.030Z verbose hostd[259C1B70] [Originator@6876 sub=Solo.Vmomi opID=2760.1-d278 user=root] Arg vswitchName:
--> "bridge"
2017-12-29T19:32:38.030Z verbose hostd[259C1B70] [Originator@6876 sub=Solo.Vmomi opID=2760.1-d278 user=root] Arg spec:
--> (vim.host.VirtualSwitch.Specification) {
-->    numPorts = 1024,
-->    bridge = (vim.host.VirtualSwitch.Bridge) null,
-->    policy = (vim.host.NetworkPolicy) null,
-->    mtu = <unset>
--> }
2017-12-29T19:32:38.030Z info hostd[259C1B70] [Originator@6876 sub=Solo.Vmomi opID=2760.1-d278 user=root] Throw vim.fault.AlreadyExists
2017-12-29T19:32:38.030Z info hostd[259C1B70] [Originator@6876 sub=Solo.Vmomi opID=2760.1-d278 user=root] Result:
--> (vim.fault.AlreadyExists) {
-->    faultCause = (vmodl.MethodFault) null,
-->    name = "bridge",
-->    msg = ""
--> }
2017-12-29T19:32:38.574Z info hostd[22EC1B70] [Originator@6876 sub=Vimsvc.ha-eventmgr opID=2760.1-d27a user=root] Event 46696 : User root@198.27.220.171 logged out (login time: Friday, 29 December, 2017 19:32:15, number of API invocatio
[LikewiseGetDomainJoinInfo:355] QueryInformation(): ERROR_FILE_NOT_FOUND (2/0):
[LikewiseGetDomainJoinInfo:355] QueryInformation(): ERROR_FILE_NOT_FOUND (2/0):

govc host.vswitch.info shows that we have a vswitch but no associated portgroup (the VCH bridge portgroups are a test cleanup failure, but not unique to this host):

Name:                    vSwitch0
Portgroup:               VM Network WAN, Management Network WAN
Pnic:                    vmnic0
MTU:                     1500
Ports:                   1792
Ports Available:         1780
Allow promiscuous mode:  No
Allow forged transmits:  Yes
Allow MAC changes:       Yes

Name:                    vSwitchLAN
Portgroup:               VM Network, Management Network, VCH-15525-6974-bridge, VCH-15525-8575-bridge, VCH-15525-4081-bridge, VCH-15525-8399-bridge, VCH-15525-5207-bridge, VCH-15525-5857-bridge, VCH-15525-4925-bridge, VCH-15525-5161-bridge, VCH-15525-8277-bridge, VCH-15525-8522-bridge, VCH-15525-5076-bridge, VCH-15525-9544-bridge, VCH-15525-5609-bridge, VCH-15525-7689-bridge, VCH-15525-5411-bridge, VCH-15525-7139-bridge, VCH-15525-6181-bridge, VCH-15525-8479-bridge, VCH-15525-5162-bridge, VCH-15525-1286-bridge, VCH-15525-6130-bridge, VCH-15525-1427-bridge, VCH-15525-7229-bridge, VCH-15525-3917-bridge, VCH-15525-6809-bridge, VCH-15525-9126-bridge, VCH-15525-6327-bridge, VCH-15525-2544-bridge, VCH-15525-4634-bridge, VCH-15525-6859-bridge, VCH-15525-3052-bridge, VCH-15525-5840-bridge, VCH-15525-6653-bridge, VCH-15525-3715-bridge, VCH-15525-7468-bridge, VCH-15525-8200-bridge, VCH-15525-3824-bridge, VCH-15525-6525-bridge, VCH-15525-9265-bridge, VCH-15525-8054-bridge, VCH-15525-1467-bridge, VCH-15525-7550-bridge, VCH-15525-9283-bridge, VCH-15525-5475-bridge, VCH-15525-5430-bridge, VCH-15525-5788-bridge, VCH-15525-2349-bridge, VCH-15525-5340-bridge, VCH-15525-3349-bridge, VCH-15525-2234-bridge, VCH-15525-4013-bridge, VCH-15525-7457-bridge, VCH-15525-7616-bridge, VCH-15525-6082-bridge, VCH-15525-2308-bridge, VCH-15525-6595-bridge, VCH-15525-1272-bridge, VCH-15525-2763-bridge, VCH-15525-6708-bridge, VCH-15525-4447-bridge, VCH-15525-2461-bridge, VCH-15525-5945-bridge, VCH-15525-7049-bridge, VCH-15525-1770-bridge, vm-network, management, VCH-15525-8107-bridge
Pnic:                    vmnic1
MTU:                     1500
Ports:                   1792
Ports Available:         1780
Allow promiscuous mode:  Yes
Allow forged transmits:  Yes
Allow MAC changes:       Yes

Name:                    bridge
Portgroup:
Pnic:
MTU:                     1500
Ports:                   1792
Ports Available:         1780
Allow promiscuous mode:  No
Allow forged transmits:  Yes
Allow MAC changes:       Yes

vic-cmd output shows that we have a vswitch called bridge:

[root@ns539600:~] vim-cmd hostsvc/net/vswitch_info bridge
(vim.host.VirtualSwitch) {
   name = "bridge",
   key = "key-vim.host.VirtualSwitch-bridge",
   numPorts = 1792,
   numPortsAvailable = 1780,
   mtu = 1500,
   spec = (vim.host.VirtualSwitch.Specification) {
      numPorts = 1024,
      bridge = (vim.host.VirtualSwitch.Bridge) null,
      policy = (vim.host.NetworkPolicy) {
         security = (vim.host.NetworkPolicy.SecurityPolicy) {
            allowPromiscuous = false,
            macChanges = true,
            forgedTransmits = true
         },
         nicTeaming = (vim.host.NetworkPolicy.NicTeamingPolicy) {
            policy = "loadbalance_srcid",
            reversePolicy = true,
            notifySwitches = true,
            rollingOrder = false,
            failureCriteria = (vim.host.NetworkPolicy.NicFailureCriteria) {
               checkSpeed = "minimum",
               speed = 10,
               checkDuplex = false,
               fullDuplex = false,
               checkErrorPercent = false,
               percentage = 0,
               checkBeacon = false
            },
            nicOrder = (vim.host.NetworkPolicy.NicOrderPolicy) {
            }
         },
         offloadPolicy = (vim.host.NetOffloadCapabilities) {
            csumOffload = true,
            tcpSegmentation = true,
            zeroCopyXmit = true
         },
         shapingPolicy = (vim.host.NetworkPolicy.TrafficShapingPolicy) {
            enabled = false,
            averageBandwidth = <unset>,
            peakBandwidth = <unset>,
            burstSize = <unset>
         }
      },
      mtu = <unset>
   },
}
hickeng commented 6 years ago

Using the following patch we get the body of the fault:

DEBU[0024] Unexpected soap fault type: {VimFault:{MethodFault:{FaultCause:<nil> FaultMessage:[]}} Name:bridge}
index 449808e..958b7b8 100644
--- a/lib/install/management/network.go
+++ b/lib/install/management/network.go
@@ -19,6 +19,8 @@ import (

        "github.com/vmware/govmomi/find"
        "github.com/vmware/govmomi/object"
+       "github.com/vmware/govmomi/task"
+       "github.com/vmware/govmomi/vim25/soap"
        "github.com/vmware/govmomi/vim25/types"
        "github.com/vmware/vic/lib/config"
        "github.com/vmware/vic/pkg/errors"
@@ -53,8 +55,41 @@ func (d *Dispatcher) createBridgeNetwork(conf *config.VirtualContainerHostConfig
        if err = hostNetSystem.AddVirtualSwitch(d.op, name, &types.HostVirtualSwitchSpec{
                NumPorts: 1024,
        }); err != nil {
-               err = errors.Errorf("Failed to add virtual switch (%q): %s", name, err)
-               return err
+               handled := false
+               if soap.IsSoapFault(err) {
+                       switch f := soap.ToSoapFault(err).VimFault().(type) {
+                       case *types.AlreadyExists:
+                               handled = true
+                               d.op.Debugf("vSwitch already exists with name (soapfault): %s", name)
+                       default:
+                               d.op.Debugf("Unexpected soap fault type: %+v", f)
+                       }
+               }
+
+               if soap.IsVimFault(err) {
+                       switch f := soap.ToVimFault(err).(type) {
+                       case *types.AlreadyExists:
+                               handled = true
+                               d.op.Debugf("vSwitch already exists with name (vimfault): %s", name)
+                       default:
+                               d.op.Debugf("Unexpected vim fault type: %+v", f)
+                       }
+               }
+
+               switch err := err.(type) {
+               case task.Error:
+                       switch f := err.Fault().(type) {
+                       case *types.TaskInProgress:
+                               d.op.Debugf("vSwitch already exists with name (fault): %s", name)
+                       default:
+                               d.op.Debugf("Unexpected vim fault type: %+v", f)
+                       }
+               }
+
+               if !handled {
+                       err = errors.Errorf("Failed to add virtual switch (%q): %s", name, err)
+                       return err
+               }
        }

        d.op.Infof("Creating Portgroup")

Suggest that we modify https://github.com/vmware/govmomi/blob/d70eca22cace05a6ed7a5787ecd660724d65d911/object/host_network_system.go#L85 so that the AlreadyExists fault is propagated and use that to modify behaviour. (@dougm for thoughts? See the hostd.log output for server side fault details)

As an alternative we can use a Finder call first to determine if the switch already exists.