vmware / vic-product

vSphere Integrated Containers enables VMware customers to deliver a production-ready container solution to their developers and DevOps teams.
https://vmware.github.io/vic-product/

Cannot add host to Project: Error 500 #2413

Open dnoliver opened 5 years ago

dnoliver commented 5 years ago

Summary

After following the documentation in https://vmware.github.io/vic-product/assets/files/html/1.5, I cannot add a host to a project.

Admiral cannot communicate with the VCH instance.
The VCH instance logs show errors while trying to stat the datastore.

Environment information

vSphere 6.7
Single ESXi host 6.7
vCenter Server appliance with embedded Platform controller 6.7
VIC 1.5
VCH deployed with UI Wizard
one single datastore
a bridge network created with virtual switch
default VM Network as public network

vSphere and vCenter Server version

vSphere and vCenter 6.7 update 1

VIC Appliance version

vic-v1.5.2-7206-92ebfaf5

Configuration

Details

I was following the documentation step by step to deploy the first VCH host.

The VCH host deployed successfully.
vic-machine-linux ls shows the host.
All checks are green in the VCH admin portal.
I used the default-project in Admiral and tried to add the VCH host to it.
No TLS is being used.

I tried to add the host:

Error connecting to http://192.168.0.110:2376: Unexpected error: Connection refused: /192.168.0.110:2376

I am using http because the docs say to use http when there is no TLS. I tried several combinations, and none of them works.

I changed the type from VCH to DOCKER and received error 500.

I inspected the logs in the VCH admin portal. There are several ERROR messages, even though the UI shows all green checks.

The Docker Personality log shows this several times:

Apr  1 2019 23:10:04.393Z FATAL failed to initialize backend: [POST /storage][500] CreateImageStore default  &{Code:500 Message:cannot stat '[datastore1] virtual-container-host/VIC/423bf6bd-c91b-3c79-a0ac-ae0b26077784/images': No such file}

Port Layer showing the same:

Apr  1 2019 23:10:04.392Z ERROR op=264.404: Error getting image store 423bf6bd-c91b-3c79-a0ac-ae0b26077784: cannot stat '[datastore1] virtual-container-host/VIC/423bf6bd-c91b-3c79-a0ac-ae0b26077784/images': No such file
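
The quoted path in these messages names the exact folder the VCH expects to find. As a rough sketch, assuming the message format shown above and a log saved as text, the path can be pulled out like this (the function name missing_image_store is hypothetical, not part of VIC):

```shell
#!/usr/bin/env sh
# Hypothetical helper: print each distinct datastore path that a
# "cannot stat" message reports as missing. Reads log text on stdin.
missing_image_store() {
  # Capture the text between the single quotes that follow "cannot stat".
  sed -n "s/.*cannot stat '\([^']*\)'.*/\1/p" | sort -u
}

# Example input: the Port Layer line quoted in this report.
log_line="Apr  1 2019 23:10:04.392Z ERROR op=264.404: Error getting image store 423bf6bd-c91b-3c79-a0ac-ae0b26077784: cannot stat '[datastore1] virtual-container-host/VIC/423bf6bd-c91b-3c79-a0ac-ae0b26077784/images': No such file"
printf '%s\n' "$log_line" | missing_image_store
```

Repeated occurrences of the same message collapse to a single line, which makes it easier to see whether one folder or several are affected.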

The Init log shows no problems.

The VIC Admin log shows the following error several times:

Apr  1 2019 23:10:11.204Z ERROR Process docker-engine-server not running: open /.tether/run/docker-engine-server.pid: no such file or directory

Steps to reproduce

  1. Follow the docs to deploy VIC and create a VCH.
  2. Assign the VCH to default-project in Admiral.

Actual behavior

Cannot establish connection error

Expected behavior

VCH should be added to default-project

Support information

Logs

Not comfortable with posting publicly, private channel is ok

See also

Troubleshooting attempted

dnoliver commented 5 years ago

This looks similar to the error described in https://vmware.github.io/vic-product/assets/files/html/1.5/vic_vsphere_admin/ts_imagestore_error.html. But in this case the error message is different, and I have not assigned any container to run yet.

dnoliver commented 5 years ago

Nothing in the KB articles https://kb.vmware.com/s/global-search/%40uri#q=Error%20getting%20image%20store&t=Knowledge&sort=relevancy&f:@commonproduct=[vSphere%20Integrated%20Containers]

dnoliver commented 5 years ago

I manually browsed to the datastore1 folder and created the images folder there. At that point, the Docker Personality log reported success:

Apr  2 2019 18:52:56.080Z FATAL failed to initialize backend: [POST /storage][500] CreateImageStore default  &{Code:500 Message:500 Internal Server Error}
time="2019-04-02T18:52:59Z" level=info msg="Launching docker personality pprof server on 127.0.0.1:6062" 
Apr  2 2019 18:52:59.356Z ERROR Unable to load CAs for registry access in config
Apr  2 2019 18:52:59.356Z INFO  Waiting for portlayer to come up
Apr  2 2019 18:53:01.358Z INFO  Portlayer is up and responding to pings
Apr  2 2019 18:53:01.358Z INFO  Refreshing repository cache
Apr  2 2019 18:53:01.360Z INFO  Image cache initialized successfully
Apr  2 2019 18:53:01.360Z INFO  Repository cache updated successfully
Apr  2 2019 18:53:01.360Z INFO  Layer cache initialized successfully
Apr  2 2019 18:53:01.361Z INFO  Container cache updated successfully
Apr  2 2019 18:53:01.361Z INFO  Creating image store
Apr  2 2019 18:53:01.362Z INFO  TLS enabled
Apr  2 2019 18:53:01.363Z INFO  Listener created for HTTP on 192.168.0.110//tcp
Apr  2 2019 18:53:01.379Z INFO  API listen on 192.168.0.110:2376

But then I restarted the host, and the images folder I had created was removed again. Apparently that folder is managed by the VCH host, so I will run into the same problem after every reboot.
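
Since the folder has to be recreated by hand, it can help to derive the datastore name and relative path from the log message instead of retyping them. The sketch below only prints a command and does not run it. It assumes the govc CLI is available, which is not something this thread used (the folder was created through the datastore browser), and print_mkdir_command is a hypothetical name:

```shell
#!/usr/bin/env sh
# Split a datastore path such as "[datastore1] dir/sub" into its datastore
# name and relative path, then print (not run) a govc command that would
# create the folder. Using govc here is an assumption, not from the thread.
print_mkdir_command() {
  full="$1"
  ds=${full#"["}        # drop the leading "["
  ds=${ds%%"]"*}        # keep everything before the first "]"
  rel=${full#*"] "}     # keep everything after "] "
  printf "govc datastore.mkdir -ds '%s' -p '%s'\n" "$ds" "$rel"
}

print_mkdir_command "[datastore1] virtual-container-host/VIC/423bf6bd-c91b-3c79-a0ac-ae0b26077784/images"
```

Printing rather than executing keeps the step reviewable, which matters here because the folder disappears again after a reboot and the command would have to be repeated.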

dnoliver commented 5 years ago

After this workaround, I was able to add the host to the project! :)

dnoliver commented 5 years ago

But I am still having storage-related problems. Trying to deploy a container fails with the following error:

Retries are prevented. Failure: Error: Failed to write to image store: [POST /storage/{store_name}][500] WriteImage default \u0026{Code:500 Message:parent (scratch) doesn't exist in http:///storage/images/423bf6bd-c91b-3c79-a0ac-ae0b26077784: file does not exist}; Reason: {"errorDetail":{"message":"Failed to write to image store: [POST /storage/{store_name}][500] WriteImage default \u0026{Code:500 Message:parent (scratch) doesn't exist in http:///storage/images/423bf6bd-c91b-3c79-a0ac-ae0b26077784: file does not exist}"},"error":"Failed to write to image store: [POST /storage/{store_name}][500] WriteImage default \u0026{Code:500 Message:parent (scratch) doesn't exist in http:///storage/images/423bf6bd-c91b-3c79-a0ac-ae0b26077784: file does not exist}"}

dnoliver commented 5 years ago

I tried destroying the VCH host and creating a new one, and ran into the same issue. I cannot add the host to a project because of the same error, and after applying the workaround I can. But then I cannot run a container on the host because of the same error as in https://github.com/vmware/vic-product/issues/2413#issuecomment-479134653

wjun commented 5 years ago

@dnoliver We have seen similar issues before when the VC user or the ops user used to create the VCH does not have the privilege to create the datastore folder. Is that your case?

dnoliver commented 5 years ago

I followed this guide to create the vic-ops user https://vmware.github.io/vic-product/assets/files/html/1.5/vic_vsphere_admin/create_ops_user.html

In the https://vmware.github.io/vic-product/assets/files/html/1.5/vic_vsphere_admin/set_up_ops_user.html docs, I saw:

Grant Any Necessary Permissions

The operations user account must exist before you create a VCH. If you are deploying the VCH to a cluster, vSphere Integrated Containers Engine can configure the operations user account with all of the necessary permissions for you.

IMPORTANT: The option to grant any necessary permissions automatically only applies when deploying VCHs to clusters. If you are deploying the VCH to a standalone host that is managed by vCenter Server, you must configure the operations user account manually. For information about manually configuring the operations user account, see Manually Create a User Account for the Operations User.

I think I am doing a standalone host deployment, so maybe I have to either change to a cluster deployment, or follow https://vmware.github.io/vic-product/assets/files/html/1.5/vic_vsphere_admin/ops_user_manual.html to assign permissions to that user manually if I want to keep the standalone host deployment.

I will try that and report back with the results. Thank you @wjun!

dnoliver commented 5 years ago

I definitely have a datastore permissions problem for my vic-ops user :) Thank you for the hint @wjun.

  1. I am not using a cluster; I am deploying directly to a host. So the VCH deployment does not help me with permissions.
  2. I went through the manual permissions docs. I cannot guarantee that I did it correctly, given that there are several clicks to be made in several places. Custom permissions are applied to the root vCenter, the Datacenter, and the ESXi host. The datastore has the custom "VCH - endpoint - datastore" inherited permission.
  3. I created the VCH host again with the Wizard and left "apply permissions" checked. This caused my vic-user permissions to be replaced everywhere with the ones created by the tool, instead of the ones that I spent time assigning manually. My mistake, but somebody could add a warning there!
  4. Same error deploying the VCH; I applied the workaround, but (again) cannot create a container.
  5. I re-applied the same permissions that the wizard had overridden. I cannot guarantee that I did it correctly... it failed again.
  6. I gave the vic-ops user Administrator permission on the datastore, and I can deploy containers!
  7. I assigned the previous permissions on the datastore to the vic-ops user, and I ran into problems again. This confirms my permission problems.

My VCH - endpoint - datastore permission looks like this:

dvPort group: Modify, Policy operation, Scope operation

Datastore: Allocate space, Browse datastore, Configure datastore, Low level file operations, Remove file

Host > Configuration: System Management

Resource: Assign virtual machine to resource pool, Migrate powered off virtual machine

Virtual machine > Change Configuration: Add existing disk, Add new disk, Add or remove device, Advanced configuration, Modify device settings, Remove disk, Rename
Virtual machine > Edit Inventory: Create new, Register, Remove, Unregister
Virtual machine > Guest operations: Guest operation modifications, Guest operation program execution, Guest operation queries
Virtual machine > Interaction: Connect devices, Power off, Power on

Then the more accurate question is: why does the vic-ops user run into datastore permission problems while creating a VCH and/or running containers, if it has all the permissions specified by the documentation?
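
One way to narrow this down is to diff the privileges a role is supposed to have against the ones it actually has. The sketch below assumes both lists have already been exported to plain text files, one privilege per line. The file names, the helper name missing_privileges, and the "granted" sample are all illustrative and do not reflect the real state of this environment:

```shell
#!/usr/bin/env sh
# Hypothetical helper: print privileges listed in the first file but
# absent from the second. Both files hold one privilege name per line.
missing_privileges() {
  required="$1"
  granted="$2"
  req_sorted=$(mktemp)
  got_sorted=$(mktemp)
  sort -u "$required" > "$req_sorted"
  sort -u "$granted" > "$got_sorted"
  # comm -23 keeps only the lines unique to the first (required) list.
  comm -23 "$req_sorted" "$got_sorted"
  rm -f "$req_sorted" "$got_sorted"
}

# Illustrative input only: the Datastore privileges listed in this post,
# and a made-up "granted" list with one entry missing.
workdir=$(mktemp -d)
printf '%s\n' "Allocate space" "Browse datastore" "Configure datastore" \
  "Low level file operations" "Remove file" > "$workdir/required.txt"
printf '%s\n' "Allocate space" "Browse datastore" "Configure datastore" \
  "Remove file" > "$workdir/granted.txt"
missing_privileges "$workdir/required.txt" "$workdir/granted.txt"
rm -rf "$workdir"
```

An empty result would mean the role matches the documented list, which would point away from a missing privilege and toward something else in the environment.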

dnoliver commented 5 years ago

I did the same deployment, but now using a cluster, and the problem is still there. I have to manually create the images folder to make the VCH host work the first time, and I need to give vic-user the Administrator role on the datastore for a container deployment to succeed. So this permission problem happens regardless of whether the deployment is to a cluster or to a standalone host.

wjun commented 5 years ago

I tried creating a VCH from the CLI on a VC cluster, and it works. Please note that --user is an admin user, and --ops-user must be combined with --ops-grant-perms so that the VCH can assign the related permissions to this ops-user automatically.

dnoliver commented 5 years ago

Great @wjun, I have only tested this with the UI Wizard, where I think --user administrator@vsphere.local is implicit (that is the user I use to log in to vCenter Server). I will give the CLI command a shot to validate. Thanks!

dnoliver commented 5 years ago

@wjun I have validated the CLI approach, and I ran into the same issue again.

The command used to deploy this VCH was

./vic-machine-linux create --name virtual-container-host-1 \
    --compute-resource Cluster \
    --image-store 'datastore1 (1)' \
    --base-image-size 8GB \
    --volume-store 'datastore1 (1):default' \
    --bridge-network vic-bridge \
    --bridge-network-range 172.16.0.0/12 \
    --public-network 'VM Network' \
    --tls-cname virtual-container-host-1 \
    --certificate-key-size 2048 \
    --no-tlsverify \
    --user Administrator@VSPHERE.LOCAL \
    --thumbprint <thumb> \
    --target 192.168.0.238/Datacenter \
    --ops-user vic-ops@vsphere.local \
    --ops-grant-perms

The command execution log is below:

INFO[0000] ### Installing VCH ####                      
INFO[0000] vSphere password for vic-ops@vsphere.local:  
INFO[0003] Loaded server certificate virtual-container-host-1/server-cert.pem 
WARN[0003] Configuring without TLS verify - certificate-based authentication disabled 
INFO[0003] Validating supplied configuration            
INFO[0004] Network configuration OK on "vic-bridge"     
INFO[0004] vCenter settings check OK                    
INFO[0004] Firewall status: ENABLED on "/Datacenter/host/Cluster/192.168.0.217" 
INFO[0004] Firewall configuration OK on hosts:          
INFO[0004]  "/Datacenter/host/Cluster/192.168.0.217"    
INFO[0004] vCenter settings check OK                    
INFO[0004] License check OK on hosts:                   
INFO[0004]   "/Datacenter/host/Cluster/192.168.0.217"   
INFO[0004] DRS check OK on:                             
INFO[0004]   "/Datacenter/host/Cluster"                 
WARN[0004] Only one host can access all of the image/volume datastores. This may be a point of contention/performance degradation and HA/DRS may not work as intended. 
INFO[0004]                                              
INFO[0005] Creating Resource Pool "virtual-container-host-1" 
INFO[0005] Creating appliance on target                 
INFO[0005] Network role "client" is sharing NIC with "public" 
INFO[0005] Network role "management" is sharing NIC with "public" 
INFO[0005] Creating the VCH folder                      
INFO[0005] Creating the VCH VM                          
INFO[0006] Creating directory [datastore1 (1)] VIC      
INFO[0006] Datastore path is [datastore1 (1)] VIC       
INFO[0007] Uploading ISO images                         
INFO[0008] Uploading appliance.iso as V1.5.2-20879-30B67A14-appliance.iso 
INFO[0027] Uploading bootstrap.iso as V1.5.2-20879-30B67A14-bootstrap.iso 
INFO[0045] Waiting for IP information                   
INFO[0052] Waiting for major appliance components to launch 
INFO[0052] Obtained IP address for client interface: "192.168.0.199" 
INFO[0052] Checking VCH connectivity with vSphere target 
INFO[0052] vSphere API Test: https://192.168.0.238 vSphere API target responds as expected 
ERRO[0225] vic/lib/install/management.(*Dispatcher).CheckDockerAPI: Create error: context deadline exceeded
vic/cmd/vic-machine/create.(*Create).Run:755 Create
vic/cmd/vic-machine/common.NewOperation:27 vic-machine-linux 
INFO[0225] Docker API endpoint check failed: context deadline exceeded 
INFO[0225] Collecting 598fc05d-88d3-4d9b-8c5a-f55a274e2db1 vpxd.log 
INFO[0225]  API may be slow to start - try to connect to API after a few minutes: 
INFO[0225]      Run command: docker -H 192.168.0.199:2376 --tls info 
INFO[0225]      If command succeeds, VCH is started. If command fails, VCH failed to install - see documentation for troubleshooting. 
ERRO[0225] vic/cmd/vic-machine/create.(*Create).Run.func3: Create error: context deadline exceeded
vic/cmd/vic-machine/create.(*Create).Run:755 Create
vic/cmd/vic-machine/common.NewOperation:27 vic-machine-linux 
ERRO[0225] --------------------                         
ERRO[0225] vic-machine-linux create failed: Creating VCH exceeded time limit of 3m0s. Please increase the timeout using --timeout to accommodate for a busy vSphere target
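
The log above suggests retrying the Docker endpoint after a few minutes with docker -H 192.168.0.199:2376 --tls info. A minimal sketch of that check as a retry loop follows. The docker invocation is taken from the log, while the function name wait_for_vch and the default attempt count and delay are assumptions chosen for illustration:

```shell
#!/usr/bin/env sh
# Poll the VCH Docker endpoint until "docker info" succeeds or the
# attempts run out. Returns 0 on success, 1 after giving up.
wait_for_vch() {
  endpoint="$1"          # for example 192.168.0.199:2376
  attempts="${2:-10}"    # how many times to try (assumed default)
  delay="${3:-30}"       # seconds between tries (assumed default)
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Same command the vic-machine log recommends, with output discarded.
    if docker -H "$endpoint" --tls info >/dev/null 2>&1; then
      echo "VCH endpoint $endpoint is up (attempt $i)"
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt remains.
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  echo "VCH endpoint $endpoint did not respond after $attempts attempts" >&2
  return 1
}

# Usage (not run here): wait_for_vch 192.168.0.199:2376
```

If the loop gives up, the endpoint never came up at all, which is consistent with a backend that failed to initialize rather than one that was merely slow to start.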

At least this time I get an error, and not the silent failure of the UI Wizard. In the Docker Personality log, I can see the same issue as before:

Apr  8 2019 20:26:07.096Z FATAL failed to initialize backend: [POST /storage][500] CreateImageStore default  &{Code:500 Message:cannot stat '[datastore1 (1)] virtual-container-host-1/VIC/423b9844-dafc-a118-d910-f6ce4a309745/images': No such file}

And I am sure the workaround still applies. If I create the /images directory manually and assign admin permissions on the datastore to the vic-ops user, this will start working.

I also tried to deploy a VCH keeping Administrator access for vic-ops on the datastore and removing the --ops-grant-perms parameter, and it runs into the same issue as before:

Apr  8 2019 20:26:07.096Z FATAL failed to initialize backend: [POST /storage][500] CreateImageStore default  &{Code:500 Message:cannot stat '[datastore1 (1)] virtual-container-host-1/VIC/423b9844-dafc-a118-d910-f6ce4a309745/images': No such file}

So this /images folder error may not be a permissions problem (at least not on the datastore), but something else. The Administrator permission seems to solve the second issue, when trying to run a container, but not the initial creation of the images folder.

wjun commented 5 years ago

@dnoliver I tried various combinations of ops-user and datastores, and cannot reproduce this in my local environment. Could you also post your portlayer.log, which should contain error messages related to the images directory creation failure? Another option is to first remove --ops-user and --ops-grant-perms from VCH create and see whether you can still reproduce the issue.

dnoliver commented 5 years ago

Ok, will try to share the portlayer.log file.

The only special thing about my installation is that it uses VM Encryption. I have a KMS, an encryption storage policy, and a couple of encrypted VMs running on the same host. Is that relevant to this issue?

bosco777 commented 2 years ago

I hate to comment on an old thread, but I have vSAN encryption with vCenter KMS, and I experienced the same problem with the 'grant all permissions needed' option and had to create the images folder manually for this to work. So this still seems to be an issue.