vmware / vic

vSphere Integrated Containers Engine is a container runtime for vSphere.
http://vmware.github.io/vic

Error deploying containers on VCH 0.6.0 #3031

Closed vmwarelab closed 7 years ago

vmwarelab commented 7 years ago
// A self-contained demonstration of the problem follows...

when executing the following from the client machine against VCH
docker -H 192.168.0.69:2376 --tls  run -p 80:80 vmwarecna/nginx 

Expected behavior:

run the nginx container

Actual behavior:

I get the following error:

docker: Failed to write to image store: [POST /storage/{store_name}][500] WriteImage default &{Code:0xc42020d2d0 Message:parent (scratch) doesn't exist in http://VCH01/storage/images/421de96b-21dd-8489-9624-2e10936158c7: cannot stat '[MGMT-LocalDisk1] VCH01/VIC/421de96b-21dd-8489-9624-2e10936158c7/images/scratch/manifest': No such file}.

Any help is appreciated. Thank you

mlh78750 commented 7 years ago

@vmwarelab Can you please collect the logs from the vic-admin page? And do you still happen to have the command line params you used to install the VCH?

vmwarelab commented 7 years ago

container-logs (1).zip

command i have used

 ./vic-machine-linux create --name VCH01 -t 'administrator@vmwarelab.local:PasswordHere@192.168.0.20' --compute-resource 'Management CL' --external-network dvPortGroup-Management-Network --bridge-network dvPortGroup-Bridge-VCH --image-store MGMT-LocalDisk1

i also found out from the documentation that the VIC plugin is not functional in VCH 0.6.0 release https://vmware.github.io/vic/assets/files/html/vic_installation/plugin_verify_deployment.html

mlh78750 commented 7 years ago

From personality log:

    time="2016-11-06T21:29:47Z" level=info msg="Creating image store"
    time="2016-11-06T21:29:51Z" level=error msg="Failed to create image store"
    time="2016-11-06T21:29:51Z" level=fatal msg="failed to initialize backend: [POST /storage][500] CreateImageStore default &{Code:0xc42036b248 Message:Put https://esx-01a.vmwarelab.local/folder/VCH01/VIC/421df1e6-9de7-d684-b8fb-384f2e2fc1c5/images/scratch/manifest?dsName=MGMT-LocalDisk1: dial tcp: lookup esx-01a.vmwarelab.local: no such host}"

cc: @jzt

@vmwarelab: can you try with the 0.7.0 release? We are going to root cause this for .6.0 but I'd like to see if you still have a problem with .7.0. If you do have the problem still in .7.0 will you please collect logs as well. .7.0 saw quite a few changes in the image storage code.

vmwarelab commented 7 years ago

@mlh78750 this is a bit different. here is the command i m using

  ./vic-machine-linux create --name VCH01 --target  'https://administrator@vmwarelab.local:*****@192.168.0.20/VMWareLab DC' --compute-resource 'Management CL' --external-network dvPortGroup-Management-Network --bridge-network dvPortGroup-Bridge-VCH --image-store MGMT-LocalDisk1 --no-tls

but i m getting a different issue now

    Failed to verify certificate for target=192.168.0.20 (thumbprint=AA:A2:7F:E1:2E:98:00:F6:24:6B:8B:73:E8:DC:EF:65:F2:D3:40:5A)
    Create cannot continue: failed to create validator

fdawg4l commented 7 years ago

@vmwarelab

I'm suspecting this isn't 0.7.0. What version of VIC did you download?

dial tcp: lookup esx-01a.vmwarelab.local: no such host

This also looks a bit suspect. When you log into the esx or VC UI, do you see any alerts related to this?

vmwarelab commented 7 years ago

@fdawg4l 0.7.0 downloaded from here https://bintray.com/vmware/vic/Download/v0.7.0

fdawg4l commented 7 years ago

Yeah, that's definitely the right place.

jzt commented 7 years ago

@vmwarelab You can fix this either by specifying the thumbprint, like so: --thumbprint=E1:E1:61:DF:72:8F:2B:EB:6B:85:B4:13:62:9C:2B:DF:A6:CC:9C:5B, or by using --force as part of the vic-machine arguments.
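One way to find the value to pass as the thumbprint is to read the fingerprint of the certificate the target actually serves. The sketch below assumes the thumbprint is the SHA-1 fingerprint of that certificate (the 20-byte values in this thread are consistent with that), that the `openssl` CLI is installed, and that the target listens on port 443. The function names are hypothetical helpers, not part of vic-machine.

```shell
# Strip the "SHA1 Fingerprint=" label that openssl prints before the value.
fingerprint_value() {
    sed 's/^.*=//'
}

# Print the SHA-1 fingerprint of the certificate served at host:port.
# Needs network access to the target and the openssl CLI.
cert_thumbprint() {
    openssl s_client -connect "$1" </dev/null 2>/dev/null |
        openssl x509 -noout -fingerprint -sha1 |
        fingerprint_value
}

# Usage (not run here; port 443 is an assumption):
#   cert_thumbprint 192.168.0.20:443
```

Compare the printed value with the thumbprint shown in the vic-machine error before passing it on the command line; if they differ, something other than the expected host is answering.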

mhagen-vmware commented 7 years ago

Pulling this into 0.8 for now, we need to further triage prior to release.

vmwarelab commented 7 years ago

the 0.7.0 seems harder to install even with --no-tls , --force

i m using the following command now and it does deploy the VCH but the script never finishes properly. also i noticed it doesn't get to create the vic folder within the VCH folder created on the datastore like it used to when using VCH 0.6.0

  ./vic-machine-linux create --name VCH01 --target  'https://administrator@vmwarelab.local:*****@192.168.0.20/VMWareLab DC' --compute-resource 'Management CL' --external-network dvPortGroup-Management-Network --bridge-network dvPortGroup-Bridge-VCH --image-store MGMT-LocalDisk1 --no-tls --force

and its timing out now throwing this error

    ERRO[2016-11-07T14:35:37-05:00] Property collector error: context deadline exceeded
    ERRO[2016-11-07T14:35:37-05:00] Unable to wait for extra config property guestinfo.vice..init.networks|client.assigned.IP: context deadline exceeded
    INFO[2016-11-07T14:35:37-05:00] Unable to get vm config: context deadline exceeded
    INFO[2016-11-07T14:35:37-05:00] Failed to retrieve IP for client interface
    INFO[2016-11-07T14:35:37-05:00] State of all interfaces:
    INFO[2016-11-07T14:35:37-05:00] "external" IP: "waiting for IP"
    INFO[2016-11-07T14:35:37-05:00] "client" IP: "waiting for IP"
    INFO[2016-11-07T14:35:37-05:00] "management" IP: "waiting for IP"
    INFO[2016-11-07T14:35:37-05:00] "bridge" IP: "waiting for IP"
    INFO[2016-11-07T14:35:37-05:00] State of components:
    INFO[2016-11-07T14:35:37-05:00] "vicadmin": ""
    INFO[2016-11-07T14:35:37-05:00] "docker-personality": ""
    INFO[2016-11-07T14:35:37-05:00] "port-layer": ""
    INFO[2016-11-07T14:35:37-05:00] Collecting C7FB0A62-9D41-4447-A619-5D298B91366C vpxd.log
    ERRO[2016-11-07T14:35:37-05:00] Failed to collect C7FB0A62-9D41-4447-A619-5D298B91366C vpxd.log: context canceled
    WARN[2016-11-07T14:35:37-05:00] No log data for C7FB0A62-9D41-4447-A619-5D298B91366C vpxd.log
    ERRO[2016-11-07T14:35:37-05:00] --------------------
    ERRO[2016-11-07T14:35:37-05:00] vic-machine-linux failed: Create timed out: use --timeout to add more time

mlh78750 commented 7 years ago

@vmwarelab For your original issue, are you deploying to a cluster of more than one host for the compute resource but to local storage on only one host?

For the .7.0 issue do you have DHCP available? If you don't you'll need to specify an IP. It looks like you're not getting any addresses. Is this a nested environment?

vmwarelab commented 7 years ago

@mlh78750 i m deploying to a cluster with one physical host (not nested) with local storage. Yes, DHCP is available, as each of my deployments had an IP.

i have changed how i specify the target by using

 'https://administrator@vmwarelab.local:****@192.168.0.20/VMWareLab DC'

instead of

--target 'https://192.168.0.20/VMWareLab DC' --user administrator@vmwarelab.local --password ****

but again it still times out, with a different output

    INFO[2016-11-07T15:49:12-05:00] Waiting for IP information
    INFO[2016-11-07T15:49:38-05:00] Waiting for major appliance components to launch
    INFO[2016-11-07T15:51:50-05:00] Collecting C7FB0A62-9D41-4447-A619-5D298B91366C vpxd.log
    ERRO[2016-11-07T15:51:50-05:00] Failed to collect C7FB0A62-9D41-4447-A619-5D298B91366C vpxd.log: context deadline exceeded
    WARN[2016-11-07T15:51:50-05:00] No log data for C7FB0A62-9D41-4447-A619-5D298B91366C vpxd.log
    ERRO[2016-11-07T15:51:50-05:00] --------------------
    ERRO[2016-11-07T15:51:50-05:00] vic-machine-linux failed: Create timed out: use --timeout to add more time

mlh78750 commented 7 years ago

@vmwarelab your PG dvPortGroup-Management-Network can reach the esx host as well? What is serving DHCP for you? it looks like your DNS settings from DHCP didn't make it into the appliance and I'm trying to figure out how that could happen.

mlh78750 commented 7 years ago

cc @hmahmood

vmwarelab commented 7 years ago

@mlh78750 Yes.. of course, it's one flat network. my DHCP is my provider's router, and it has no awareness of any DNS? i don't think that's the problem though, because i had no issue deploying the VCH 0.6.0 release.

is there a way to specify the dns manually within the vic-machine-linux command ? to my knowledge you can no longer use the --dns-server parameter

hmahmood commented 7 years ago

@vmwarelab --dns-server should work if you specify a static ip for the external network (via --external-network-ip). I believe otherwise we get the DNS servers from DHCP. The change that stops requesting DNS servers from DHCP servers did not make it into 0.7.0, but should be present in the upcoming 0.8.0 release.

Can you give --dns-server a shot in any case?

vmwarelab commented 7 years ago

@hmahmood it accepted the --dns-server parameter but still timed out. if i check the datastore, it doesn't get as far as creating the vic folder with all its contents. i only see a new folder called kvStores

mlh78750 commented 7 years ago

@vmwarelab can you run the install with --debug=1 and post the output please.

mlh78750 commented 7 years ago

@vmwarelab And my comment from an hour ago was trying to figure out what happened with .6.0. We can focus on getting your .7.0 setup.

hmahmood commented 7 years ago

@vmwarelab as far as DNS behavior is concerned, that is expected with 0.6.0 and 0.7.0. As long as you don't specify a static ip for the external network, DNS servers will continue to get populated from DHCP, and --dns-server will have no effect. This will be fixed with 0.8.0.
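The rule described here can be caught before an install attempt. The following is a minimal sketch of a pre-flight check based only on the behavior stated in this thread for 0.6.0 and 0.7.0; `check_dns_args` is a hypothetical wrapper function, not something vic-machine provides.

```shell
# Hypothetical pre-flight check, not part of vic-machine.
# Returns 1 if --dns-server is given without --external-network-ip,
# since the thread says --dns-server then has no effect (0.6.0/0.7.0).
check_dns_args() {
    has_dns=0
    has_static_ip=0
    for arg in "$@"; do
        case $arg in
            --dns-server|--dns-server=*) has_dns=1 ;;
            --external-network-ip|--external-network-ip=*) has_static_ip=1 ;;
        esac
    done
    if [ "$has_dns" -eq 1 ] && [ "$has_static_ip" -eq 0 ]; then
        echo "warning: --dns-server without --external-network-ip; DNS will come from DHCP" >&2
        return 1
    fi
    return 0
}
```

In a wrapper script it could be used as `check_dns_args "$@" && ./vic-machine-linux create "$@"`.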

vmwarelab commented 7 years ago

@mlh78750 here is output at the end . there are tons of connection refused .. i just copied the end of the trail

    DEBU[2016-11-08T13:32:10-05:00] Components not yet initialized, retrying
    DEBU[2016-11-08T13:32:10-05:00] connection refused
    DEBU[2016-11-08T13:32:11-05:00] Components not yet initialized, retrying
    DEBU[2016-11-08T13:32:11-05:00] connection refused
    time=2016-11-08T13:32:11.742321376-05:00 level=debug msg=[ END ] [github.com/vmware/vic/lib/install/management.(*Dispatcher).CheckDockerAPI:678] [2m19.565902706s]
    time=2016-11-08T13:32:11.742500723-05:00 level=debug msg=[BEGIN] [github.com/vmware/vic/lib/install/management.(*Dispatcher).CollectDiagnosticLogs:165]
    INFO[2016-11-08T13:32:11-05:00] Collecting C7FB0A62-9D41-4447-A619-5D298B91366C vpxd.log
    ERRO[2016-11-08T13:32:11-05:00] Failed to collect C7FB0A62-9D41-4447-A619-5D298B91366C vpxd.log: context deadline exceeded
    WARN[2016-11-08T13:32:11-05:00] No log data for C7FB0A62-9D41-4447-A619-5D298B91366C vpxd.log
    time=2016-11-08T13:32:11.744234572-05:00 level=debug msg=[ END ] [github.com/vmware/vic/lib/install/management.(*Dispatcher).CollectDiagnosticLogs:165] [1.734302ms]
    ERRO[2016-11-08T13:32:11-05:00] --------------------
    ERRO[2016-11-08T13:32:11-05:00] vic-machine-linux failed: Create timed out: use --timeout to add more time
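When a debug run produces many repeated lines like this, collapsing them into counts makes it easier to post a complete picture instead of only the tail. This is a rough sketch that assumes the output was saved with one log entry per line and uses the `LEVEL[timestamp] message` prefix seen above; `summarize_log` is a hypothetical helper name. Lines in the `time=... level=...` form are skipped.

```shell
# Count identical messages, ignoring the level and timestamp prefix.
# Reads log lines on stdin and prints "count<TAB>message", most frequent first.
summarize_log() {
    awk '
        /^[ \t]*[A-Z]+\[/ {
            sub(/^[ \t]*[A-Z]+\[[^]]*\] /, "")
            count[$0]++
        }
        END { for (msg in count) printf "%d\t%s\n", count[msg], msg }
    ' | sort -rn
}

# Usage (install.log is a hypothetical file holding the saved output):
#   summarize_log < install.log
```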

vmwarelab commented 7 years ago

Looks like we have a winner

./vic-machine-linux create --name VCH01 -t 'administrator@vmwarelab.local:*****@192.168.0.20' --compute-resource 'Management CL' --external-network dvPortGroup-Management-Network --bridge-network dvPortGroup-Bridge-VCH --image-store MGMT-LocalDisk1 --force --no-tlsverify --dns-server 192.168.0.10 --external-network-ip 192.168.0.22 --external-network-gateway 192.168.0.1/24 --debug=1

this command deployed the VCH successfully. now i can test pulling and running containers, so back to the original issue. also it seems that the VIC folder is now created on the datastore along with the manifest file, which wasn't there in the 0.6.0 release. that is the file the original error was looking for when trying to pull and run a container

mlh78750 commented 7 years ago

@vmwarelab Looking through your original issue on 0.6.0, it looks like the VCH appliance was somehow trying to connect to an FQDN called esx-01a.vmwarelab.local and could not resolve it. However, your install command was only using an IP for the target, hence the questions about your DHCP and DNS setup. If you target an FQDN, the VCH appliance needs to be able to resolve that name as well in order to connect to the vSphere infrastructure to manage the containers, storage, etc. It looks like in your original issue with 0.6.0 the host identifier was somehow just a name, and the VCH couldn't resolve that name. Does your VC use names for the hosts, and does it actually have DNS that resolves those names for the VC? If so, the VCH needs to point to the same DNS server.

And I suspect that is why the install on 0.7.0 worked with the specified DNS server in the command line. You said your DHCP was from your provider which I assume means that the DNS provided in the DHCP offer will also point to that provider.
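A quick way to test this theory is to take the host name from the failing log line and check whether a given resolver knows it. The sketch below assumes a Linux client with `getent` available; both function names are hypothetical helpers. Note that `can_resolve` only reports what the machine it runs on can resolve, so it is meaningful only on a machine that uses the same DNS server the VCH receives.

```shell
# Extract the host name from a Go resolver error such as
#   "dial tcp: lookup esx-01a.vmwarelab.local: no such host"
# Reads log text on stdin; prints one host name per matching line.
unresolved_host() {
    sed -n 's/.*lookup \([^ :]*\).*: no such host.*/\1/p'
}

# Succeeds if this machine's resolver can resolve the given name.
# Assumes getent is available (glibc-based Linux).
can_resolve() {
    getent hosts "$1" >/dev/null 2>&1
}

# Usage (not run here; personality.log is a hypothetical saved log file):
#   host=$(unresolved_host < personality.log | head -n 1)
#   can_resolve "$host" && echo "$host resolves" || echo "$host does not resolve"
```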

mlh78750 commented 7 years ago

@vmwarelab Wanted you to be aware of the current networking limitations for the management interface on the VCH. It should work well if it is L2 adjacent to the vSphere endpoints, but if not, you will need to implement the workaround in this issue: #3081.

vmwarelab commented 7 years ago

@mlh78750 you're absolutely right. it would be a great idea to provide better feedback so users know it's a DNS-related issue. thank you for your help on this. i still have to test the initial issue.

vmwarelab commented 7 years ago

@mlh78750 so back to the original issue and testing running a simple container using the following command on the client machine against the vch host

 docker -H 192.168.0.22:2376 --tls run -d -p 80:80 vmwarecna/nginx

failed with a different error using 0.7.0

docker: Error response from daemon: No volume store named (default) exists.

jzt commented 7 years ago

@vmwarelab Please see this doc: https://github.com/vmware/vic/blob/master/doc/user_doc/vic_app_dev/using_volumes_with_vic.md#create-a-volume-in-a-volume-store

This is likely caused by the vmwarecna/nginx image specifying a volume to use when creating a container.

If you create a default volume store when deploying your VCH, vmwarecna/nginx should be able to find and use it without error.

Instructions are here: https://github.com/vmware/vic/blob/master/doc/user_doc/vic_installation/vch_installer_options.md#--volume-store

Tagging @matthewavery @fdawg4l in case I'm incorrect. :)

jzt commented 7 years ago

Following up: I checked the vmwarecna/nginx metadata and was able to verify that it does use a volume:

    "Volumes": {
      "/var/cache/nginx": {}
    }

vmwarelab commented 7 years ago

@jzt thank you for the documentation you mentioned.

it mentioned that if you only require one volume store, you can set the volume store label to default. If you set the volume store label to default, container developers do not need to specify the --opt VolumeStore=volume_store_label option when they run docker volume create.

so i added the --volume-store parameter with the label set to default, and running the nginx container now works
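For readers landing here with the same error, the fix amounts to one extra option at deploy time. The exact value used is not shown in this thread, so the sketch below is an assumption: it uses the `datastore:label` form described in the installer options page linked above and reuses the image-store datastore for volumes. Check that page for the syntax your release expects.

```shell
# Assumption: volumes are kept on the same datastore as the image store.
datastore='MGMT-LocalDisk1'
# The label "default" lets containers use the store without naming it.
label='default'
volume_store="${datastore}:${label}"

# Print the extra option to append to the vic-machine create command.
printf '%s %s\n' '--volume-store' "$volume_store"

# Appended to the create command from this thread (not run here):
#   ./vic-machine-linux create --name VCH01 ... \
#       --image-store MGMT-LocalDisk1 --volume-store MGMT-LocalDisk1:default
```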

i personally still don't get what volume-store is for ? where is it created ? and why would you need it, or need multiple volume stores ?

fdawg4l commented 7 years ago

i personally still don't get what volume-store is for ? where is it created ? and why would you need it, or need multiple volume stores ?

Generally, enterprises have storage devices that present as different datastores in vsphere. Imagine the following datastores: [ReallyFastButNotBackedUp], [MarginallyFastAndBackedUpNightly], [ReallySlowButBackedUpHourly]. One can create volumes on whichever datastore matches the storage policy / backup policy needed for a given set of containers. The point is to expose the storage options that may be available in vsphere to container admins and users.

mhagen-vmware commented 7 years ago

This appears to be generally resolved and has strayed quite far from the original issue. Please open a new issue if you are seeing any other problems.