vmware / vic

vSphere Integrated Containers Engine is a container runtime for vSphere.
http://vmware.github.io/vic

vic-machine: delete times out, then fails repeatedly on remaining operations #3307

Closed hickeng closed 7 years ago

hickeng commented 7 years ago

If vic-machine delete times out we get the following:

INFO[2016-11-23T09:17:38Z] ### Removing VCH ####
INFO[2016-11-23T09:17:38Z]
INFO[2016-11-23T09:17:38Z] VCH ID: VirtualMachine:vm-1890
INFO[2016-11-23T09:17:43Z] Removing VMs
INFO[2016-11-23T09:17:43Z] Removing image stores
INFO[2016-11-23T09:20:38Z] Removing volume stores
INFO[2016-11-23T09:20:38Z] Deleting volume store "XXX" on Datastore "YYY" at path "ZZZ"
ERRO[2016-11-23T09:20:38Z] Error finding datastore "YYY": context deadline exceeded
INFO[2016-11-23T09:20:38Z] Removing appliance VM network devices
ERRO[2016-11-23T09:20:38Z] Failed to get vm power status "VirtualMachine:vm-1890": context deadline exceeded
INFO[2016-11-23T09:20:38Z] Collecting xxx vpxd.log
ERRO[2016-11-23T09:20:38Z] Failed to collect xxx vpxd.log: context deadline exceeded
WARN[2016-11-23T09:20:38Z] No log data for xxx vpxd.log
ERRO[2016-11-23T09:20:38Z] Failed to browse folder "[YYY] vSphere Integrated Containers - ImageStore/vchname/kvStores": context deadline exceeded
context deadline exceeded
context deadline exceeded
ERRO[2016-11-23T09:20:38Z] --------------------
ERRO[2016-11-23T09:20:38Z] vic-machine-linux failed: Delete timed out: use --timeout to add more time

There are several things wrong with this:

  1. the same timeout issues as #2163
  2. subsequent operations are attempted with the same context that has already expired, which makes the failure output far uglier than it needs to be (see the sketch after this list)
  3. why is deleting the images taking this long?
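
To illustrate point 2: once a Go context's deadline has passed, every call that reuses it fails immediately with context.DeadlineExceeded, which produces exactly the cascade of errors above. A minimal sketch of the failure mode and one remedy (this is not the actual vic-machine code; doStep and the durations are invented for illustration):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// doStep stands in for any vSphere call: it fails immediately if ctx is done.
func doStep(ctx context.Context, name string) error {
	select {
	case <-ctx.Done():
		return fmt.Errorf("%s: %w", name, ctx.Err()) // -> "context deadline exceeded"
	case <-time.After(10 * time.Millisecond):
		return nil
	}
}

func main() {
	// The single deadline shared by the whole delete operation.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	time.Sleep(100 * time.Millisecond) // the deadline has now expired

	// Every further call on the expired ctx fails instantly, producing the
	// cascade of "context deadline exceeded" errors seen in the output above.
	fmt.Println(doStep(ctx, "remove volume stores"))

	// One remedy: give each remaining cleanup step a fresh, short deadline.
	stepCtx, stepCancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer stepCancel()
	fmt.Println(doStep(stepCtx, "remove appliance VM network devices"))
}
```
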
fdawg4l commented 7 years ago

> why is deleting the images taking this long?

What kind of setup is this? VC + NFS?

Chopper3 commented 7 years ago

Yes: 20 x dual E5-v4/512GB 6.0u2 hosts with 10Gbps NICs all round. Normally we store on FC, but as this is a test we're storing on a Windows 2012R2-based NFS mount. The NFS server has 8TB of SSD cache in front of 400TB of 7.2k rpm disks, so it shouldn't be the slowest thing in the world. If I run the delete again 10-15 minutes later it works right away, fast too.

Chopper3 commented 7 years ago

As requested

vch deletion issue inc debug.txt

emlin commented 7 years ago

@Chopper3 I checked the log files. Installing a VCH looks fine; it takes only about 30s. Uploading image files usually takes some time if the network is slow.

The real problem is that deleting the vmdk files sometimes takes far too long:

time=2016-11-24T11:53:42.178114286Z level=debug msg=[BEGIN] [github.com/vmware/vic/lib/install/management.(*Dispatcher).deleteVMFSFiles:178] [NotMe-rm101nas01-vms] vSphere Integrated Containers - ImageStore/rm101vch4050/VIC/4203de8c-0a96-94a9-b872-daba794bd918/images/scratch/scratch.vmdk
time=2016-11-24T11:53:42.268294774Z level=debug msg=[ END ] [github.com/vmware/vic/lib/install/management.(*Dispatcher).deleteVMFSFiles:178] [90.182258ms] [NotMe-rm101nas01-vms] vSphere Integrated Containers - ImageStore/rm101vch4050/VIC/4203de8c-0a96-94a9-b872-daba794bd918/images/scratch/scratch.vmdk
time=2016-11-24T11:53:42.268423515Z level=debug msg=[BEGIN] [github.com/vmware/vic/lib/install/management.(*Dispatcher).deleteVMFSFiles:178] [NotMe-rm101nas01-vms] vSphere Integrated Containers - ImageStore/rm101vch4050/VIC/4203de8c-0a96-94a9-b872-daba794bd918/images/scratch/scratch-flat.vmdk
time=2016-11-24T11:54:22.358272504Z level=debug msg=[ END ] [github.com/vmware/vic/lib/install/management.(*Dispatcher).deleteVMFSFiles:178] [40.089844521s] [NotMe-rm101nas01-vms] vSphere Integrated Containers - ImageStore/rm101vch4050/VIC/4203de8c-0a96-94a9-b872-daba794bd918/images/scratch/scratch-flat.vmdk

The first vmdk file is deleted in 90ms, but the scratch-flat.vmdk file takes 40s. At that rate a handful of files blows the budget (five files at 40s each is already more than the 3m, i.e. 180s, deadline), so after several vmdk deletions the vic-machine delete easily times out.

I know you already ran docker rmi to delete images, but I'm not sure which version of vic you used; older versions leave image files behind, which is why vic-machine still finds image files during delete. And since the first vic-machine delete removed most, or even all, of the image files, a second round of delete is fast. If I'm not mistaken, you don't have to wait before running that second vic-machine delete; it should succeed almost instantly.

This problem is probably related to the NFS datastore, but at this point I can't say much more. vSphere logs would help us triage further; unfortunately, vic-machine delete currently makes a wrong assumption when trying to download the VC logs after the timeout has occurred.

So, if possible, could you please help collect the vpxd and hostd logs? If you're not sure how, and you have govc installed (or can install it), the commands below will do it.

The govc code and documentation are at https://github.com/vmware/govmomi/tree/master/govc. After vic-machine delete fails with the timeout, run the following to collect the VC and host logs:

govc logs -n=1000 > vpxd.log
govc ls /*/host/* | grep -v "Resources" | xargs -If sh -c "echo f;govc logs -n=1000 -host=f" > hostd.log
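
For readability, here is the same host-log collection written as an explicit loop; this is just a sketch assuming bash and govc on the PATH (the `-If` in the one-liner above is xargs's `-I` with `f` as the placeholder string):

```
# Collect the last 1000 hostd log lines from every host (skipping resource pools).
for h in $(govc ls /*/host/* | grep -v Resources); do
  echo "$h"
  govc logs -n=1000 -host="$h"
done > hostd.log
```
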
Chopper3 commented 7 years ago

Thanks for coming back to me Emma. This is with 0.8.0, but I saw the same behaviour in 0.7.0. I have no idea why the one file takes so long to delete; the NAS this is writing to is pretty quick (200 x 900GB 10k rpm with 8 x 400GB SSD in front, on W2012R2 using 10Gbps NICs). And you're right: if I try to delete the VCH again it's very quick, maybe 20-25 seconds all in. Oh, and I've been clearing down the left-over directories with every creation/deletion. I can try this on our FC storage and see if the problem remains; this is only one of our lab environments, and the NAS is used for very early testing before moving onto the much faster FC SAN.

I'll try to get you the logs, though it won't be until Monday at the earliest, sorry; I need to do a small amount of travelling for work next week.

Thanks again.

emlin commented 7 years ago

Thank you @Chopper3 for trying vic in different environments and helping to collect more information. Your NAS setup looks good, and vic isn't meant to restrict customers from using NAS, but unfortunately our test infrastructure doesn't yet cover this scenario. @faijaz Did you ever try NFS?

fdawg4l commented 7 years ago

@emlin @Chopper3, it's likely not the protocol (NFS) but the filesystem on the NAS. What filesystem is your NAS running, and is it on the supported hardware list? Our files are sparse, and traditional filesystems don't like sparse files.

Chopper3 commented 7 years ago

@fdawg4l This is in our lab, so it isn't an especially powerful NAS. I gave the specs above, but it's an HPE DL380 Gen9 with a pair of 10-core Xeons, 256GB memory, and 200 x 900GB 10k rpm SAS disks in RAID 60, with 8 x 400GB SAS SSDs in RAID 10 in front as a tiered cache using Storage Spaces. It's formatted as NTFS, the NICs are Intel 560 10Gbps, and the server is doing nothing else and is less than 3% full. When we benchmarked it using 'fio' from a physical Linux client it flooded the 10Gbps link on both sequential read and write (though not on random). I hope this helps. We have a number of other options in our lab (NetApp, various HPE SAN models including 3Par); we would never consider implementing VIC on this NAS in production or reference, where it'd be much higher performance.

Chopper3 commented 7 years ago

@emlin Hi, I finally got around to running the govc commands. The longer one wouldn't run, but I assumed you just wanted the host logs off the host running the VCH, so I did that. I've noticed that if I create a VCH and then immediately delete it, that works every time, whereas if I pull/run something in a VCH and then try deleting, I get that error every time. So I've attached two bunches of logs, both working and failing, in case that helps. deletes-worked.zip deletes-failed.zip

fdawg4l commented 7 years ago

> it's formatted as NTFS

We're researching this configuration. We're wondering whether an NTFS-backed, NFS-mounted datastore is supported by vSphere at all, and whether it's supported by vSphere for this use case. We use sparse files for our vmdks (i.e. thin provisioned), and if the underlying filesystem (NTFS) is flattening them, that could explain why deletion is taking a "long" time. We'll get back to you when we have more information. One quick way to check sparseness is sketched below.
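
As a hedged aside: on any Linux box that can mount the NFS export directly, comparing a vmdk's apparent size with the blocks it actually occupies shows whether it is still sparse. The path here is purely illustrative:

```
# A sparse/thin file allocates far fewer blocks than its apparent size suggests.
f="/mnt/nfs/ImageStore/vchname/images/scratch/scratch-flat.vmdk"  # illustrative path
ls -l "$f"                   # apparent size in bytes
du -h "$f"                   # space actually allocated on disk
du -h --apparent-size "$f"   # apparent size again, for direct comparison (GNU du)
```
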

emlin commented 7 years ago

@Chopper3 Many thanks for collecting these logs. The delete problem only appears after you've used the VCH because, when you pull an image, we write vmdk files to persist it, and those images are then used to create containers. It's the deletion of those vmdk files that takes too much time, and we suspect that's related to the filesystem you're using, NTFS; we're researching it now, so we're still not sure about the root cause.

btw, looking at the hostd logs, I found some interesting information, like:

2016-11-29T12:45:20.102Z info hostd[822C2B70] [Originator@6876 sub=Vimsvc.ha-eventmgr] Event 9284 : Device naa.6001438009b0549f00008000061f0000 performance has deteriorated. I/O latency increased from average value of 2810 microseconds to 447230 microseconds.

Since there's no vic-machine log here, I can only guess that this event happened during the vic-machine image deletion.
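
For anyone triaging a similar case, these latency events can be pulled out of the host logs collected earlier with a one-line search (assuming the hostd.log filename used above):

```
# Line-numbered matches make it easy to correlate with surrounding timestamps.
grep -n "performance has deteriorated" hostd.log
```
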

fdawg4l commented 7 years ago

time=2016-11-24T11:54:22.358272504Z level=debug msg=[ END ] [github.com/vmware/vic/lib/install/management.(*Dispatcher).deleteVMFSFiles:178] [40.089844521s] [NotMe-rm101nas01-vms] vSphere Integrated Containers - ImageStore/rm101vch4050/VIC/4203de8c-0a96-94a9-b872-daba794bd918/images/scratch/scratch-flat.vmdk

40s to delete this file suggests it's thick rather than sparse (or thin) provisioned. Still looking for details as to whether this is a supported configuration.

Chopper3 commented 7 years ago

Hi, I tried the whole create/run/delete cycle on an FC SAN. While the create phase took longer, since there's real activity on the SAN as opposed to the NAS, the deletion ran perfectly, so I guess this could easily be an NTFS-based-NAS/sparse-file issue after all. One thing I have noticed with a VCH deletion is that while the image-store directory gets deleted, the volume-store doesn't. Is that expected behaviour?

emlin commented 7 years ago

Yes, it is. If you don't specify --force, we leave the volume store in place on purpose: users may have written data into volumes, and if a new VCH is created with the same volume store configuration, they can mount the existing data volumes into new containers. That's why we intentionally leave the volume store there.
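
For reference, a minimal sketch of the two behaviours described above; the target, user, and name values are placeholders, and the exact flag set may vary between vic-machine versions:

```
# Default: the volume store (and its data volumes) is left in place.
vic-machine-linux delete --target vcenter.example.com --user administrator@vsphere.local --name myvch

# With --force, the volume store and its data are removed as well.
vic-machine-linux delete --target vcenter.example.com --user administrator@vsphere.local --name myvch --force
```
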

emlin commented 7 years ago

> Hi, I tried the whole create/run/delete cycle on an FC SAN. While the create phase took longer, since there's real activity on the SAN as opposed to the NAS, the deletion ran perfectly, so I guess this could easily be an NTFS-based-NAS/sparse-file issue after all.

good to know. 😄

Chopper3 commented 7 years ago

Thanks for the reasoning behind the --force option, that's very useful

hickeng commented 7 years ago

The timeout portion of this issue should be addressed in #2163

Chopper3 commented 7 years ago

Brilliant, thank you George. Any news on the replacement for manually editing the host firewall configs? You mentioned there were a few options you were looking at (a VIB and VAMI, maybe?)

Regards,

Phil.


hickeng commented 7 years ago

@Chopper3 At the moment we've got vic-machine update firewall --allow, which will persistently enable the network serial port firewall rule. This is functional but broader than ideal. We keep discussing the use of a VIB, but the added management burden a VIB imposes isn't desirable: it's not such an issue for customers who already have VIBs deployed, but it's not a requirement we want to introduce just for VIC.
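
A hedged sketch of that command; the target, user, and cluster values are placeholders, and the flags may differ across releases:

```
# Persistently opens the firewall rule needed for the network serial port
# on the hosts in the target compute resource.
vic-machine-linux update firewall \
  --target vcenter.example.com/dc1 \
  --user administrator@vsphere.local \
  --compute-resource cluster1 \
  --thumbprint <vc-thumbprint> \
  --allow
```
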

@karthik-narayan is the guy to chat with if there's a hard requirement for very strict persistent firewall config - which would mean optional VIB at this point.

corrieb commented 7 years ago

@Chopper3 Is the original issue still outstanding? It can take a while to delete images on certain types of vSphere datastore, but even if vic-machine times out, it should then subsequently clean up correctly

mdubya66 commented 7 years ago

Closing this issue as CNR (could not reproduce); not seen in newer versions. Please re-open if it recurs.