vmware / vic

vSphere Integrated Containers Engine is a container runtime for vSphere.
http://vmware.github.io/vic

Nightly-6.7-5-22-NFS-Volume-Kill NFS Server device in use #7621

Closed rajanashok closed 5 years ago

rajanashok commented 6 years ago
Documentation:  
Runs the given command in the system and returns the RC and output.

Start / End / Elapsed:  20180330 07:25:46.040 / 20180330 07:26:54.609 / 00:01:08.569
07:25:46.044    INFO    Running command 'docker -H 10.161.252.5:2376 --tls run -v nfs-volume_VCH-0-8152:/mydata busybox sh -c "tail -5 /mydata/test_nfs_kill.txt" 2>&1'.    
07:26:54.609    INFO    ${rc} = 125 
07:26:54.609    INFO    ${tailOutput} = docker: Error response from daemon: Conflict error from portlayer: device nfs-volume_VCH-0-8152 in use.

https://storage.cloud.google.com/vic-ci-logs/vic_nightly_logs_2018-03-30-05-07-16.zip?authuser=1

rajanashok commented 6 years ago

Seen again on 6.0 Nightly. https://storage.cloud.google.com/vic-ci-logs/vic_nightly_logs_2018-04-05-20-32-56.zip?authuser=1

hickeng commented 6 years ago

Copying this comment

VIC version:

v1.4.0-dev-17605-6144b31

Deployment details:

Nightly 6.7 2018-03-20 Kill NFS Server

Logs:

Running command 'docker -H 10.193.52.242:2376 --tls run -v nfs-volume_VCH-0-2514:/mydata busybox sh -c "tail -5 /mydata/test_nfs_kill.txt" 2>&1'.
${rc} = 125
${tailOutput} = docker: Error response from daemon: Conflict error from portlayer: device nfs-volume_VCH-0-2514 in use.

https://storage.cloud.google.com/vic-ci-logs/vic_nightly_logs_2018-03-20-05-11-24.zip?authuser=1

hickeng commented 6 years ago

The first curiosity to investigate for this error is that NFS volumes should never return a "device in use" error at all: they exist explicitly to allow concurrent access.
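
For context, here is a minimal sketch of the intended semantics, purely illustrative and not the portlayer implementation: a VMDK-backed volume is attached exclusively, so a second consumer is a conflict, while an NFS-backed volume can be mounted concurrently by any number of containers.

package main

import (
	"errors"
	"fmt"
)

// Illustrative volume model only; the real portlayer types differ.
// VMDK-backed volumes are attached exclusively to one container VM, while
// NFS-backed volumes are shared mounts usable by any number of containers.
type volume struct {
	name    string
	nfs     bool
	inUseBy int
}

// errDeviceInUse mirrors the "device ... in use" conflict seen in the test logs.
var errDeviceInUse = errors.New("device in use")

func (v *volume) mount() error {
	if !v.nfs && v.inUseBy > 0 {
		// Exclusive VMDK semantics: a second consumer is a conflict.
		return errDeviceInUse
	}
	// NFS semantics: concurrent mounts are expected and allowed.
	v.inUseBy++
	return nil
}

func main() {
	shared := &volume{name: "nfs-volume_VCH-0-8152", nfs: true}
	fmt.Println(shared.mount(), shared.mount()) // <nil> <nil>: concurrent use is fine

	exclusive := &volume{name: "nfs-volume_VCH-0-8152", nfs: false}
	fmt.Println(exclusive.mount(), exclusive.mount()) // <nil> device in use
}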

We see the following obvious error in the portlayer log:

types.GenericVmConfigFault{
   VmConfigFault:types.VmConfigFault{
      VimFault:types.VimFault{
         MethodFault:types.MethodFault{
            FaultCause:(*types.LocalizedMethodFault)(nil),
            FaultMessage:[]types.LocalizableMessage{
               types.LocalizableMessage{
                  DynamicData:types.DynamicData{},
                  Key:"msg.fileio.lock",
                  Arg:[]types.KeyAnyValue(nil),
                  Message:"Failed to lock the file"
               },
               types.LocalizableMessage{
                  DynamicData:types.DynamicData{},
                  Key:"msg.disk.noBackEnd",
                  Arg:[]types.KeyAnyValue{
                     types.KeyAnyValue{
                        DynamicData:types.DynamicData{},
                        Key:"1",
                        Value:"/vmfs/volumes/5ac6e437-8b95305d-113f-0200449c8b82/VCH-0-6243-VOL/volumes/nfs-volume_VCH-0-6243/nfs-volume_VCH-0-6243.vmdk"
                     }
                  },
                  Message:"Cannot open the disk '/vmfs/volumes/5ac6e437-8b95305d-113f-0200449c8b82/VCH-0-6243-VOL/volumes/nfs-volume_VCH-0-6243/nfs-volume_VCH-0-6243.vmdk' or one of the snapshot disks it depends on. "
               },
               types.LocalizableMessage{
                  DynamicData:types.DynamicData{},
                  Key:"msg.moduletable.powerOnFailed",
                  Arg:[]types.KeyAnyValue{
                     types.KeyAnyValue{
                        DynamicData:types.DynamicData{},
                        Key:"1",
                        Value:"Disk"
                     }
                  },
                  Message:"Module Disk power on failed. "
               },
               types.LocalizableMessage{
                  DynamicData:types.DynamicData{},
                  Key:"msg.vmx.poweron.failed",
                  Arg:[]types.KeyAnyValue(nil),
                  Message:"Failed to start the virtual machine."
               }
            }
         }
      }
   },
   Reason:"Failed to lock the file"
}

It's immediately obvious that the volume we're using is not an NFS volume.

Tracking down why it's not an NFS volume: after the VCH is rebooted, we see the following error:

Apr  6 2018 03:21:33.057Z INFO  op=290.58: Creating nfs volumestore nfsVolumeStore on nfs://10.161.148.36/store?uid=0&gid=0
Apr  6 2018 03:21:33.057Z DEBUG op=290.58: Mounting nfs://10.161.148.36/store?uid=0&gid=0
Apr  6 2018 03:23:40.329Z ERROR op=290.58: unable to mount volume: dial tcp :812->10.161.148.36:2049: getsockopt: connection timed out
Apr  6 2018 03:23:40.329Z ERROR op=290.58: error occurred while attempting to mount volumestore (nfsVolumeStore). err: (dial tcp :812->10.161.148.36:2049: getsockopt: connection timed out)
Apr  6 2018 03:23:40.329Z ERROR op=290.58: dial tcp :812->10.161.148.36:2049: getsockopt: connection timed out

So what is happening is that we're failing to reconnect to and index the NFS volume store. That means that when we attempt to use a volume we previously created, it's not found in the volume cache and is instead created in the default volume store as a VMDK-based volume.
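
A minimal sketch of that fallback path, using hypothetical names rather than the actual portlayer types: because the NFS store was never indexed, the cache lookup misses and the name is recreated on the default store.

package main

import "fmt"

// Illustrative types only; the real portlayer volume cache and stores differ.
type volume struct {
	name  string
	store string
}

type volumeStore struct {
	name string
}

func (s *volumeStore) create(name string) (*volume, error) {
	return &volume{name: name, store: s.name}, nil
}

// resolveVolume sketches the behaviour described above: if the NFS volume
// store failed to mount at VCH start, its volumes were never indexed, so the
// cache lookup misses even for a pre-existing NFS volume and the name is
// recreated on the default (VMFS) store as a VMDK-backed volume.
func resolveVolume(cache map[string]*volume, defaultStore *volumeStore, name string) (*volume, error) {
	if v, ok := cache[name]; ok {
		return v, nil
	}
	v, err := defaultStore.create(name)
	if err != nil {
		return nil, err
	}
	cache[name] = v
	return v, nil
}

func main() {
	// Empty cache because mounting nfsVolumeStore timed out during reboot.
	cache := map[string]*volume{}
	v, _ := resolveVolume(cache, &volumeStore{name: "default"}, "nfs-volume_VCH-0-6243")
	fmt.Printf("%s ended up on store %q\n", v.name, v.store)
}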

Short of refusing to start fully if a defined NFS volume store cannot be reached, I'm not sure what else can be done here. The worst of it is that we'll end up with a name conflict if/when the NFS volume store becomes available again.

hickeng commented 6 years ago

In the NFS server logs we see the following from the time period in which we should be mounting and indexing the volume store (20:21:31):

[Thu Apr  5 20:20:47 2018] svc_recv: found XPT_CLOSE
[Thu Apr  5 20:20:47 2018] svc: svc_delete_xprt(ffff88003b9c1000)
[Thu Apr  5 20:20:47 2018] svc: svc_tcp_sock_detach(ffff88003b9c1000)
[Thu Apr  5 20:20:47 2018] svc: svc_sock_detach(ffff88003b9c1000)
[Thu Apr  5 20:20:47 2018] svc: svc_sock_free(ffff88003b9c1000)
[Thu Apr  5 20:20:47 2018] svc: server ffff88003b94c000 waiting for data (to = 900000)
[Thu Apr  5 20:21:13 2018] NFSD: laundromat service - starting
[Thu Apr  5 20:21:13 2018] NFSD: laundromat_main - sleeping for 90 seconds
[Thu Apr  5 20:21:31 2018] nfsd: exp_rootfh(/store [ffff88003d400240] *:dm-0/261634)
[Thu Apr  5 20:21:31 2018] nfsd: fh_compose(exp fc:00/261634 /store, ino=261634)
[Thu Apr  5 20:21:38 2018] svc_age_temp_xprts
[Thu Apr  5 20:22:44 2018] NFSD: laundromat service - starting
[Thu Apr  5 20:22:44 2018] NFSD: laundromat_main - sleeping for 90 seconds
[Thu Apr  5 20:24:14 2018] NFSD: laundromat service - starting
[Thu Apr  5 20:24:14 2018] NFSD: laundromat_main - sleeping for 90 seconds
[Thu Apr  5 20:25:44 2018] NFSD: laundromat service - starting
[Thu Apr  5 20:25:44 2018] NFSD: laundromat_main - sleeping for 90 seconds
[Thu Apr  5 20:27:14 2018] NFSD: laundromat service - starting
[Thu Apr  5 20:27:14 2018] NFSD: laundromat_main - sleeping for 90 seconds
[Thu Apr  5 20:27:38 2018] svc_age_temp_xprts
[Thu Apr  5 20:27:38 2018] queuing xprt ffff88003c76e000 for closing

versus this, which I think is the initial mount/mkdir pair for a read-only share:

[Thu Apr  5 20:10:59 2018] nfsd: exp_rootfh(/exports/storage1 [ffff88003d4a5180] *:sdb1/2)
[Thu Apr  5 20:10:59 2018] nfsd: fh_compose(exp 08:11/2 /, ino=2)
[Thu Apr  5 20:10:59 2018] svc: socket ffff88003c439500 TCP (listen) state change 10
[Thu Apr  5 20:10:59 2018] svc: transport ffff88003ca55000 served by daemon ffff88003c5c0000
[Thu Apr  5 20:10:59 2018] svc: tcp_accept ffff88003ca55000 sock ffff88003cf1c780
[Thu Apr  5 20:10:59 2018] nfsd: connect from 10.161.136.49, port=632
[Thu Apr  5 20:10:59 2018] svc: svc_setup_socket ffff88003cf48280
[Thu Apr  5 20:10:59 2018] setting up TCP socket for reading
[Thu Apr  5 20:10:59 2018] svc: svc_setup_socket created ffff88002cbe6000 (inet ffff88003c43c600)
[Thu Apr  5 20:10:59 2018] svc: transport ffff88002cbe6000 served by daemon ffff88003c49e000
[Thu Apr  5 20:10:59 2018] svc: transport ffff88003ca55000 served by daemon ffff88003c49c000
[Thu Apr  5 20:10:59 2018] svc: server ffff88003c5c0000 waiting for data (to = 900000)
[Thu Apr  5 20:10:59 2018] svc: server ffff88003c49e000, pool 0, transport ffff88002cbe6000, inuse=2
[Thu Apr  5 20:10:59 2018] svc: tcp_recv ffff88002cbe6000 data 1 conn 0 close 0
[Thu Apr  5 20:10:59 2018] svc: socket ffff88002cbe6000 recvfrom(ffff88002cbe62b8, 4) = -11
[Thu Apr  5 20:10:59 2018] RPC: TCP recv_record got -11
[Thu Apr  5 20:10:59 2018] RPC: TCP recvfrom got EAGAIN
[Thu Apr  5 20:10:59 2018] svc: got len=0
[Thu Apr  5 20:10:59 2018] svc: server ffff88003c49e000 waiting for data (to = 900000)
[Thu Apr  5 20:10:59 2018] svc: tcp_accept ffff88003ca55000 sock ffff88003cf1c780
[Thu Apr  5 20:10:59 2018] svc: server ffff88003c49c000 waiting for data (to = 900000)
[Thu Apr  5 20:10:59 2018] svc: socket ffff88003c43c600 TCP data ready (svsk ffff88002cbe6000)
[Thu Apr  5 20:10:59 2018] svc: transport ffff88002cbe6000 served by daemon ffff88003c49c000
[Thu Apr  5 20:10:59 2018] svc: server ffff88003c49c000, pool 0, transport ffff88002cbe6000, inuse=2
[Thu Apr  5 20:10:59 2018] svc: tcp_recv ffff88002cbe6000 data 1 conn 0 close 0
[Thu Apr  5 20:10:59 2018] svc: socket ffff88002cbe6000 recvfrom(ffff88002cbe62bc, 0) = 4
[Thu Apr  5 20:10:59 2018] svc: TCP record, 92 bytes
[Thu Apr  5 20:10:59 2018] svc: socket ffff88002cbe6000 recvfrom(ffff88003c56505c, 4004) = 92
[Thu Apr  5 20:10:59 2018] svc: TCP final record (92 bytes)
[Thu Apr  5 20:10:59 2018] svc: got len=92
[Thu Apr  5 20:10:59 2018] svc: svc_authenticate (1)
[Thu Apr  5 20:10:59 2018] RPC:       Want update, refage=120, age=0
[Thu Apr  5 20:10:59 2018] svc: calling dispatcher
[Thu Apr  5 20:10:59 2018] nfsd_dispatch: vers 3 proc 19
[Thu Apr  5 20:10:59 2018] nfsd: FSINFO(3)   20: 00060001 7c8e2bed 0c4f4a08 da7a02a6 8405219b 00000000
[Thu Apr  5 20:10:59 2018] nfsd: fh_verify(20: 00060001 7c8e2bed 0c4f4a08 da7a02a6 8405219b 00000000)
[Thu Apr  5 20:10:59 2018] RPC:       Want update, refage=120, age=0
[Thu Apr  5 20:10:59 2018] found domain *
[Thu Apr  5 20:10:59 2018] found fsidtype 6
[Thu Apr  5 20:10:59 2018] found fsid length 16
[Thu Apr  5 20:10:59 2018] Path seems to be </exports/storage1>
[Thu Apr  5 20:10:59 2018] Found the path /exports/storage1
[Thu Apr  5 20:10:59 2018] svc: socket ffff88002cbe6000 sendto([ffff88003c566000 84... ], 84) = 84 (addr 10.161.136.49, port=632)
[Thu Apr  5 20:10:59 2018] svc: server ffff88003c49c000 waiting for data (to = 900000)
[Thu Apr  5 20:10:59 2018] svc: socket ffff88003c43c600 TCP data ready (svsk ffff88002cbe6000)
[Thu Apr  5 20:10:59 2018] svc: transport ffff88002cbe6000 served by daemon ffff88003c49c000
[Thu Apr  5 20:10:59 2018] svc: server ffff88003c49c000, pool 0, transport ffff88002cbe6000, inuse=2
[Thu Apr  5 20:10:59 2018] svc: tcp_recv ffff88002cbe6000 data 1 conn 0 close 0
[Thu Apr  5 20:10:59 2018] svc: socket ffff88002cbe6000 recvfrom(ffff88002cbe62bc, 0) = 4
[Thu Apr  5 20:10:59 2018] svc: TCP record, 132 bytes
[Thu Apr  5 20:10:59 2018] svc: socket ffff88002cbe6000 recvfrom(ffff88003c565084, 3964) = 132
[Thu Apr  5 20:10:59 2018] svc: TCP final record (132 bytes)
[Thu Apr  5 20:10:59 2018] svc: got len=132
[Thu Apr  5 20:10:59 2018] svc: svc_authenticate (1)
[Thu Apr  5 20:10:59 2018] svc: calling dispatcher
[Thu Apr  5 20:10:59 2018] nfsd_dispatch: vers 3 proc 9
[Thu Apr  5 20:10:59 2018] nfsd: MKDIR(3)    20: 00060001 7c8e2bed 0c4f4a08 da7a02a6 8405219b 00000000 volumes
[Thu Apr  5 20:10:59 2018] nfsd: fh_verify(20: 00060001 7c8e2bed 0c4f4a08 da7a02a6 8405219b 00000000)
[Thu Apr  5 20:10:59 2018] fh_verify: / permission failure, acc=3, error=30
[Thu Apr  5 20:10:59 2018] svc: socket ffff88002cbe6000 sendto([ffff88003dae4000 124... ], 124) = 124 (addr 10.161.136.49, port=632)
[Thu Apr  5 20:10:59 2018] svc: server ffff88003c49c000 waiting for data (to = 900000)
[Thu Apr  5 20:11:00 2018] svc: socket ffff88003c43c600 TCP (connected) state change 8 (svsk ffff88002cbe6000)
[Thu Apr  5 20:11:00 2018] svc: transport ffff88002cbe6000 served by daemon ffff88003c49c000
[Thu Apr  5 20:11:00 2018] svc: socket ffff88003c43c600 TCP data ready (svsk ffff88002cbe6000)
[Thu Apr  5 20:11:00 2018] svc: transport ffff88002cbe6000 busy, not enqueued
[Thu Apr  5 20:11:00 2018] svc_recv: found XPT_CLOSE
[Thu Apr  5 20:11:00 2018] svc: svc_delete_xprt(ffff88002cbe6000)
[Thu Apr  5 20:11:00 2018] svc: svc_tcp_sock_detach(ffff88002cbe6000)
[Thu Apr  5 20:11:00 2018] svc: svc_sock_detach(ffff88002cbe6000)
[Thu Apr  5 20:11:00 2018] svc: svc_sock_free(ffff88002cbe6000)
[Thu Apr  5 20:11:00 2018] svc: server ffff88003c49c000 waiting for data (to = 900000)

When comparing them I'm not seeing any of the expected "nfsd: connect from 10.161.136.49, port=632" entries, or similar, that relate to a new RPC connection being set up. I'm unsure whether this is significant or not.

mdubya66 commented 6 years ago

Making this P1. It needs to be fixed, and much work is needed to get there.

renmaosheng commented 6 years ago

This still happens in a recent nightly run: https://vic-jenkins.eng.vmware.com/job/vic-master-scenario/21/robot/report/log.html#s1-s4-s14 Yanzhao, could you please take a look? Thanks.

renmaosheng commented 6 years ago

The latest run passed: https://vic-jenkins.eng.vmware.com/job/vic-master-scenario/40/robot/report/log.html Yanzhao, if this is no longer an issue, please close it.

zjs commented 5 years ago

We should change the endpoint VM so that it does not start the API server if it cannot successfully connect to the NFS volume store it is configured with. This indicates a configuration issue, and failing ensures that it is detected.
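
A minimal sketch of that fail-fast behaviour, assuming a hypothetical mountVolumeStore helper rather than the actual VIC startup code:

package main

import (
	"errors"
	"fmt"
	"net/url"
)

// mountVolumeStore is a hypothetical stand-in for the portlayer's NFS volume
// store mount; here it always fails, mimicking the unreachable server above.
func mountVolumeStore(u *url.URL) error {
	return errors.New("dial tcp " + u.Host + ":2049: connection timed out")
}

// initStores sketches the suggested behaviour: if any configured volume store
// cannot be mounted, return an error instead of continuing, so the API server
// never comes up with a half-initialised volume configuration.
func initStores(stores []*url.URL) error {
	for _, u := range stores {
		if err := mountVolumeStore(u); err != nil {
			return fmt.Errorf("refusing to start: volume store %q unavailable: %w", u.String(), err)
		}
	}
	return nil
}

func main() {
	u, _ := url.Parse("nfs://10.161.148.36/store?uid=0&gid=0")
	if err := initStores([]*url.URL{u}); err != nil {
		// In the proposed behaviour the endpoint VM would exit here rather
		// than report the API server as healthy.
		fmt.Println(err)
	}
}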

YanzhaoLi commented 5 years ago

@hickeng, @zjs After discussion with @wjun, I don't think refusing to start the API server works. If the user fails to repair the NFS server, the API server will never start successfully. And we don't have a way to delete a VolumeStore from a VCH.

My investigation of rebooting the VCH with regard to volumes:

Also, when I tried to stop the portlayer server when the NFS volume store fails, I noticed that the portlayer is set to always restart (lib/install/management/appliance.go, line 692).
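
For illustration, the kind of "always restart" supervision being referred to looks roughly like the following generic sketch; this is not the code at the line cited above.

package main

import (
	"fmt"
	"time"
)

// superviseAlways sketches an "always restart" policy: even if the supervised
// process (a stand-in for the portlayer here) exits because the NFS volume
// store is unavailable, it is started again, so simply stopping the portlayer
// on mount failure would not keep it stopped.
func superviseAlways(run func() error, attempts int) {
	for i := 0; i < attempts; i++ {
		if err := run(); err != nil {
			fmt.Printf("portlayer exited: %v; restarting\n", err)
		}
		// Back off briefly before the next restart attempt.
		time.Sleep(100 * time.Millisecond)
	}
}

func main() {
	superviseAlways(func() error {
		return fmt.Errorf("volume store nfsVolumeStore unavailable")
	}, 3)
}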