rposudnevskiy / RBDSR

RBDSR - XenServer/XCP-ng Storage Manager plugin for CEPH
GNU Lesser General Public License v2.1
58 stars 23 forks source link

Existing symlinks causing map fail #47

Closed mdmeier closed 7 years ago

mdmeier commented 7 years ago

A weird network issue caused a disconnect between the Ceph SR and Xenserver pool. Cloudstack shut-down a number of VMs that wouldn't come back up once connectivity was re-established. It was a weird, difficult problem to debug, and may have exposed a bug:

When attempting to attach a VDI to a VM the following occured in /var/log/SMlog

Jun 1 22:53:27 cloud102-6 SM: [22135] ['rbd-nbd', '--nbds_max', '64', 'map', 'RBD_XenStorage-2dd455e9-0de4-4ed8-af62-64e1a4ace678/VHD-54c1f244-d62c-47ac-90d2-f40be0f6fe5a', '--name', 'client.admin'] Jun 1 22:53:27 cloud102-6 SM: [22135] pread SUCCESS Jun 1 22:53:27 cloud102-6 SM: [22135] ['ln', '-s', '/dev/nbd4', '/dev/nbd/RBD_XenStorage-2dd455e9-0de4-4ed8-af62-64e1a4ace678/VHD-54c1f244-d62c-47ac-90d2-f40be0f6fe5a'] Jun 1 22:53:28 cloud102-6 SM: [22135] pread SUCCESS Jun 1 22:53:28 cloud102-6 SM: [22135] ['ln', '-s', '/dev/nbd4', '/run/sr-mount/2dd455e9-0de4-4ed8-af62-64e1a4ace678/54c1f244-d62c-47ac-90d2-f40be0f6fe5a'] Jun 1 22:53:28 cloud102-6 SM: [22135] FAILED in util.pread: (rc 1) stdout: '', stderr: 'ln: failed to create symbolic link '/run/sr-mount/2dd455e9-0de4-4ed8-af62-64e1a4ace678/54c1f244-d62c-47ac-90d2-f40be0f6fe5a': File exists Jun 1 22:53:28 cloud102-6 SM: [22135] ' Jun 1 22:53:28 cloud102-6 SM: [21718] Exception in activate/attach Jun 1 22:53:28 cloud102-6 SM: [21718] Removed host key host_OpaqueRef:24b3b5c5-996a-6b74-90c8-0c930c1ba751 for 54c1f244-d62c-47ac-90d2-f40be0f6fe5a Jun 1 22:53:28 cloud102-6 SM: [21718] BLKTAP2:<function _activate_locked at 0x13fa6e0>: EXCEPTION <class 'XenAPI.Failure'>, ['XENAPI_PLUGIN_FAILURE', 'map', 'CommandException', 'Operation not permitted'] Jun 1 22:53:28 cloud102-6 SM: [21718] File "/opt/xensource/sm/blktap2.py", line 87, in wrapper Jun 1 22:53:28 cloud102-6 SM: [21718] ret = op(self, args) Jun 1 22:53:28 cloud102-6 SM: [21718] File "/opt/xensource/sm/blktap2.py", line 1574, in _activate_locked Jun 1 22:53:28 cloud102-6 SM: [21718] self._attach(sr_uuid, vdi_uuid) Jun 1 22:53:28 cloud102-6 SM: [21718] File "/opt/xensource/sm/blktap2.py", line 1639, in _attach Jun 1 22:53:28 cloud102-6 SM: [21718] attach_info = xmlrpclib.loads(self.target.attach(sr_uuid, vdi_uuid))[0][0] Jun 1 22:53:28 cloud102-6 SM: [21718] File "/opt/xensource/sm/blktap2.py", line 1112, in attach Jun 1 22:53:28 cloud102-6 SM: [21718] return self.vdi.attach(sr_uuid, vdi_uuid) Jun 1 22:53:28 cloud102-6 SM: [21718] File "/opt/xensource/sm/RBDSR", line 418, in attach Jun 1 22:53:28 cloud102-6 SM: [21718] self._map_VHD(vdi_uuid, self.size, "none") Jun 1 22:53:28 cloud102-6 SM: [21718] File "/opt/xensource/sm/cephutils.py", line 530, in _map_VHD Jun 1 22:53:28 cloud102-6 SM: [21718] self._call_plugin('map',args) Jun 1 22:53:28 cloud102-6 SM: [21718] File "/opt/xensource/sm/cephutils.py", line 435, in _call_plugin Jun 1 22:53:28 cloud102-6 SM: [21718] if not self.session.xenapi.host.call_plugin(host_ref, "ceph_plugin", op, args): Jun 1 22:53:28 cloud102-6 SM: [21718] File "/usr/lib/python2.7/site-packages/XenAPI.py", line 254, in call Jun 1 22:53:28 cloud102-6 SM: [21718] return self.send(self.name, args) Jun 1 22:53:28 cloud102-6 SM: [21718] File "/usr/lib/python2.7/site-packages/XenAPI.py", line 150, in xenapi_request Jun 1 22:53:28 cloud102-6 SM: [21718] result = _parse_result(getattr(self, methodname)(full_params)) Jun 1 22:53:28 cloud102-6 SM: [21718] File "/usr/lib/python2.7/site-packages/XenAPI.py", line 228, in _parse_result Jun 1 22:53:28 cloud102-6 SM: [21718] raise Failure(result['ErrorDescription']) Jun 1 22:53:28 cloud102-6 SM: [21718] Jun 1 22:53:28 cloud102-6 SM: [21718] Raising exception [46, The VDI is not available [opterr=['XENAPI_PLUGIN_FAILURE', 'map', 'CommandException', 'Operation not permitted']]] Jun 1 22:53:28 cloud102-6 SM: [21718] lock: released /var/lock/sm/54c1f244-d62c-47ac-90d2-f40be0f6fe5a/vdi Jun 1 22:53:28 cloud102-6 SM: [21718] generic exception: vdi_activate: EXCEPTION <class 'SR.SROSError'>, The VDI is not available [opterr=['XENAPI_PLUGIN_FAILURE', 'map', 'CommandException', 'Operation not permitted']] Jun 1 22:53:28 cloud102-6 SM: [21718] File "/opt/xensource/sm/SRCommand.py", line 110, in run Jun 1 22:53:28 cloud102-6 SM: [21718] return self._run_locked(sr) Jun 1 22:53:28 cloud102-6 SM: [21718] File "/opt/xensource/sm/SRCommand.py", line 159, in _run_locked Jun 1 22:53:28 cloud102-6 SM: [21718] rv = self._run(sr, target) Jun 1 22:53:28 cloud102-6 SM: [21718] File "/opt/xensource/sm/SRCommand.py", line 264, in _run Jun 1 22:53:28 cloud102-6 SM: [21718] writable, caching_params) Jun 1 22:53:28 cloud102-6 SM: [21718] File "/opt/xensource/sm/blktap2.py", line 1541, in activate Jun 1 22:53:28 cloud102-6 SM: [21718] if self._activate_locked(sr_uuid, vdi_uuid, options): Jun 1 22:53:28 cloud102-6 SM: [21718] File "/opt/xensource/sm/blktap2.py", line 95, in wrapper Jun 1 22:53:28 cloud102-6 SM: [21718] raise xs_errors.XenError(excType, opterr=msg) Jun 1 22:53:28 cloud102-6 SM: [21718] File "/opt/xensource/sm/xs_errors.py", line 52, in init Jun 1 22:53:28 cloud102-6 SM: [21718] raise SR.SROSError(errorcode, errormessage) Jun 1 22:53:28 cloud102-6 SM: [21718] Jun 1 22:53:28 cloud102-6 SM: [21718] ***** RBD: EXCEPTION <class 'SR.SROSError'>, The VDI is not available [opterr=['XENAPI_PLUGIN_FAILURE', 'map', 'CommandException', 'Operation not permitted']] Jun 1 22:53:28 cloud102-6 SM: [21718] File "/opt/xensource/sm/SRCommand.py", line 353, in run Jun 1 22:53:28 cloud102-6 SM: [21718] ret = cmd.run(sr) Jun 1 22:53:28 cloud102-6 SM: [21718] File "/opt/xensource/sm/SRCommand.py", line 110, in run Jun 1 22:53:28 cloud102-6 SM: [21718] return self._run_locked(sr) Jun 1 22:53:28 cloud102-6 SM: [21718] File "/opt/xensource/sm/SRCommand.py", line 159, in _run_locked Jun 1 22:53:28 cloud102-6 SM: [21718] rv = self._run(sr, target) Jun 1 22:53:28 cloud102-6 SM: [21718] File "/opt/xensource/sm/SRCommand.py", line 264, in _run Jun 1 22:53:28 cloud102-6 SM: [21718] writable, caching_params) Jun 1 22:53:28 cloud102-6 SM: [21718] File "/opt/xensource/sm/blktap2.py", line 1541, in activate Jun 1 22:53:28 cloud102-6 SM: [21718] if self._activate_locked(sr_uuid, vdi_uuid, options): Jun 1 22:53:28 cloud102-6 SM: [21718] File "/opt/xensource/sm/blktap2.py", line 95, in wrapper Jun 1 22:53:28 cloud102-6 SM: [21718] raise xs_errors.XenError(excType, opterr=msg) Jun 1 22:53:28 cloud102-6 SM: [21718] File "/opt/xensource/sm/xs_errors.py", line 52, in init Jun 1 22:53:28 cloud102-6 SM: [21718] raise SR.SROSError(errorcode, errormessage) Jun 1 22:53:28 cloud102-6 SM: [21718] Jun 1 22:53:28 cloud102-6 SM: [21718] lock: closed /var/lock/sm/54c1f244-d62c-47ac-90d2-f40be0f6fe5a/vdi

Seems existing symlinks were present, causing a crash.

This was ultimately resolved by removing ALL symlinks on every pool server:

cd /run/sr-mount/ rm -f cd /dev/nbd/RBD_XenStorage- rm -f

I also did a mass vdi-forget and sr-scan, but I don't think those actually helped with the fix at all.

Could a future version remove existing symlinks before trying to create them? Also removing any existing locks in ceph may help too (rbd lock remove ...)

rposudnevskiy commented 7 years ago

Hi, Please try this update 6967923

mdmeier commented 7 years ago

Thanks, Roman. Been a busy couple days. Will implement tonight.