rposudnevskiy / RBDSR

RBDSR - XenServer/XCP-ng Storage Manager plugin for CEPH
GNU Lesser General Public License v2.1

One VDI becomes unbootable #51

Open maxcuttins opened 7 years ago

maxcuttins commented 7 years ago

I have an issue with one VDI. It suddenly stopped working and shows:

"Failed","Starting VM '
Internal error: xenopsd internal error: Memory_interface.Internal_error("VM = 980788af-7864-4a96-b5c3-8fbde2961fa9; domid = 42; Bootloader.Bad_error Traceback (most recent call last):\n  File \"/usr/bin/pygrub\", line 984, in <module>\n    part_offs = get_partition_offsets(file)\n  File \"/usr/bin/pygrub\", line 116, in get_partition_offsets\n    image_type = identify_disk_image(file)\n  File \"/usr/bin/pygrub\", line 60, in identify_disk_image\n    buf = os.read(fd, read_size_roundup(fd, 0x8006))\nOSError: [Errno 5] Input/output error\n")
maxcuttins commented 7 years ago

3 VDI

maxcuttins commented 7 years ago

I got this after an `xe sr-scan` on the Ceph storage:

```
There was an SR backend failure.
status: non-zero exit
stdout:
stderr: Traceback (most recent call last):
  File "/opt/xensource/sm/RBDSR", line 774, in <module>
    SRCommand.run(RBDSR, DRIVER_INFO)
  File "/opt/xensource/sm/SRCommand.py", line 352, in run
    ret = cmd.run(sr)
  File "/opt/xensource/sm/SRCommand.py", line 110, in run
    return self._run_locked(sr)
  File "/opt/xensource/sm/SRCommand.py", line 159, in _run_locked
    rv = self._run(sr, target)
  File "/opt/xensource/sm/SRCommand.py", line 338, in _run
    return sr.scan(self.params['sr_uuid'])
  File "/opt/xensource/sm/RBDSR", line 244, in scan
    scanrecord.synchronise_new()
  File "/opt/xensource/sm/SR.py", line 581, in synchronise_new
    vdi._db_introduce()
  File "/opt/xensource/sm/VDI.py", line 312, in _db_introduce
    vdi = self.sr.session.xenapi.VDI.db_introduce(uuid, self.label, self.description, self.sr.sr_ref, ty, self.shareable, self.read_only, {}, self.location, {}, sm_config, self.managed, str(self.size), str(self.utilisation), metadata_of_pool, is_a_snapshot, xmlrpclib.DateTime(snapshot_time), snapshot_of)
  File "/usr/lib/python2.7/site-packages/XenAPI.py", line 248, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib/python2.7/site-packages/XenAPI.py", line 150, in xenapi_request
    result = _parse_result(getattr(self, methodname)(*full_params))
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1233, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1581, in __request
    allow_none=self.__allow_none)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 1086, in dumps
    data = m.dumps(params)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 633, in dumps
    dump(v, write)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 655, in __dump
    f(self, value, write)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 757, in dump_instance
    self.dump_struct(value.__dict__, write)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 736, in dump_struct
    dump(v, write)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 655, in __dump
    f(self, value, write)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 757, in dump_instance
    self.dump_struct(value.__dict__, write)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 736, in dump_struct
    dump(v, write)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 655, in __dump
    f(self, value, write)
  File "/usr/lib64/python2.7/xmlrpclib.py", line 666, in dump_int
    raise OverflowError, "int exceeds XML-RPC limits"
OverflowError: int exceeds XML-RPC limits
```

```
[root@xenserver-11 archive]# xe sr-scan uuid=51a45fd8-a4d1-4202-899c-00a0f81054cc

Broadcast message from systemd-journald@xenserver-11 (Sun 2017-06-25 05:34:16 CEST):

tapdisk[4632]: tapdisk-syslog: 1 messages dropped

Broadcast message from systemd-journald@xenserver-11 (Sun 2017-06-25 05:34:16 CEST):

tapdisk[4632]: tapdisk-syslog: 3 messages dropped

Broadcast message from systemd-journald@xenserver-11 (Sun 2017-06-25 05:34:16 CEST):

tapdisk[4632]: tapdisk-syslog: 1 messages dropped
```
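For reference, the OverflowError at the end of the trace comes from Python 2's xmlrpclib, which refuses to marshal integers outside the signed 32-bit range; since the trace shows `size` and `utilisation` already passed as strings, the offending integer is presumably inside one of the dict arguments. A minimal, standalone demonstration:

```python
# Python 2: xmlrpclib refuses integers outside the signed 32-bit range.
import xmlrpclib

print xmlrpclib.dumps((2 ** 31 - 1,))  # largest value that still marshals
try:
    xmlrpclib.dumps((2 ** 31,))        # one past MAXINT
except OverflowError as e:
    # "int exceeds XML-RPC limits" on 64-bit builds
    # ("long int exceeds XML-RPC limits" on 32-bit, via dump_long)
    print e
```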

rposudnevskiy commented 7 years ago

Hi, could you please run `xe sr-scan` again and send me the files /var/log/SMlog and /var/log/xensource.log? Thanks.

rposudnevskiy commented 7 years ago

The reason for this error is that some operations (resize, update, snapshot, clone, etc.) require unmapping the rbd-nbd device, executing the operation, and mapping it again. If only one unmap/map operation runs at a time, there is no problem. But if several VDIs are mapped simultaneously, the rbd-nbd device that was unmapped (for a resize, update, etc.) may be mapped again under a different device instance number than it had before the unmap. As a result, the VDI can't be unpaused after the operation, and we get the error mentioned above.
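
For illustration, here is a minimal sketch of that race (not RBDSR's actual code; the `rbd_pool/VHD-example` image spec is a made-up placeholder):

```python
# Illustrative sketch of the device-renumbering race (not RBDSR code).
import subprocess

def map_image(spec):
    # rbd-nbd prints the device it allocated (e.g. /dev/nbd0) on stdout
    return subprocess.check_output(['rbd-nbd', 'map', spec]).strip()

def unmap_device(dev):
    subprocess.check_call(['rbd-nbd', 'unmap', dev])

dev = map_image('rbd_pool/VHD-example')
unmap_device(dev)            # freed for a resize/update/snapshot/clone...
# ...meanwhile another VDI's map can claim the freed /dev/nbdX slot...
dev2 = map_image('rbd_pool/VHD-example')
if dev2 != dev:
    # ...so any stored reference to `dev` is now stale and the
    # subsequent unpause targets the wrong device path
    print 'device renumbered: %s -> %s' % (dev, dev2)
```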

The last update should fix this problem.

maxcuttins commented 7 years ago

I don't know if that's a good way to fix the issue. Thinking about it, the best way is probably to stop using a cache, so we don't rely on state the system may have changed behind our backs. It would probably be better to create a function that retrieves the nbd device automatically every time we need to issue a new command. Having no cache probably means less speed, but always up-to-date data and references. A simple function that parses the UUID of the VDI on the fly and retrieves the right nbd device should be a really thin and light piece of code.
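
Something along these lines would implement that idea (a sketch only; `nbd_device_for_vdi` is a hypothetical helper, and the exact `rbd-nbd list-mapped` output format varies between Ceph releases):

```python
# Sketch of a no-cache lookup: resolve the nbd device from the VDI UUID
# at call time by scanning what is currently mapped. Not part of RBDSR.
import subprocess

def nbd_device_for_vdi(vdi_uuid):
    out = subprocess.check_output(['rbd-nbd', 'list-mapped'])
    for line in out.splitlines():
        # assumption: the mapped image name embeds the VDI UUID and the
        # device path is the last column; adjust for your Ceph release
        if vdi_uuid in line and '/dev/nbd' in line:
            return line.split()[-1]
    return None   # VDI not currently mapped
```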

rposudnevskiy commented 7 years ago

Hi, it's not a cache. The registry just tracks the attached VDIs and stores a reference to the nbd device corresponding to each attached VDI. On VDI attach the reference is created, and on VDI detach it is deleted. The registry is also cleared if the SR is detached. So the registry should always hold accurate info about attached VDIs, I hope.
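
In other words, the registry behaves roughly like this (an illustrative sketch, not RBDSR's actual data structure):

```python
# Illustrative sketch of the attach-time registry described above.
class NBDRegistry(object):
    def __init__(self):
        self._devices = {}                 # vdi_uuid -> '/dev/nbdX'

    def on_vdi_attach(self, vdi_uuid, device):
        self._devices[vdi_uuid] = device   # reference created on attach

    def on_vdi_detach(self, vdi_uuid):
        self._devices.pop(vdi_uuid, None)  # reference deleted on detach

    def on_sr_detach(self):
        self._devices.clear()              # registry cleared with the SR

    def lookup(self, vdi_uuid):
        return self._devices.get(vdi_uuid)
```

The failure described in the earlier comment would then amount to a remap that bypasses `on_vdi_attach`, leaving the stored device path stale.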

blodone commented 6 years ago

A workaround is to detach the VDI, rename the UUID of the rbd image, then set its `:uuid` key with `image-meta set ...`, rescan, and attach again. Boot then works. It seems that if the unmap is interrupted, or something else fails to unlock, renaming fixes it without rebooting or re-attaching the whole SR.

blodone commented 6 years ago

For v2.0 rbd-nbd I created a patch to use image names instead of /dev/nbdXX numbers: https://github.com/rposudnevskiy/RBDSR/pull/79