openvstorage / framework

The Framework is a set of components and tools which brings the user an interface (GUI / API) to setup, extend and manage an Open vStorage platform.

Scrubbing failed #2336

Closed yongshengma closed 5 years ago

yongshengma commented 5 years ago

Hi Buddies,

I have an emergency and need help. The space is being used up again by huge snapshots. [screenshots attached]

I've had an issue with deleting snapshots before, and the cause then was an offending key fingerprint in known_hosts. This time it looks very different and I can't figure it out.

Oct 10 19:52:09 N03 celery: 2019-10-10 19:52:09 08800 +0800 - N03 - 6680/140511717349120 - lib/generic.py - _execute_scrub - 5510 - ERROR - Scrubber - vPool pool - StorageRouter N06 - vDisk WEB03_species-data1T - Scrubbing failed
Oct 10 19:52:09 N03 celery: Traceback (most recent call last):
Oct 10 19:52:09 N03 celery: File "/opt/OpenvStorage/ovs/lib/generic.py", line 359, in _execute_scrub
Oct 10 19:52:09 N03 celery: backend_config=Configuration.get_configuration_path(backend_config_key))
Oct 10 19:52:09 N03 celery: File "/usr/lib/python2.7/site-packages/rpyc/core/netref.py", line 199, in __call__
Oct 10 19:52:09 N03 celery: return syncreq(_self, consts.HANDLE_CALL, args, kwargs)
Oct 10 19:52:09 N03 celery: File "/usr/lib/python2.7/site-packages/rpyc/core/netref.py", line 72, in syncreq
Oct 10 19:52:09 N03 celery: return conn.sync_request(handler, oid, *args)
Oct 10 19:52:09 N03 celery: File "/usr/lib/python2.7/site-packages/rpyc/core/protocol.py", line 523, in sync_request
Oct 10 19:52:09 N03 celery: raise obj
Oct 10 19:52:09 N03 celery: RuntimeError: scrubber returned with nonzero exit status 256
Oct 10 19:52:09 N03 celery: ========= Remote Traceback (1) =========
Oct 10 19:52:09 N03 celery: Traceback (most recent call last):
Oct 10 19:52:09 N03 celery: File "/opt/OpenvStorage/tmp.jsi987o0xx/rpyc/core/protocol.py", line 347, in _dispatch_request
Oct 10 19:52:09 N03 celery: res = self._HANDLERS[handler](self, *args)
Oct 10 19:52:09 N03 celery: File "/opt/OpenvStorage/tmp.jsi987o0xx/rpyc/core/protocol.py", line 624, in _handle_call
Oct 10 19:52:09 N03 celery: return self._local_objects[oid](*args, **dict(kwargs))
Oct 10 19:52:09 N03 celery: RuntimeError: scrubber returned with nonzero exit status 256
Oct 10 19:52:09 N03 celery: 2019-10-10 19:52:09 14400 +0800 - N03 - 6680/140511717349120 - lib/generic.py - _execute_scrub - 5511 - INFO - Scrubber - vPool pool - StorageRouter N06 - Queue completely processed
Oct 10 19:52:09 N03 celery: 2019-10-10 19:52:09 14400 +0800 - N03 - 6680/140511846668032 - lib/generic.py - _deploy_stack_and_scrub - 5512 - INFO - Scrubber - vPool pool - StorageRouter N06 - Removing service ovs-albaproxy_cafb8b99-d701-4c56-83c4-03bbe43f32e1_scrub
Oct 10 19:52:11 N03 celery: 2019-10-10 19:52:11 90400 +0800 - N03 - 6680/140511846668032 - lib/generic.py - _deploy_stack_and_scrub - 5513 - INFO - Scrubber - vPool pool - StorageRouter N06 - Removed service ovs-albaproxy_cafb8b99-d701-4c56-83c4-03bbe43f32e1_scrub
Oct 10 19:52:11 N03 celery: 2019-10-10 19:52:11 90400 +0800 - N03 - 6680/140511846668032 - extensions/filemutex.py - release - 5514 - WARNING - A lock on /var/lock/ovs_flock_ovs_albaproxy_scrub was kept for 2.75983810425 sec
Oct 10 19:52:12 N03 celery: 2019-10-10 19:52:12 03200 +0800 - N03 - 47747/140512370628416 - celery/log.py - log - 6263 - ERROR - Task ovs.generic.execute_scrub[673ddbbf-138f-4038-91e1-23de7a0bebec] raised unexpected: Exception('Errors occurred while scrubbing:\n  - Scrubber - vPool pool - StorageRouter N06 - vDisk WEB03_species-data1T - Scrubbing failed',)
Oct 10 19:52:12 N03 celery: Traceback (most recent call last):
Oct 10 19:52:12 N03 celery: File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
Oct 10 19:52:12 N03 celery: R = retval = fun(*args, **kwargs)
Oct 10 19:52:12 N03 celery: File "/usr/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
Oct 10 19:52:12 N03 celery: return self.run(*args, **kwargs)
Oct 10 19:52:12 N03 celery: File "/opt/OpenvStorage/ovs/lib/helpers/decorators.py", line 346, in new_function
Oct 10 19:52:12 N03 celery: output = f(*args, **kwargs)
Oct 10 19:52:12 N03 celery: File "/opt/OpenvStorage/ovs/lib/generic.py", line 309, in execute_scrub
Oct 10 19:52:12 N03 celery: raise Exception('Errors occurred while scrubbing:\n  - {0}'.format('\n  - '.join(error_messages)))
Oct 10 19:52:12 N03 celery: Exception: Errors occurred while scrubbing:
Oct 10 19:52:12 N03 celery: - Scrubber - vPool pool - StorageRouter N06 - vDisk WEB03_species-data1T - Scrubbing failed
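
As a side note on the error itself: "nonzero exit status 256" looks like a raw wait status rather than a plain exit code, so the scrubber binary most likely exited with code 256 >> 8 = 1. A minimal, purely illustrative check in Python:

# Illustration only: decode a raw wait status of the kind os.system()-style
# calls return (assumption: the 256 above is such a raw status).
import os

raw_status = 256
print(os.WEXITSTATUS(raw_status))  # prints 1, i.e. the scrubber exited with code 1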

I also traced through execute_scrub() line by line and got RuntimeError: Connection reset by peer:

In [45]: 2019-10-09 15:09:43 100000 +0800 - N04 - 6722/140011496269568 - lib/generic.py - _deploy_stack_and_scrub - 0 - INFO - Scrubber - vPool pool - StorageRouter N05 - Deploying ALBA proxy ovs-albaproxy_e4d3b5d8-b8c2-4806-9d5b-41a9598a0c53_scrub
2019-10-09 15:09:48 69900 +0800 - N04 - 6722/140011496269568 - lib/generic.py - _deploy_stack_and_scrub - 1 - INFO - Scrubber - vPool pool - StorageRouter N05 - Deployed ALBA proxy ovs-albaproxy_e4d3b5d8-b8c2-4806-9d5b-41a9598a0c53_scrub
2019-10-09 15:09:48 71000 +0800 - N04 - 6722/140011496269568 - extensions/filemutex.py - release - 2 - WARNING - A lock on /var/lock/ovs_flock_ovs_albaproxy_scrub was kept for 4.70990991592 sec
2019-10-09 15:09:48 71300 +0800 - N04 - 6722/140011496269568 - lib/generic.py - _deploy_stack_and_scrub - 3 - INFO - Scrubber - vPool pool - StorageRouter N05 - Spawning 1 threads for proxy service ovs-albaproxy_e4d3b5d8-b8c2-4806-9d5b-41a9598a0c53_scrub
2019-10-09 15:09:51 10300 +0800 - N04 - 6722/140011463522048 - lib/generic.py - _execute_scrub - 4 - INFO - Scrubber - vPool pool - StorageRouter N05 - vDisk WEB03_species-data1T - Started scrubbing at location /mnt/hdd1/scrub_work_e4d3b5d8-b8c2-4806-9d5b-41a9598a0c53
2019-10-09 15:09:51 12800 +0800 - N04 - 6722/140011463522048 - lib/generic.py - _execute_scrub - 5 - INFO - Scrubber - vPool pool - StorageRouter N05 - vDisk WEB03_species-data1T - MDS master is not local, trigger handover
2019-10-09 15:09:51 14200 +0800 - N04 - 6722/140011463522048 - lib/decorators.py - log_message - 6 - INFO - Ensure single CHAINED mode - ID 1570604991_ohxcP05L07 - New task ovs.mds.ensure_safety with params {'vdisk_guid': '36d7e342-7a13-4481-bd22-4e5eb69e0304'} scheduled for execution
2019-10-09 15:09:51 14600 +0800 - N04 - 6722/140011463522048 - lib/mdsservice.py - ensure_safety - 7 - INFO - vDisk 36d7e342-7a13-4481-bd22-4e5eb69e0304 - Start checkup for vDisk WEB03_species-data1T
2019-10-09 15:09:51 30400 +0800 - N04 - 6722/140011463522048 - lib/mdsservice.py - ensure_safety - 8 - INFO - vDisk 36d7e342-7a13-4481-bd22-4e5eb69e0304 - Current configuration: [{'ip': '159.226.241.68', 'port': 26301}, {'ip': '159.226.241.80', 'port': 26301}, {'ip': '159.226.241.49', 'port': 26301}]
2019-10-09 15:09:51 30400 +0800 - N04 - 6722/140011463522048 - lib/mdsservice.py - ensure_safety - 9 - INFO - vDisk 36d7e342-7a13-4481-bd22-4e5eb69e0304 - Reconfiguration required. Reasons:
2019-10-09 15:09:51 30400 +0800 - N04 - 6722/140011463522048 - lib/mdsservice.py - ensure_safety - 10 - INFO - vDisk 36d7e342-7a13-4481-bd22-4e5eb69e0304 -    * Master 159.226.241.68:26301 is not local - Current location: 159.226.241.68 - Expected location: 159.226.241.47
2019-10-09 15:11:35 43100 +0800 - N04 - 6722/140011463522048 - lib/mdsservice.py - ensure_safety - 11 - ERROR - vDisk 36d7e342-7a13-4481-bd22-4e5eb69e0304 - Creating new namespace 82f550ad-0052-4a5c-b164-452ba16ab5c4 failed for Service 159.226.241.80:26301
Traceback (most recent call last):
  File "/opt/OpenvStorage/ovs/lib/mdsservice.py", line 829, in ensure_safety
    if str(vdisk.volume_id) not in client.list_namespaces():
RuntimeError: Connection reset by peer
2019-10-09 15:11:35 43600 +0800 - N04 - 6722/140011463522048 - lib/mdsservice.py - ensure_safety - 12 - WARNING - vDisk 36d7e342-7a13-4481-bd22-4e5eb69e0304: Deleting newly created namespace 82f550ad-0052-4a5c-b164-452ba16ab5c4 for service 159.226.241.47:26301
2019-10-09 15:11:38 56100 +0800 - N04 - 6722/140011463522048 - lib/generic.py - _execute_scrub - 13 - ERROR - Scrubber - vPool pool - StorageRouter N05 - vDisk WEB03_species-data1T - Scrubbing failed
Traceback (most recent call last):
  File "/opt/OpenvStorage/ovs/lib/generic.py", line 339, in _execute_scrub
    MDSServiceController.ensure_safety(vdisk_guid=vdisk_guid)  # Do not use a remote VDisk instance here
  File "/usr/lib/python2.7/site-packages/celery/local.py", line 188, in __call__
    return self._get_current_object()(*a, **kw)
  File "/usr/lib/python2.7/site-packages/celery/app/task.py", line 420, in __call__
    return self.run(*args, **kwargs)
  File "/opt/OpenvStorage/ovs/lib/helpers/decorators.py", line 461, in new_function
    output = f(*args, **kwargs)
  File "/opt/OpenvStorage/ovs/lib/mdsservice.py", line 829, in ensure_safety
    if str(vdisk.volume_id) not in client.list_namespaces():
RuntimeError: Connection reset by peer
2019-10-09 15:11:38 62200 +0800 - N04 - 6722/140011463522048 - lib/generic.py - _execute_scrub - 14 - INFO - Scrubber - vPool pool - StorageRouter N05 - Queue completely processed
2019-10-09 15:11:38 62300 +0800 - N04 - 6722/140011496269568 - lib/generic.py - _deploy_stack_and_scrub - 15 - INFO - Scrubber - vPool pool - StorageRouter N05 - Removing service ovs-albaproxy_e4d3b5d8-b8c2-4806-9d5b-41a9598a0c53_scrub
2019-10-09 15:11:42 07400 +0800 - N04 - 6722/140011496269568 - lib/generic.py - _deploy_stack_and_scrub - 16 - INFO - Scrubber - vPool pool - StorageRouter N05 - Removed service ovs-albaproxy_e4d3b5d8-b8c2-4806-9d5b-41a9598a0c53_scrub
2019-10-09 15:11:42 07500 +0800 - N04 - 6722/140011496269568 - extensions/filemutex.py - release - 17 - WARNING - A lock on /var/lock/ovs_flock_ovs_albaproxy_scrub was kept for 3.45173001289 sec

It seems the snapshots cannot be scrubbed if the MDSServiceController.ensure_safety() call for their vDisk fails, but I'm not sure.
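
For reference, ensure_safety can also be called directly from the OVS Python shell to reproduce this outside the scrubber. A minimal sketch based on the call shown in the traceback above (the guid is the failing vDisk from my log):

# Sketch: call ensure_safety for one failing vDisk, bypassing scrubbing.
# Run inside the OVS Python shell on a StorageRouter.
from ovs.lib.mdsservice import MDSServiceController

vdisk_guid = '36d7e342-7a13-4481-bd22-4e5eb69e0304'  # WEB03_species-data1T in my case
MDSServiceController.ensure_safety(vdisk_guid=vdisk_guid)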

Any idea would be highly appreciated!

Best regards, Yongsheng

jtorreke commented 5 years ago

Hi Yongsheng,

Scrubbing indeed executes an ensure_safety() first. Since you're getting a connection reset by peer, I suspect the volumedriver that is currently running the MDSes for that vDisk has some kind of issue. You can find the metadata servers for a given vDisk with vdisk.info. Check the logs of those volumedriver instances to see what they log when you hit this error.
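
Roughly something like this from the OVS Python shell will list them per vDisk. This is only a sketch; VDiskList.get_vdisks() and the 'metadata_backend_config' key are what I'd expect, but check what vdisk.info actually returns on your version and adjust:

# Sketch: print the MDS backend config per vDisk so you know which
# volumedriver instances to check. Key name is an assumption.
from ovs.dal.lists.vdisklist import VDiskList

for vdisk in VDiskList.get_vdisks():
    info = vdisk.info  # dict reported by the volumedriver
    print('{0}: {1}'.format(vdisk.name, info.get('metadata_backend_config')))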

On another note: OVS filed for bankruptcy on August 29th, so getting answers here will be hard in the future. You can keep using the open source version of OVS though.

yongshengma commented 5 years ago

Hi @jtorreke

I'm very sorry to hear that. I hope you are all doing well now. I do intend to keep using this version.

Here are a couple of messages which might be related:

(1) Not enough safety:
Oct 11 16:29:58 N06 celery: 2019-10-11 16:29:58 97500 +0800 - N06 - 18056/140259256035136 - lib/mdsservice.py - ensure_safety - 11680 - INFO - vDisk 2085729c-4275-4022-80bb-1e25b6c760db - * Not enough safety - Current: 1 - Expected: 3

(2) Not enough services in use in primary domain:
Oct 11 10:55:42 N01 celery: 2019-10-11 10:55:42 75600 +0800 - N01 - 27461/140500594239296 - lib/mdsservice.py - ensure_safety - 10775 - INFO - vDisk cd76c114-4342-490d-b6dc-794a229a7f83 - * Not enough services in use in primary domain - Current: 1 - Expected: 3

What can I do to solve it?

yongshengma commented 5 years ago

Another message:

Oct 14 04:56:22 N01 celery: 2019-10-14 04:56:22 81400 +0800 - N01 - 14925/140500594239296 - lib/decorators.py - log_message - 26977 - INFO - Ensure single CHAINED mode - ID 1571000182_4aZJiUD9Yp - New task ovs.mds.ensure_safety with params {'vdisk_guid': 'de14124c-e4e9-4b86-b3aa-4a7078de69a7'} scheduled for execution
Oct 14 04:56:22 N01 celery: 2019-10-14 04:56:22 81800 +0800 - N01 - 14925/140500594239296 - lib/mdsservice.py - ensure_safety - 26978 - INFO - vDisk de14124c-e4e9-4b86-b3aa-4a7078de69a7 - Start checkup for vDisk prod_replacePOD_disk2
Oct 14 04:56:22 N01 celery: 2019-10-14 04:56:22 93800 +0800 - N01 - 14925/140500594239296 - lib/mdsservice.py - ensure_safety - 26979 - INFO - vDisk de14124c-e4e9-4b86-b3aa-4a7078de69a7 - Current configuration: [{'ip': '159.226.241.48', 'port': 26300}, {'ip': '159.226.241.77', 'port': 26300}, {'ip': '159.226.241.49', 'port': 26300}]
Oct 14 04:56:22 N01 celery: 2019-10-14 04:56:22 93800 +0800 - N01 - 14925/140500594239296 - lib/mdsservice.py - ensure_safety - 26980 - INFO - vDisk de14124c-e4e9-4b86-b3aa-4a7078de69a7 - Reconfiguration required. Reasons:
Oct 14 04:56:22 N01 celery: 2019-10-14 04:56:22 93800 +0800 - N01 - 14925/140500594239296 - lib/mdsservice.py - ensure_safety - 26981 - INFO - vDisk de14124c-e4e9-4b86-b3aa-4a7078de69a7 -    * Master 159.226.241.48:26300 is not local - Current location: 159.226.241.48 - Expected location: 159.226.241.80
Oct 14 04:58:06 N01 celery: 2019-10-14 04:58:06 99500 +0800 - N01 - 14925/140500594239296 - lib/mdsservice.py - ensure_safety - 26982 - ERROR - vDisk de14124c-e4e9-4b86-b3aa-4a7078de69a7 - Creating new namespace 4ab3b337-06b5-45fb-aaab-11d146aa3b99 failed for Service 159.226.241.80:26301

All the vDisks that failed on ensure_safety have the same message: Creating new namespace <uuid> failed for Service 159.226.241.80:26301. In other words, they all point to 159.226.241.80:26301. I'm not sure if this means something.
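
A quick throwaway way to double-check this, assuming the celery lines end up in /var/log/messages as in the snippets above (adjust the path otherwise):

# Sketch: tally which MDS endpoints the 'failed for Service' errors point at.
import collections
import re

pattern = re.compile(r'failed for Service (\d+\.\d+\.\d+\.\d+:\d+)')
counts = collections.Counter()

with open('/var/log/messages') as log_file:
    for line in log_file:
        match = pattern.search(line)
        if match is not None:
            counts[match.group(1)] += 1

for endpoint, count in counts.most_common():
    print('{0}: {1}'.format(endpoint, count))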

I checked the related volumedriver logs and these lines show up repeatedly:

Oct 14 17:42:10 N06 volumedriver_fs.sh: 2019-10-14 17:42:10 369455 +0800 - N06 - 2937/0x00007fbd2affb700 - volumedriverfs/XMLRPCTimingWrapper - 00000000b46b803e - info - execute: Arguments for volumeInfo are {[redirect_fenced:true,volume_id:8a7804ff-a321-4c72-b00d-9eb851ca9a07,vrouter_cluster_id:bd3baf59-72a0-41d7-8a9e-61657baf9880]}
Oct 14 17:42:10 N06 volumedriver_fs.sh: 2019-10-14 17:42:10 369541 +0800 - N06 - 2937/0x00007fbd2affb700 - volumedriverfs/XMLRPCRedirectWrapper - 00000000b46b803f - info - execute: Object 8a7804ff-a321-4c72-b00d-9eb851ca9a07 is not present on this node - figuring out if it lives elsewhere.
Oct 14 17:42:10 N06 volumedriver_fs.sh: 2019-10-14 17:42:10 369829 +0800 - N06 - 2937/0x00007fbd2affb700 - volumedriverfs/XMLRPCRedirectWrapper - 00000000b46b8040 - info - execute: Redirecting to 159.226.241.80, port 26201

jtorreke commented 5 years ago

Hi Yongsheng,

Things you can check:

159.226.241.80 will be the IP of a StorageRouter and 26301 the port of an MDS server running inside a volumedriver. Check the logs of that volumedriver while you run mds_checkup.
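
For reference, the checkup can be kicked off synchronously from the OVS Python shell. This is a sketch and assumes the checkup is exposed on MDSServiceController like ensure_safety is; tail the volumedriver logs on 159.226.241.80 in another terminal while it runs:

# Sketch: run the MDS checkup in the foreground (assumption: it lives on
# MDSServiceController, next to ensure_safety in ovs/lib/mdsservice.py).
from ovs.lib.mdsservice import MDSServiceController

MDSServiceController.mds_checkup()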

Those volumedriver logs only say that something is requesting information about a vDisk that is not running on the local instance; the volumedriver looks up where in the cluster the vDisk is running and redirects the same volumeInfo call to the volumedriver there.

yongshengma commented 5 years ago

Hi @jtorreke,

I found that all the vDisks that failed to pass ensure_safety failed at the same endpoint, 159.226.241.80:26301, so I rebooted that node (a physical machine) and the issue was gone. I didn't find anything unusual in the volumedriver logs. Snapshot scrubbing has been working well for over a week since then.

The vPool spans 6 nodes and the volumedrivers are in the same domain; in fact I'm using a very basic config with only 1 vPool and no domains defined at all. I now run mds_checkup regularly and it finishes successfully and quickly. Thanks for your help!

Best regards,