Closed: yongshengma closed this issue 5 years ago
Hi Yongsheng,
Scrubbing indeed executes an ensure_safety() first. Since you're getting a connection reset by peer, I suspect the volumedriver that is currently running the MDSes for that vdisk has some kind of issue with it. You can find the metadata servers for a given vdisk with vdisk.info. Check the logs of those volumedriver instances to see what they log when you get this error.
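Something along these lines from the ovs Python shell should show them (a rough sketch only; the exact info key can differ per version):

# Rough sketch, assuming the ovs embedded Python shell on a storagerouter;
# the 'metadata_backend_config' key name is an assumption and may differ per version.
from ovs.dal.lists.vdisklist import VDiskList

for vdisk in VDiskList.get_vdisks():
    if vdisk.name == 'my_vdisk':  # hypothetical vdisk name, replace with yours
        print(vdisk.guid)
        print(vdisk.info.get('metadata_backend_config'))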
On another note: OVS filed for bankruptcy on August 29th, so getting answers here will be hard in the future. You can keep using the open source version of OVS though.
Hi @jtorreke
I'm very sorry to hear that. I hope you are all doing well now. I do intend to keep using this version.
Here are a couple of messages which might be related:
(1) Not enough safety
Oct 11 16:29:58 N06 celery: 2019-10-11 16:29:58 97500 +0800 - N06 - 18056/140259256035136 - lib/mdsservice.py - ensure_safety - 11680 - INFO - vDisk 2085729c-4275-4022-80bb-1e25b6c760db - * Not enough safety - Current: 1 - Expected: 3
(2) Not enough services in use in primary domain
Oct 11 10:55:42 N01 celery: 2019-10-11 10:55:42 75600 +0800 - N01 - 27461/140500594239296 - lib/mdsservice.py - ensure_safety - 10775 - INFO - vDisk cd76c114-4342-490d-b6dc-794a229a7f83 - * Not enough services in use in primary domain - Current: 1 - Expected: 3
What can I do to solve it?
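In case it is relevant, this is roughly how I can re-run the safety check for a single vdisk from the ovs Python shell (a sketch; the call and its vdisk_guid parameter are assumed from the ovs.mds.ensure_safety celery task):

# Sketch only: re-run the safety check for one vdisk from the ovs Python shell.
# The vdisk_guid keyword is assumed from the parameters of the ovs.mds.ensure_safety celery task.
from ovs.lib.mdsservice import MDSServiceController

vdisk_guid = '2085729c-4275-4022-80bb-1e25b6c760db'  # guid from the 'Not enough safety' message above
MDSServiceController.ensure_safety(vdisk_guid=vdisk_guid)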
Another message:
Oct 14 04:56:22 N01 celery: 2019-10-14 04:56:22 81400 +0800 - N01 - 14925/140500594239296 - lib/decorators.py - log_message - 26977 - INFO - Ensure single CHAINED mode - ID 1571000182_4aZJiUD9Yp - New task ovs.mds.ensure_safety with params {'vdisk_guid': 'de14124c-e4e9-4b86-b3aa-4a7078de69a7'} scheduled for execution
Oct 14 04:56:22 N01 celery: 2019-10-14 04:56:22 81800 +0800 - N01 - 14925/140500594239296 - lib/mdsservice.py - ensure_safety - 26978 - INFO - vDisk de14124c-e4e9-4b86-b3aa-4a7078de69a7 - Start checkup for vDisk prod_replacePOD_disk2
Oct 14 04:56:22 N01 celery: 2019-10-14 04:56:22 93800 +0800 - N01 - 14925/140500594239296 - lib/mdsservice.py - ensure_safety - 26979 - INFO - vDisk de14124c-e4e9-4b86-b3aa-4a7078de69a7 - Current configuration: [{'ip': '159.226.241.48', 'port': 26300}, {'ip': '159.226.241.77', 'port': 26300}, {'ip': '159.226.241.49', 'port': 26300}]
Oct 14 04:56:22 N01 celery: 2019-10-14 04:56:22 93800 +0800 - N01 - 14925/140500594239296 - lib/mdsservice.py - ensure_safety - 26980 - INFO - vDisk de14124c-e4e9-4b86-b3aa-4a7078de69a7 - Reconfiguration required. Reasons:
Oct 14 04:56:22 N01 celery: 2019-10-14 04:56:22 93800 +0800 - N01 - 14925/140500594239296 - lib/mdsservice.py - ensure_safety - 26981 - INFO - vDisk de14124c-e4e9-4b86-b3aa-4a7078de69a7 - * Master 159.226.241.48:26300 is not local - Current location: 159.226.241.48 - Expected location: 159.226.241.80
Oct 14 04:58:06 N01 celery: 2019-10-14 04:58:06 99500 +0800 - N01 - 14925/140500594239296 - lib/mdsservice.py - ensure_safety - 26982 - ERROR - vDisk de14124c-e4e9-4b86-b3aa-4a7078de69a7 - Creating new namespace 4ab3b337-06b5-45fb-aaab-11d146aa3b99 failed for Service 159.226.241.80:26301
All the vdisks that failed on ensure_safety have the same message:
Creating new namespace <uuid> failed for Service 159.226.241.80:26301
In other words, they all point to 159.226.241.80:26301. I'm not sure if this means something.
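For reference, a rough way to list which vdisks reference that endpoint from the ovs shell (a sketch; I'm assuming vdisk.info exposes the MDS configuration as ip/port entries like the 'Current configuration' line above):

# Sketch: list vdisks whose MDS configuration mentions the suspect service.
# Assumes vdisk.info exposes a 'metadata_backend_config' list of {'ip': ..., 'port': ...} entries,
# like the 'Current configuration' line in the ensure_safety log above.
from ovs.dal.lists.vdisklist import VDiskList

suspect_ip, suspect_port = '159.226.241.80', 26301
for vdisk in VDiskList.get_vdisks():
    config = vdisk.info.get('metadata_backend_config') or []
    if any(entry.get('ip') == suspect_ip and entry.get('port') == suspect_port for entry in config):
        print(vdisk.guid, vdisk.name)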
I checked the related volumedriver logs and these lines show up repeatedly:
Oct 14 17:42:10 N06 volumedriver_fs.sh: 2019-10-14 17:42:10 369455 +0800 - N06 - 2937/0x00007fbd2affb700 - volumedriverfs/XMLRPCTimingWrapper - 00000000b46b803e - info - execute: Arguments for volumeInfo are {[redirect_fenced:true,volume_id:8a7804ff-a321-4c72-b00d-9eb851ca9a07,vrouter_cluster_id:bd3baf59-72a0-41d7-8a9e-61657baf9880]}
Oct 14 17:42:10 N06 volumedriver_fs.sh: 2019-10-14 17:42:10 369541 +0800 - N06 - 2937/0x00007fbd2affb700 - volumedriverfs/XMLRPCRedirectWrapper - 00000000b46b803f - info - execute: Object 8a7804ff-a321-4c72-b00d-9eb851ca9a07 is not present on this node - figuring out if it lives elsewhere.
Oct 14 17:42:10 N06 volumedriver_fs.sh: 2019-10-14 17:42:10 369829 +0800 - N06 - 2937/0x00007fbd2affb700 - volumedriverfs/XMLRPCRedirectWrapper - 00000000b46b8040 - info - execute: Redirecting to 159.226.241.80, port 26201
Hi Yongsheng,
Things you can check:
159.226.241.80 will be the IP of a storagerouter and 26301 the port of an MDS server running inside a volumedriver. Check the logs of that volumedriver whilst you run mds_checkup.
Those volumedriver logs only say that somebody is requesting information about a vdisk that is not running on the local instance. It checks the cluster to find where the vdisk is running and redirects the same volumeInfo call to that volumedriver.
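To trigger that checkup by hand while you watch the logs, something like this from the ovs Python shell should do (a sketch; mds_checkup is assumed to be the routine behind the scheduled task):

# Sketch: trigger the MDS checkup synchronously from the ovs Python shell while
# tailing the volumedriver logs on 159.226.241.80. mds_checkup is assumed to be
# the same routine the framework schedules periodically.
from ovs.lib.mdsservice import MDSServiceController

MDSServiceController.mds_checkup()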
Hi @jtorreke
I found that all vdisks which failed to pass ensure_safety also failed at the same endpoint 159.226.241.80:26301, so I rebooted that node 159.226.241.80 (a physical machine) and the issue was gone. I didn't find anything unusual in the volumedriver logs. Snapshot scrubbing has been working well for over a week since then.
The vpool exists across 6 nodes and the volumedrivers are in the same domain; in fact I'm using a very basic configuration with only 1 vpool and no domains defined at all. Now I often run mds_checkup and it finishes successfully and quickly. Thanks for your help!
Best regards,
Hi Buddies,
I have an emergency and need help. The space is being used up again by huge snapshots.
I've had an issue with deleting snapshots before, and the reason then was an offending key fingerprint in known_hosts. But this time it looks very different and I can't figure it out.
I also traced line by line through execute_scrub() and got
RuntimeError: Connection reset by peer
I find the snapshots cannot be scrubbed if their vdisk's MDSServiceController.ensure_safety() call failed, but I'm not sure. Any idea would be highly appreciated!
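To illustrate the suspicion, a per-vdisk check along these lines could confirm it (a sketch only; the ensure_safety call with vdisk_guid follows the celery task parameters logged earlier in this thread):

# Sketch: run ensure_safety for every vdisk and collect the ones that fail
# (e.g. with 'Connection reset by peer'); only the passing ones should scrub cleanly.
# The vdisk_guid keyword is assumed from the celery task parameters shown earlier.
from ovs.dal.lists.vdisklist import VDiskList
from ovs.lib.mdsservice import MDSServiceController

failed = []
for vdisk in VDiskList.get_vdisks():
    try:
        MDSServiceController.ensure_safety(vdisk_guid=vdisk.guid)
    except Exception as exc:
        failed.append((vdisk.name, str(exc)))

for name, error in failed:
    print(name, '->', error)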
Best regards, Yongsheng