yongshengma closed this issue 5 years ago.
Does this log mean the backend or the ASD is full?
May 9 11:12:29 N05 alba: 2019-05-09 11:12:29 047136 +0800 - N05 - 2705/0000 - alba/asd - 1127264 - info - returning error Asd_protocol.Protocol.Error.Full
I tried to restart a halted vdisk and the log says it finished, but it actually did not.
May 14 10:34:34 N05 volumedriver_fs.sh: 2019-05-14 10:34:34 210177 +0800 - N05 - 2822/0x00007eff83c7d700 - volumedriverfs/MTServer - 0000000004ebde01 - info - work: You have 2 connections running
May 14 10:34:34 N05 volumedriver_fs.sh: 2019-05-14 10:34:34 210225 +0800 - N05 - 2822/0x00007eff83c7d700 - volumedriverfs/MTServer - 0000000004ebde02 - info - cleanup_threads_locked: Trying to join 2 threads
May 14 10:34:34 N05 volumedriver_fs.sh: 2019-05-14 10:34:34 210313 +0800 - N05 - 2822/0x00007f08aeffd700 - volumedriverfs/XMLRPCTimingWrapper - 0000000004ebde03 - info - execute: Arguments for volumeInfo are {[redirect_fenced:true,volume_id:3539fb79-0067-4ec4-a49c-85d590de62ea,vrouter_cluster_id:bd3baf59-72a0-41d7-8a9e-61657baf9880]}
May 14 10:34:34 N05 volumedriver_fs.sh: 2019-05-14 10:34:34 210387 +0800 - N05 - 2822/0x00007f08aeffd700 - volumedriverfs/MDSMetaDataStore - 0000000004ebde04 - error - do_handle_: 3539fb79-0067-4ec4-a49c-85d590de62ea: no internal MetaDataStore present - all previous MDS failover attempts failed?
May 14 10:34:34 N05 volumedriver_fs.sh: 2019-05-14 10:34:34 210481 +0800 - N05 - 2822/0x00007f08aeffd700 - volumedriverfs/XMLRPCTimingWrapper - 0000000004ebde05 - error - operator(): Failed to retrieve metadata store stats: No internal mdstore present 3539fb79-0067-4ec4-a49c-85d590de62ea - ignored
May 14 10:34:34 N05 volumedriver_fs.sh: 2019-05-14 10:34:34 211133 +0800 - N05 - 2822/0x00007f08aeffd700 - volumedriverfs/XMLRPCTimingWrapper - 0000000004ebde06 - info - execute: Call volumeInfo took 0.000768 seconds
May 14 10:34:34 N05 volumedriver_fs.sh: 2019-05-14 10:34:34 211244 +0800 - N05 - 2822/0x00007f08aeffd700 - volumedriverfs/XMLRPCConnection - 0000000004ebde07 - info - operator(): Closing socket 332
May 14 10:34:34 N05 volumedriver_fs.sh: 2019-05-14 10:34:34 213884 +0800 - N05 - 2822/0x00007eff83c7d700 - volumedriverfs/MTServer - 0000000004ebde08 - info - work: You have 2 connections running
May 14 10:34:34 N05 volumedriver_fs.sh: 2019-05-14 10:34:34 213930 +0800 - N05 - 2822/0x00007eff83c7d700 - volumedriverfs/MTServer - 0000000004ebde09 - info - cleanup_threads_locked: Trying to join 2 threads
May 14 10:34:34 N05 volumedriver_fs.sh: 2019-05-14 10:34:34 213978 +0800 - N05 - 2822/0x00007f08b4ff9700 - volumedriverfs/XMLRPCTimingWrapper - 0000000004ebde0a - info - execute: Arguments for restartVolume are {[force:false,volume_id:3539fb79-0067-4ec4-a49c-85d590de62ea,vrouter_cluster_id:bd3baf59-72a0-41d7-8a9e-61657baf9880]}
May 14 10:34:34 N05 volumedriver_fs.sh: 2019-05-14 10:34:34 214358 +0800 - N05 - 2822/0x00007f08b4ff9700 - volumedriverfs/VFSLocalNode - 0000000004ebde0b - warning - local_restart_volume_: 3539fb79-0067-4ec4-a49c-85d590de62ea already running
May 14 10:34:34 N05 volumedriver_fs.sh: 2019-05-14 10:34:34 214604 +0800 - N05 - 2822/0x00007f08b4ff9700 - volumedriverfs/XMLRPCTimingWrapper - 0000000004ebde0c - info - execute: Call restartVolume took 0.000579 seconds
May 14 10:34:34 N05 volumedriver_fs.sh: 2019-05-14 10:34:34 214685 +0800 - N05 - 2822/0x00007f08b4ff9700 - volumedriverfs/XMLRPCConnection - 0000000004ebde0d - info - operator(): Closing socket 332
May 14 10:34:34 N05 celery: 2019-05-14 10:34:34 23600 +0800 - N05 - 5286/140549621679936 - celery/job.py - on_success - 157014 - INFO - Task ovs.vdisk.restart[f601a621-5899-409a-b820-88e376b0da18] succeeded in 0.0408876431175s: None
A volume instance in the halted state is still present but does not accept I/O anymore. This state is entered after a fatal error or after HA forcefully started the volume elsewhere due to a network partition.
The restart_object call starts up volumes that are not present. To get the volume out of that state, a stop_object call has to be issued first (FWIW, these are the names of the calls in the volumedriver API).
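A minimal sketch of that sequence, assuming a Python client binding that exposes those two volumedriver calls; the client object, its construction, and the keyword arguments below are hypothetical placeholders, only the call names stop_object/restart_object come from the API mentioned above.

# Sketch: bring a halted-but-present volume back by stopping it first,
# then restarting it. `client` and the argument names are hypothetical.
VOLUME_ID = "3539fb79-0067-4ec4-a49c-85d590de62ea"  # the halted volume from the logs

def recover_halted_volume(client, volume_id=VOLUME_ID):
    # restart_object alone is a no-op for a halted volume that is still
    # present (the logs show "already running"), so stop it first.
    client.stop_object(volume_id, delete_local_data=False)   # hypothetical kwarg
    client.restart_object(volume_id, force=False)            # hypothetical kwarg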
Asd_protocol.Protocol.Error.Full means that a particular ASD is full. This results in failing uploads of fragments. If only a few ASDs are full, it will not cause problems, but if they are all full, you cannot write (or repair) anymore.
What does alba list-osds --config ... say?
Hi @redlicha
All halted volumes were started up after being stopped first. Thanks!
Best regards
Hi @toolslive
Last week the used space of the backend reached 99%. I also checked 'df -h' and saw that every mounted device under /mnt/alba-asd/ on every node had reached 99%. The used space dropped to 32% after a big snapshot (5.80T) was removed.
Using alba list-osds --config ..., I'm able to see detailed statistics. The ratio computed from used and total is now 32% for every ASD. Thanks for this command.
Two snippets:
5 : { Nsm_model.OsdInfo.kind =
(Nsm_model.OsdInfo.Asd ((["10.20.107.48"], 8601, false, false),
"65NTHVAujRqCKjji2JPKfu7mZlRTJ9Hb"));
decommissioned = false; node_id = "tiVrMBrphrBrn6Qv"; other = "other?";
total = 1199110983680L; used = 356381862657L;
seen = [2019/01/10_16:26:10.5534];
read =
[2019/05/15_11:49:0.5546; 2019/05/15_11:48:40.1238;
2019/05/15_11:48:25.5416; 2019/05/15_11:48:22.7101;
2019/05/15_11:47:50.1931; 2019/05/15_11:46:58.2712;
2019/05/15_11:46:47.1286; 2019/05/15_11:46:36.5023;
2019/05/15_11:45:43.7541; 2019/05/15_11:45:26.7668];
write =
[2019/05/15_11:48:23.4295; 2019/05/15_11:48:6.7567;
2019/05/15_11:46:40.1592; 2019/05/15_11:46:27.7731;
2019/05/15_11:46:23.8440; 2019/05/15_11:46:9.2203;
2019/05/15_11:46:8.9008; 2019/05/15_11:46:4.4235;
2019/05/15_11:46:4.1538; 2019/05/15_11:46:3.0537];
errors =
[(2019/05/11_08:40:5.6793, "Lwt.Canceled");
(2019/05/11_08:40:5.6789, "Lwt.Canceled");
(2019/05/11_08:39:23.3614, "Lwt.Canceled");
(2019/05/11_08:39:22.4070, "Lwt.Canceled");
(2019/05/11_08:38:24.0517, "Lwt.Canceled");
(2019/05/11_08:38:23.9245, "Lwt.Canceled");
(2019/05/11_08:38:23.5558, "Lwt.Canceled");
(2019/05/11_08:38:23.2138, "Lwt.Canceled");
(2019/05/11_08:38:22.9788, "Lwt.Canceled");
(2019/05/11_08:38:22.9787, "Lwt.Canceled")];
checksum_errors = 0L; claimed_since = (Some 2019/01/10_16:26:9.8235) }
...
23 : { Nsm_model.OsdInfo.kind =
(Nsm_model.OsdInfo.Asd ((["10.20.107.80"], 8600, false, false),
"U78qHgYN9GJdrOIy8LwVu74gYmqv5s3m"));
decommissioned = false; node_id = "SykSdKjhdNGgJe2d"; other = "other?";
total = 1198410878976L; used = 380962453212L;
seen = [2019/02/14_17:40:46.3412];
read =
[2019/05/15_11:49:15.6229; 2019/05/15_11:49:4.6534;
2019/05/15_11:48:57.4087; 2019/05/15_11:48:24.7683;
2019/05/15_11:48:22.7118; 2019/05/15_11:48:18.4301;
2019/05/15_11:47:48.4083; 2019/05/15_11:47:41.7724;
2019/05/15_11:47:36.7559; 2019/05/15_11:47:33.5664];
write =
[2019/05/15_11:48:6.4994; 2019/05/15_11:47:23.0258;
2019/05/15_11:47:17.7167; 2019/05/15_11:46:40.2072;
2019/05/15_11:46:27.7728; 2019/05/15_11:46:10.7085;
2019/05/15_11:46:9.2879; 2019/05/15_11:46:8.8242;
2019/05/15_11:46:7.8227; 2019/05/15_11:46:4.3758];
errors =
[(2019/04/26_18:39:16.9345,
"(Unix.Unix_error \"Connection refused\" connect \"\")");
(2019/04/26_18:39:12.4381,
"(Unix.Unix_error \"Connection refused\" connect \"\")");
(2019/04/26_18:39:8.8161,
"(Unix.Unix_error \"No route to host\" connect \"\")");
(2019/04/26_18:39:5.7900,
"(Unix.Unix_error \"No route to host\" connect \"\")");
(2019/04/26_18:39:5.7888, "Networking2.ConnectTimeout");
(2019/04/26_18:39:3.3795, "Networking2.ConnectTimeout");
(2019/04/26_18:39:2.7775, "Networking2.ConnectTimeout");
(2019/04/26_18:38:58.3335,
"(Unix.Unix_error \"Connection refused\" connect \"\")");
(2019/04/26_18:38:56.3672,
"(Unix.Unix_error \"Connection refused\" connect \"\")");
(2019/04/26_18:38:55.3317, "Networking2.ConnectTimeout")];
checksum_errors = 0L; claimed_since = (Some 2019/02/14_17:40:47.8028) }
Snapshots occupy extra space besides the volume itself, don't they? And 3x as much if a 3-way preset is chosen?
Best regards
If you write to your volume, the LBAs will be packed into containers of (let's say) 64MB. These containers are uploaded to ALBA (the object store in OVS). If you overwrite the data, the older data (and thus the containers it resides in) is kept until the snapshot referencing that data is deleted. So deleting snapshots and scrubbing will keep the used space in check.
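A toy model of that retention behavior (purely illustrative Python, not OVS code): overwritten containers stay on the backend as long as a snapshot still references them, and only scrubbing after the snapshot is gone reclaims them.

# Toy model: containers referenced by a snapshot survive an overwrite;
# deleting the snapshot plus "scrubbing" frees them again.
backend = {"c1": 64, "c2": 64}            # containers (MB) written before the snapshot
snapshot_refs = {"snap1": {"c1", "c2"}}

backend["c3"] = 64                        # overwrite: new container uploaded,
                                          # c1/c2 kept because snap1 references them
print(sum(backend.values()))              # 192 MB used on the backend

del snapshot_refs["snap1"]                # delete the snapshot ...
live = {"c3"}                             # ... only the live container stays referenced
for c in list(backend):
    if c not in live and not any(c in refs for refs in snapshot_refs.values()):
        del backend[c]                    # scrubbing reclaims unreferenced containers
print(sum(backend.values()))              # 64 MB used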
To get some insight into the data multipliers between the volume and the backend: during upload, the preset's policy for that volume is applied. The policy could be 3-way replication (times 3), or (9,5) erasure coding (times (9+5)/9), or something else (compression can play a role too).
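As a quick sanity check of those multipliers (ignoring compression and metadata overhead), a few lines of Python:

# Rough backend-space multipliers implied by a preset policy.
def replication_multiplier(copies):
    return float(copies)                  # 3-way replication -> 3.0

def erasure_coding_multiplier(k, m):
    return (k + m) / float(k)             # (9,5) erasure coding -> 14/9 ~= 1.56

print(replication_multiplier(3))          # 3.0
print(erasure_coding_multiplier(9, 5))    # 1.555...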
You can check the policies with alba list-presets.
To see the data used via the CLI, you can do the following:
$> alba list-namespaces --config ...
$> alba show-namespace <namespace-name> --config ...
Tip: most of these alba commands also have a --to-json flag, so you can parse the output with jq (for example).
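For example, a small Python sketch that computes the per-ASD fill ratio from alba list-osds --to-json. The JSON shape here is an assumption: the "used"/"total" fields are taken from the text output shown earlier, while the "result" key, the "long_id" field, and the config path are guesses, not a documented schema.

import json
import subprocess

# Assumption: list-osds supports the --to-json flag mentioned above and its
# entries carry the same "used"/"total" fields as the text output.
raw = subprocess.check_output(
    ["alba", "list-osds", "--config", "/path/to/albamgr.cfg", "--to-json"])
doc = json.loads(raw)
entries = doc["result"] if isinstance(doc, dict) and "result" in doc else doc
for osd in entries:
    used, total = osd.get("used"), osd.get("total")
    if used is not None and total:
        print("%s: %.1f%% used" % (osd.get("long_id", "?"), 100.0 * used / total))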
Much appreciated!
Hello,
A vdisk is halted and cannot start. Its status shows Halted and DTL shows 'DTL location optimization possible'. There is no change in status; it is still halted.
Grepping for its volume id:
These lines keep popping up...
There were errors some time ago. I'm not sure if this is the cause:
Any clue how to start this vdisk?
Context of this issue: the backend was nearly used up, with only 260G (1%) left out of 26T in total.
Best regards, Yongsheng