Closed: gregory112 closed this issue 2 years ago.
[Nomad 1.1.2]
I have similar issues using the AWS EBS plugin. And I wholeheartedly share the sentiment that it's really hard to reproduce or debug, which is why I've been reluctant to open an issue. In my case Nomad workers are spot instances which come and go. Non-CSI jobs get rescheduled fine, but those with CSI volumes attached tend to go red on the first reschedule and then eventually succeed. It's almost like the CSI mechanism needs more time to do its thing before the job gets restarted...
I have been investigating it for a while.
A manual solution that works is to deregister the volume with `-force`, or at least to detach it with `nomad volume detach`. Deregistering means you need to register the volume again, and if you used `nomad volume create` as I did, you will have to look up the external ID and re-register the volume with that same external ID. So `nomad volume detach` is probably the most straightforward approach. I learned that what it actually does is unmount the mount point of the particular volume on the particular node, and this sometimes hangs. When it hangs, I see an `umount` process on the node hanging indefinitely. Network connectivity is never reliable, so I think cases like this need attention.
I don't know what Nomad does with unused volumes after the job is stopped. Does Nomad instruct the plugin to unmount the volumes from nodes? What happens if it fails to do so? Maybe a timeout, more log messages, or anything at all should be added to `nomad alloc status`.
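That manual flow can be sketched as a small script. This is only a sketch under assumptions: the volume name `workspace` and the spec file `volume.hcl` are placeholders, and `parse_external_id` assumes the `External ID = vol-...` layout that `nomad volume status -verbose` prints.

```shell
#!/bin/sh
# Sketch of the manual recovery flow: capture the external ID before
# deregistering, then re-register against the same backing volume.
set -e

# Extract the value of the "External ID = vol-..." line from
# `nomad volume status -verbose` output.
parse_external_id() {
  awk -F'= *' '/^External ID/ {print $2; exit}'
}

recover_volume() {
  vol="$1"; spec="$2"
  external_id=$(nomad volume status -verbose "$vol" | parse_external_id)
  echo "captured external_id=${external_id}; set external_id in ${spec} to it"
  nomad volume deregister -force "$vol"
  # The spec file must carry external_id = "<captured ID>" plus the original
  # capability block, or a brand-new cloud volume will be created instead.
  nomad volume register "$spec"
}

# Usage on a cluster: recover_volume workspace volume.hcl
```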
`umount` often hangs, especially when unmounting network file systems like this. And when `umount` hangs, `nomad volume detach` hangs too. The solution in my case is to kill the `umount` process on the node and retry with `umount -f`, but I don't know if Nomad can do this, as I think it might be managed by the CSI plugin. But given the variation across the many different CSI plugins, I think cases like this should be handled by Nomad somehow.
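What that cleanup looks like on the node, roughly (a sketch: the mount-point path is a placeholder, and the lazy `umount -l` fallback is an addition for mounts that even `-f` won't clear):

```shell
#!/bin/sh
# Sketch: clear a wedged unmount on a client node so that
# `nomad volume detach` can make progress again.

force_unmount() {
  mp="$1"
  # Kill any umount process already hanging on this mount point. The pattern
  # is anchored so only real `umount <...>` command lines match.
  pids=$(pgrep -f "^umount .*${mp}" || true)
  [ -n "$pids" ] && kill -9 $pids
  # -f forces the unmount (helps with unreachable network file systems);
  # -l lazily detaches it as a last resort.
  umount -f "$mp" 2>/dev/null || umount -l "$mp" 2>/dev/null
}

# Example (placeholder path under the Nomad client's data directory):
# force_unmount /var/lib/nomad/client/csi/node/<plugin-id>/per-alloc/<alloc>/<vol>/rw-file-system-single-node-writer
```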
We have the very same issue with Nomad 1.1.3 and AWS EBS CSI Plugin v1.2.0. We need to revert to host_volumes as this issue threatens our cluster health when important stateful services like Prometheus can't restart correctly.
The volume is currently unmounted in AWS and shows `state = "Available"`, so it's clearly an internal Nomad state issue.
Well, this is bad, as I don't think the issue is solved by the Nomad garbage collector. When a volume hangs on one Nomad node, Nomad will just allocate the job to another node. And when it hangs there, it will allocate the job to yet another node. Imagine if we have 100 Nomad nodes and the volume is stuck on some of them.
I am seeing the same issues using the `gcp-compute-persistent-disk-csi-driver` plugin and Nomad `v1.1.1`. Interestingly, the Read Volume API endpoint shows the correct number of writers, but the List Volumes endpoint shows too many `CurrentWriters`. I guess this is what's blocking the reassignment of the volume.
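The mismatch can be checked directly against the HTTP API. A sketch (the `jq` filters assume the `CurrentWriters` field from List Volumes and the `WriteAllocs` map from Read Volume; `NOMAD_ADDR` and the volume ID `mydisk` are placeholders):

```shell
#!/bin/sh
# Sketch: compare the claim counts the two endpoints report for one volume.

# From a List Volumes (/v1/volumes?type=csi) response, pull CurrentWriters
# for the given volume ID.
list_writers() {
  jq -r --arg id "$1" '.[] | select(.ID == $id) | .CurrentWriters'
}

# From a Read Volume (/v1/volume/csi/:id) response, count the WriteAllocs.
read_writers() {
  jq -r '.WriteAllocs | length'
}

# Usage against a cluster:
#   curl -s "$NOMAD_ADDR/v1/volumes?type=csi"  | list_writers mydisk
#   curl -s "$NOMAD_ADDR/v1/volume/csi/mydisk" | read_writers
# If the first number is larger than the second, you're seeing the mismatch
# described above.
```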
We've faced this issue as well. It seems like Nomad fails to realise that there are no allocations present for the volume.
From what we could make out, both AWS and the CSI plugin report that the volume is available to mount:
$ nomad volume status -verbose -namespace=workspace workspace
ID = workspace
Name = workspace
External ID = vol-xxxxxxxxxxxxx
Plugin ID = aws-ebs0
Provider = ebs.csi.aws.com
Version = v0.10.1
Schedulable = true
Controllers Healthy = 1
Controllers Expected = 1
Nodes Healthy = 2
Nodes Expected = 2
Access Mode = single-node-writer
Attachment Mode = file-system
Mount Options = <none>
Namespace = workspace
And the AWS console reports the volume is "Available".
Whereas in the Nomad UI, the "Storage" tab reports multiple `#Allocs` for the aforementioned volume, and clicking on the volume shows that there are neither read nor write mounts. This supports my hunch that Nomad somehow isn't aware that the volume has been detached by the driver and is not in use by any of the allocations. Here's a gist of logs from the CSI controller:
I0816 06:26:49.482851 1 controller.go:329] ControllerUnpublishVolume: called with args {VolumeId:vol-XXXXXXXXXX NodeId:i-YYYYYYYYYYY Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0816 06:26:56.194278 1 controller.go:346] ControllerUnpublishVolume: volume vol-XXXXXXXXXX detached from node i-YYYYYYYYYYY
One thing I have noticed is that on a newly registered CSI volume, the metadata shows up as:
Access Mode = <none>
Attachment Mode = <none>
While on the volume that is "Schedulable" (though not really, as far as Nomad is concerned) it shows up as:
Access Mode = single-node-writer
Attachment Mode = file-system
Could be pointing to something, perhaps?
The only solution that has worked for us so far is to `-force` deregister the volume and re-register it. Would love to have this issue solved. Let me know if I can help with any more information.
Thank you all for the detailed information (keep them coming if you have more!).
It seems like this issue happens sporadically when allocations that use a volume are restarted. I will try to create a high churn environment and see if I can reproduce it.
Hi everyone, just a quick update. I've been running a periodic job for a couple of days now, and so far I haven't seen any issues.
In hindsight I should've probably used a `service` job, since they are likely more common and have somewhat different scheduling logic compared to `batch`.
For those who have seen this issue, do you have any kind of `update` block set? This could affect how new allocations are created, and I want to make sure I have a good reproduction environment.
Thanks!
Hey @lgfa29, you can try out the following job. It's a basic `prometheus` deployment without any special `update` block.
I regularly have issues when trying to update it, for example to add more resources.
job "prometheus" {
datacenters = ["dc1"]
type = "service"
group "monitoring" {
count = 2
constraint {
operator = "distinct_hosts"
value = "true"
}
volume "data" {
type = "csi"
source = "prometheus-disk"
attachment_mode = "file-system"
access_mode = "single-node-writer"
per_alloc = true
}
network {
port "http" {
static = 9090
}
}
service {
name = "prometheus2"
tags = ["prometheus2"]
task = "prometheus"
port = "http"
check {
type = "http"
port = "http"
path = "/-/ready"
interval = "10s"
timeout = "5s"
}
}
task "prometheus" {
driver = "docker"
user = "root"
volume_mount {
volume = "data"
destination = "/prometheus"
}
resources {
memory = 1024
cpu = 1024
}
template {
data = <<EOT
---
global:
scrape_interval: 10s
external_labels:
__replica__: "{{ env "NOMAD_ALLOC_ID" }}"
scrape_configs:
- job_name: "prometheus"
scrape_interval: 10s
consul_sd_configs:
- server: "{{ env "attr.unique.network.ip-address" }}:8500"
services:
- prometheus2
relabel_configs:
- source_labels: ["__meta_consul_node"]
regex: "(.*)"
target_label: "node"
replacement: "$1"
- source_labels: ["__meta_consul_service_id"]
regex: "(.*)"
target_label: "instance"
replacement: "$1"
- source_labels: ["__meta_consul_dc"]
regex: "(.*)"
target_label: "datacenter"
replacement: "$1"
# Nomad metrics
- job_name: "nomad_metrics"
consul_sd_configs:
- server: "{{ env "attr.unique.network.ip-address" }}:8500"
services: ["nomad-client", "nomad"]
relabel_configs:
- source_labels: ["__meta_consul_tags"]
regex: "(.*)http(.*)"
action: keep
- source_labels: ["__meta_consul_service"]
regex: "(.*)"
target_label: "job"
replacement: "$1"
- source_labels: ["__meta_consul_node"]
regex: "(.*)"
target_label: "node"
replacement: "$1"
- source_labels: ["__meta_consul_service_id"]
regex: "(.*)"
target_label: "instance"
replacement: "$1"
- source_labels: ["__meta_consul_dc"]
regex: "(.*)"
target_label: "datacenter"
replacement: "$1"
scrape_interval: 5s
metrics_path: /v1/metrics
params:
format: ["prometheus"]
# Consul metrics
- job_name: "consul_metrics"
consul_sd_configs:
- server: "{{ env "attr.unique.network.ip-address" }}:8500"
services: ["consul-agent"]
relabel_configs:
- source_labels: ["__meta_consul_tags"]
regex: "(.*)http(.*)"
action: keep
- source_labels: ["__meta_consul_service"]
regex: "(.*)"
target_label: "job"
replacement: "$1"
- source_labels: ["__meta_consul_node"]
regex: "(.*)"
target_label: "node"
replacement: "$1"
- source_labels: ["__meta_consul_service_id"]
regex: "(.*)"
target_label: "instance"
replacement: "$1"
- source_labels: ["__meta_consul_dc"]
regex: "(.*)"
target_label: "datacenter"
replacement: "$1"
scrape_interval: 5s
metrics_path: /v1/agent/metrics
params:
format: ["prometheus"]
EOT
destination = "local/prometheus.yml"
}
config {
image = "quay.io/prometheus/prometheus"
ports = ["http"]
args = [
"--config.file=${NOMAD_TASK_DIR}/prometheus.yml",
"--log.level=info",
"--storage.tsdb.retention.time=1d",
"--storage.tsdb.path=/prometheus",
"--web.console.libraries=/usr/share/prometheus/console_libraries",
"--web.console.templates=/usr/share/prometheus/consoles"
]
}
}
}
}
> For those who have seen this issue, do you have any kind of `update` block set? This could affect how new allocations are created and I want to make sure I have a good reproduction environment.
Mine has.
update {
health_check = "checks"
min_healthy_time = "10s"
healthy_deadline = "15m"
progress_deadline = "20m"
}
Also, you can try making the job fail, for example through an image pull failure or runtime failures, to the point that the deployment exceeds its deadline. This is what really makes the error show up frequently. I also have a `service` block set up to provide health checks.
Yes, please try with a `service` job. What made the error show up in my case was task failure, and stopping and re-planning the task. Re-planning without stopping can sometimes cause the issue too, for example when updating the image version used by the job.
Also, I have two master nodes. I don't know if this contributes, but does anyone else run two (or more) master nodes? Maybe a race condition between master nodes? Split brain, perhaps?
Also, here's mine:
variables {
mycluster_zookeeper_image = "zookeeper:3.6"
}
job "mycluster-zookeeper" {
region = "global"
datacenters = ["dc1"]
type = "service"
update {
health_check = "checks"
min_healthy_time = "10s"
healthy_deadline = "15m"
progress_deadline = "20m"
}
group "zookeeper1" {
count = 1
restart {
interval = "10m"
attempts = 2
delay = "10s"
mode = "fail"
}
network {
mode = "cni/weave"
}
service {
name = "mycluster-zookeeper1-peer"
port = "2888"
address_mode = "alloc"
}
service {
name = "mycluster-zookeeper1-leader"
port = "3888"
address_mode = "alloc"
}
service {
name = "mycluster-zookeeper1-client"
port = "2181"
address_mode = "alloc"
}
volume "data" {
type = "csi"
read_only = false
source = "mycluster-zookeeper1-data"
attachment_mode = "file-system"
access_mode = "single-node-writer"
mount_options {
fs_type = "ext4"
mount_flags = ["noatime"]
}
}
volume "datalog" {
type = "csi"
read_only = false
source = "mycluster-zookeeper1-datalog"
attachment_mode = "file-system"
access_mode = "single-node-writer"
mount_options {
fs_type = "ext4"
mount_flags = ["noatime"]
}
}
task "zookeeper" {
driver = "docker"
volume_mount {
volume = "data"
destination = "/data"
read_only = false
}
volume_mount {
volume = "datalog"
destination = "/datalog"
read_only = false
}
config {
image = var.mycluster_zookeeper_image
image_pull_timeout = "10m"
}
template {
destination = "local/zookeeper.env"
data = <<EOH
ZOO_SERVERS=server.1:0.0.0.0:2888:3888;2181 server.2:{{ with service "mycluster-zookeeper2-peer" }}{{ with index . 0 }}{{ .Address }}{{ end }}:2888{{ else }}127.0.0.1:2888{{ end }}:3888;2181 server.3:{{ with service "mycluster-zookeeper3-peer" }}{{ with index . 0 }}{{ .Address }}{{ end }}:2888{{ else }}127.0.0.1:2888{{ end }}:3888;2181
EOH
env = true
}
env {
ZOO_MY_ID = "1"
ZOO_CFG_EXTRA = "minSessionTimeout=10000"
}
resources {
cpu = 500
memory = 512
}
}
}
group "zookeeper2" {
count = 1
restart {
interval = "10m"
attempts = 2
delay = "10s"
mode = "fail"
}
network {
mode = "cni/weave"
}
service {
name = "mycluster-zookeeper2-peer"
port = "2888"
address_mode = "alloc"
}
service {
name = "mycluster-zookeeper2-leader"
port = "3888"
address_mode = "alloc"
}
service {
name = "mycluster-zookeeper2-client"
port = "2181"
address_mode = "alloc"
}
volume "data" {
type = "csi"
read_only = false
source = "mycluster-zookeeper2-data"
attachment_mode = "file-system"
access_mode = "single-node-writer"
mount_options {
fs_type = "ext4"
mount_flags = ["noatime"]
}
}
volume "datalog" {
type = "csi"
read_only = false
source = "mycluster-zookeeper2-datalog"
attachment_mode = "file-system"
access_mode = "single-node-writer"
mount_options {
fs_type = "ext4"
mount_flags = ["noatime"]
}
}
task "zookeeper" {
driver = "docker"
volume_mount {
volume = "data"
destination = "/data"
read_only = false
}
volume_mount {
volume = "datalog"
destination = "/datalog"
read_only = false
}
config {
image = var.mycluster_zookeeper_image
image_pull_timeout = "10m"
}
template {
destination = "local/zookeeper.env"
data = <<EOH
ZOO_SERVERS=server.1:{{ with service "mycluster-zookeeper1-peer" }}{{ with index . 0 }}{{ .Address }}{{ end }}:2888{{ else }}127.0.0.1:2888{{ end }}:3888;2181 server.2:0.0.0.0:2888:3888;2181 server.3:{{ with service "mycluster-zookeeper3-peer" }}{{ with index . 0 }}{{ .Address }}{{ end }}:2888{{ else }}127.0.0.1:2888{{ end }}:3888;2181
EOH
env = true
}
env {
ZOO_MY_ID = "2"
ZOO_CFG_EXTRA = "minSessionTimeout=10000"
}
resources {
cpu = 500
memory = 512
}
}
}
group "zookeeper3" {
count = 1
restart {
interval = "10m"
attempts = 2
delay = "10s"
mode = "fail"
}
network {
mode = "cni/weave"
}
service {
name = "mycluster-zookeeper3-peer"
port = "2888"
address_mode = "alloc"
#connect {
# sidecar_service {
# disable_default_tcp_check = true
#
# proxy {
# upstreams {
# destination_name = "mycluster-zookeeper1-peer"
# local_bind_port = "12888"
# }
# upstreams {
# destination_name = "mycluster-zookeeper2-peer"
# local_bind_port = "22888"
# }
# }
# }
#}
}
service {
name = "mycluster-zookeeper3-leader"
port = "3888"
address_mode = "alloc"
#connect {
# sidecar_service {
# disable_default_tcp_check = true
#
# proxy {
# upstreams {
# destination_name = "mycluster-zookeeper1-leader"
# local_bind_port = "13888"
# }
# upstreams {
# destination_name = "mycluster-zookeeper2-leader"
# local_bind_port = "23888"
# }
# }
# }
#}
}
service {
name = "mycluster-zookeeper3-client"
port = "2181"
address_mode = "alloc"
#connect {
# sidecar_service {
# disable_default_tcp_check = true
# #proxy {
# # upstreams {
# # destination_name = "mycluster-zookeeper1-client"
# # local_bind_port = "12181"
# # }
# # upstreams {
# # destination_name = "mycluster-zookeeper2-client"
# # local_bind_port = "22181"
# # }
# # upstreams {
# # destination_name = "mycluster-zookeeper3-client"
# # local_bind_port = "32181"
# # }
# #}
# }
#}
}
volume "data" {
type = "csi"
read_only = false
source = "mycluster-zookeeper3-data"
attachment_mode = "file-system"
access_mode = "single-node-writer"
mount_options {
fs_type = "ext4"
mount_flags = ["noatime"]
}
}
volume "datalog" {
type = "csi"
read_only = false
source = "mycluster-zookeeper3-datalog"
attachment_mode = "file-system"
access_mode = "single-node-writer"
mount_options {
fs_type = "ext4"
mount_flags = ["noatime"]
}
}
task "zookeeper" {
driver = "docker"
volume_mount {
volume = "data"
destination = "/data"
read_only = false
}
volume_mount {
volume = "datalog"
destination = "/datalog"
read_only = false
}
config {
image = var.mycluster_zookeeper_image
image_pull_timeout = "10m"
}
template {
destination = "local/zookeeper.env"
data = <<EOH
ZOO_SERVERS=server.1:{{ with service "mycluster-zookeeper1-peer" }}{{ with index . 0 }}{{ .Address }}{{ end }}:2888{{ else }}127.0.0.1:2888{{ end }}:3888;2181 server.2:{{ with service "mycluster-zookeeper2-peer" }}{{ with index . 0 }}{{ .Address }}{{ end }}:2888{{ else }}127.0.0.1:2888{{ end }}:3888;2181 server.3:0.0.0.0:2888:3888;2181
EOH
env = true
}
env {
ZOO_MY_ID = "3"
ZOO_CFG_EXTRA = "minSessionTimeout=10000"
}
resources {
cpu = 500
memory = 512
}
}
}
}
And here's my HCL file used to create all the volumes with `nomad volume create`:
id = "mycluster-zookeeper1-data"
name = "zookeeper1-data"
type = "csi"
plugin_id = "ceph-csi"
capacity_min = "1G"
capacity_max = "10G"
capability {
access_mode = "single-node-writer"
attachment_mode = "file-system"
}
mount_options {
fs_type = "ext4"
mount_flags = ["noatime"]
}
secrets {
adminID = "redacted"
adminKey = "redacted"
}
parameters {
clusterID = "redacted"
fsName = "myfsname"
mounter = "kernel"
}
Thanks for the sample jobs @JanMa and @gregory112.
It seems like `update` is not part of the problem, so I created this job that has a 10% chance of failing. I will leave it running and see if the issue is triggered.
job "random-fail" {
datacenters = ["dc1"]
type = "service"
group "random-fail" {
volume "ebs-vol" {
type = "csi"
read_only = false
source = "ebs-vol"
attachment_mode = "file-system"
access_mode = "single-node-writer"
mount_options {
fs_type = "ext4"
mount_flags = ["noatime"]
}
}
task "random-fail" {
driver = "docker"
config {
image = "alpine:3.14"
command = "/bin/ash"
args = ["/local/script.sh"]
}
template {
data = <<EOF
#!/usr/bin/env bash
while true;
do
echo "Rolling the dice..."
n=$(($RANDOM % 10))
echo "Got ${n}!"
if [[ 0 -eq ${n} ]];
then
echo "Bye :wave:"
exit 1;
fi
echo "'Til the next round."
sleep 10;
done
EOF
destination = "local/script.sh"
}
volume_mount {
volume = "ebs-vol"
destination = "/volume"
read_only = false
}
}
}
}
For some reason Nomad thinks that the volume is still in use while it is not. `nomad volume deregister` returns `Error deregistering volume: Unexpected response code: 500 (rpc error: volume in use: nessus)`. Running `nomad volume deregister -force` followed by `nomad system gc`, and then registering the volume again, seems to help.
I think I was able to reproduce this with a self-updating job:
job "update" {
datacenters = ["dc1"]
type = "service"
group "update" {
volume "ebs-vol" {
type = "csi"
read_only = false
source = "ebs-vol2"
attachment_mode = "file-system"
access_mode = "single-node-writer"
mount_options {
fs_type = "ext4"
mount_flags = ["noatime"]
}
}
task "update" {
driver = "docker"
config {
image = "alpine:3.14"
command = "/bin/sh"
args = ["/local/script.sh"]
}
template {
data = <<EOF
#!/usr/bin/env bash
apk add curl jq
while true;
do
sleep 10
jobspec=$(curl http://172.17.0.1:4646/v1/job/${NOMAD_JOB_ID})
cpu=$(echo $jobspec | jq '.TaskGroups[0].Tasks[0].Resources.CPU')
if [ $cpu -eq 500 ]
then
cpu=600
else
cpu=500
fi
new_jobspec=$(echo $jobspec | jq ".TaskGroups[0].Tasks[0].Resources.CPU = ${cpu}")
echo $new_jobspec | jq '{"Job":.}' | curl -H "Content-Type: application/json" -X POST --data @- http://172.17.0.1:4646/v1/job/${NOMAD_JOB_ID}
done
EOF
destination = "local/script.sh"
}
volume_mount {
volume = "ebs-vol"
destination = "/volume"
read_only = false
}
resources {
cpu = 500
memory = 256
}
}
}
}
This job will usually fail first, and then work.
It's not exactly the same error message, but maybe it's related:
failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: rpc error: controller publish: attach volume: controller attach volume: CSI.ControllerAttachVolume: controller plugin returned an internal error, check the plugin allocation logs for more information: rpc error: code = Internal desc = Could not attach volume "vol-0fa6e88e4c3b55d29" to node "i-02396ab34ed318944": attachment of disk "vol-0fa6e88e4c3b55d29" failed, expected device to be attached but was detaching
Looking at the volume details page, I see the previous allocation still listed under `Write Allocations` for a while, even after it's marked as `Complete`. I don't know if this is related, since another alloc is able to start, but it may be a pointer in the right direction.
This nasty workaround seems to be working for DigitalOcean. If your task restarts frequently, it will spam your cluster with jobs, so be careful with that.
`your-job.nomad.tpl`:
# ...
task "reregister_volume" {
lifecycle {
hook = "poststop"
sidecar = false
}
driver = "docker"
config {
image = "alpine:3.14"
entrypoint = [
"/bin/sh",
"-eufc",
<<-EOF
apk add curl
curl --fail --data '@-' -X POST \
"http://$${attr.unique.network.ip-address}:4646/v1/job/reregister-volume/dispatch" <<-EndOfData
{
"Meta": {
"csi_id": "${csi_id_prefix}[$${NOMAD_ALLOC_INDEX}]",
"csi_id_uri_component": "${csi_id_prefix}%5B$${NOMAD_ALLOC_INDEX}%5D",
"volume_name": "${volume_name_prefix}$${NOMAD_ALLOC_INDEX}",
"volume_plugin_id": "${volume_plugin_id}"
}
}
EndOfData
EOF
]
}
}
# ...
register-volume.nomad.tpl:
job "reregister-volume" {
type = "batch"
parameterized {
payload = "forbidden"
meta_required = ["csi_id", "csi_id_uri_component", "volume_name", "volume_plugin_id"]
}
group "reregister" {
task "reregister" {
driver = "docker"
config {
image = "alpine:3.14"
entrypoint = [
"/bin/sh",
"-eufc",
<<-EOF
sleep 5 # Wait for job to stop
apk add jq curl
echo "CSI_ID=$${NOMAD_META_CSI_ID}"
echo "CSI_ID_URI_COMPONENT=$${NOMAD_META_CSI_ID_URI_COMPONENT}"
echo "VOLUME_NAME=$${NOMAD_META_VOLUME_NAME}"
echo "VOLUME_PLUGIN_ID=$${NOMAD_META_VOLUME_PLUGIN_ID}"
n=0
until [ "$n" -ge 15 ]; do
echo "> Checking if volume exists (attempt $n)"
curl --fail -X GET \
"http://$${attr.unique.network.ip-address}:4646/v1/volumes?type=csi" \
| jq -e '. | map(.ID == "$${NOMAD_META_CSI_ID}") | any | not' && break
n=$((n+1))
sleep 1
echo
echo '> Force detaching volume'
curl --fail -X DELETE \
"http://$${attr.unique.network.ip-address}:4646/v1/volume/csi/$${NOMAD_META_CSI_ID_URI_COMPONENT}?force=true" \
|| echo ' Detaching failed'
done
if [ "$n" -ge 15 ]; then
echo ' Deregister failed too many times, giving up'
exit 0
else
echo ' Deregister complete'
fi
echo
echo '> Fetching external volume ID'
VOLUME_JSON=$(
curl --fail -X GET \
-H "Authorization: Bearer ${digitalocean_token_persistent}" \
"https://api.digitalocean.com/v2/volumes?name=$${NOMAD_META_VOLUME_NAME}" \
| jq '.volumes[0]'
)
VOLUME_ID=$(
echo "$VOLUME_JSON" | jq -r '.id'
)
VOLUME_REGION=$(
echo "$VOLUME_JSON" | jq -r '.region.slug'
)
VOLUME_DROPLET_ID=$(
echo "$VOLUME_JSON" | jq -r '.droplet_ids[0] // empty'
)
echo "VOLUME_ID=$VOLUME_ID"
echo "VOLUME_REGION=$VOLUME_REGION"
echo "VOLUME_DROPLET_ID=$VOLUME_DROPLET_ID"
if [ ! -z "$VOLUME_DROPLET_ID" ]; then
echo
echo '> Detaching volume on DigitalOcean'
curl --fail -X POST \
-H "Authorization: Bearer ${digitalocean_token_persistent}" \
-d "{\"type\": \"detach\", \"droplet_id\": \"$VOLUME_DROPLET_ID\", \"region\": \"$VOLUME_REGION\"}" \
"https://api.digitalocean.com/v2/volumes/$VOLUME_ID/actions"
fi
echo
echo '> Re-registering volume'
curl --fail --data '@-' -X PUT \
"http://$${attr.unique.network.ip-address}:4646/v1/volume/csi/$${NOMAD_META_CSI_ID_URI_COMPONENT}" <<-EndOfVolume
{
"Volumes": [
{
"ID": "$${NOMAD_META_CSI_ID}",
"Name": "$${NOMAD_META_VOLUME_NAME}",
"ExternalID": "$VOLUME_ID",
"PluginID": "$${NOMAD_META_VOLUME_PLUGIN_ID}",
"RequestedCapabilities": [{
"AccessMode": "single-node-writer",
"AttachmentMode": "file-system"
}]
}
]
}
EndOfVolume
echo
echo '> Reading volume'
curl --fail -X GET \
"http://$${attr.unique.network.ip-address}:4646/v1/volume/csi/$${NOMAD_META_CSI_ID_URI_COMPONENT}"
echo
echo 'Finished'
EOF
]
}
}
}
}
Same issue on my cluster with the EFS plugin (docker: amazon/aws-efs-csi-driver:v1.3.3). It usually happens after a restart of some node. No Write Allocations, no Read Allocations, yet mounting fails because the CSI volume has exhausted its available writer claims.
One thing worth mentioning is that, right after force-deregistering the volume, my Node Health recovered from (4/5) to (5/5). The EFS plugin could not be deployed to that node before deregistering.
[Nomad v1.1.3]
We have the exact same issue with the Linstor CSI driver (https://linbit.com/linstor/). Our use case is to fail over a job when one server or a DC dies. We are still in the implementation phase, so we were intentionally killing servers where Nomad jobs were running to reproduce the fail-over use cases.
It works a few times and Nomad does a proper job, but after a random number of reschedules it fails with the same error message and the same behavior described in this issue. In the Nomad UI volume list the volume has an allocation set, but the detailed view shows no allocs present. Basically the same behavior that @Thunderbottom reported.
Yep, I'm running a really basic setup with just one server/client and one client, and this happens to me too. I need to `-force` deregister the volume and then re-register it to fix it.
Just a kind reminder for those who are using AWS EFS: make sure you are using the mount options suggested by AWS (https://docs.aws.amazon.com/efs/latest/ug/efs-mount-helper.html), especially the `noresvport` flag. Otherwise it will cause infinite I/O wait (100% wa in `top`) once the TCP connection breaks.
We didn't observe any more problems after correcting the mount options.
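For volumes registered with `nomad volume register`, those options map onto the volume spec's `mount_options` block. A sketch, with the flag list copied from the AWS mount-helper page linked above; adjust for your setup:

```hcl
# Illustrative fragment of an EFS volume spec; note noresvport, which avoids
# indefinite I/O hangs after a broken TCP connection is re-established.
mount_options {
  fs_type     = "nfs4"
  mount_flags = [
    "nfsvers=4.1",
    "rsize=1048576",
    "wsize=1048576",
    "hard",
    "timeo=600",
    "retrans=2",
    "noresvport",
  ]
}
```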
> Same issue on my cluster with EFS plugin (docker: amazon/aws-efs-csi-driver:v1.3.3) It usually happens after a restart of some node. No Write Allocations, No Read Allocations, however failed to mount because CSI volume has exhausted its available writer claims. One thing needs to be mentioned is that, right after force deregistering the volume, my Node Health recovered from (4/5) to (5/5). EFS plugin cannot be deployed to that node before deregistering.
Same issue on our v1.1.4 cluster using Ceph RBD and cephfs CSI plugins
This may not be a network issue: even though I deployed Ceph and Nomad on the same node, mounting to localhost, the issue still persists. `nomad volume detach` works fine without any problem since network issues are now out of the way. I also used a single Nomad server, but this issue still exists. The weird thing is that when I have three instances (zoo1, zoo2, zoo3) all asking for their own volumes, only some volumes (not all) have issues. Could this be a concurrency problem? A race condition, perhaps?
I've noticed my AWS CSI plugins (both the node and controller) are restarting frequently, maybe 3-4 times a day. I don't know why yet. I suspect this sometimes leaves the volumes in a bad state and causes this error.
I'm hitting the same thing using SeaweedFS. Ultimately I think a network issue was causing the volumes to disconnect, but Nomad never released them internally until I deregistered/re-registered them.
Any update on this? @lgfa29
Same issue on our v1.1.5 cluster using AWS EBS volumes. We have to manually force-deregister and register the affected volumes. It happens very frequently.
Unfortunately it means we cannot automate Nomad deployments in a CI/CD stack, as a human being needs to be present to fix any stuck volumes.
Any update on this? @lgfa29
Hi @gregory112, no updates yet. I will let you know once we do have any progress to share here.
I have been digging through the Nomad source code and found a possible work-around that works well for my use case. If the volume gets registered as a `multi-node-multi-writer` volume instead of a `single-node-writer` volume, the parts of the source code that I think are buggy are just skipped. And because I set the `per_alloc` flag to `true`, I am rather confident that Nomad or the CSI driver will not actually try to attach a volume that is in use to another alloc/task.
This might have some other unintended consequences, but for now it allows me to deploy updated versions of the job again without having to re-register the volume every time.
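For reference, the only change the work-around needs is in the registered capability. A sketch with placeholder IDs (whether a given CSI driver honors the mode is up to the driver):

```hcl
# Hypothetical volume spec for the work-around: register a multi-node
# capability so Nomad's single-writer claim accounting no longer applies,
# and rely on per_alloc = true in the job to keep one alloc per volume.
id        = "mydisk[0]"
name      = "mydisk-0"
type      = "csi"
plugin_id = "your-plugin-id"

capability {
  access_mode     = "multi-node-multi-writer" # instead of single-node-writer
  attachment_mode = "file-system"
}
```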
@JanMa Thanks, I'll try that. I tried deregistering my volume and re-registering it with `access_mode = "multi-node-multi-writer"`, and set the capability to multi-node-multi-writer in the Nomad job's volume stanza. But the output of `nomad volume status` still shows `single-node-writer`. Maybe the ceph-csi driver does not support that mode and reverts back to single-node-writer. Thank you for the work-around though; I'll try it with other drivers later.
Another +1 here. We are having these issues in both dev and prod with AWS EBS.
Nomad v1.1.1 with EBS CSI driver v1.1.0.
No special `update {}` stanza (using the default).
As others described, it looks like Nomad "forgets" that the volume is no longer allocated. In my case, I even had a job that had not been running for weeks, and when I tried to redeploy, Nomad still complained about `CSI volume <name> has exhausted its available writer claims`.
Inspecting the volume status showed the volume as `No allocations placed`, and checking AWS confirmed the volume isn't mounted to any instance.
Trying to deregister the volume returns `Error deregistering volume: Unexpected response code: 500 (rpc error: volume in use: <name>)` and requires the `-force` flag.
I'm seeing a lot of errors in logs of the form "error releasing volume claims [...] Permission denied".
ubuntu@ip-10-0-2-23:~$ journalctl -u nomad | grep "error releasing volume claims" -A 2 | tail
Nov 09 00:28:10 ip-10-0-2-23 nomad[24326]: "
Nov 09 00:36:15 ip-10-0-2-23 nomad[24326]: 2021-11-09T00:36:15.221Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=auth-scylladb[1] error="1 error occurred:
Nov 09 00:36:15 ip-10-0-2-23 nomad[24326]: * Permission denied
Nov 09 00:36:15 ip-10-0-2-23 nomad[24326]: "
Nov 09 00:48:20 ip-10-0-2-23 nomad[24326]: 2021-11-09T00:48:20.202Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=auth-scylladb[1] error="1 error occurred:
Nov 09 00:48:20 ip-10-0-2-23 nomad[24326]: * Permission denied
Nov 09 00:48:20 ip-10-0-2-23 nomad[24326]: "
Nov 09 01:08:27 ip-10-0-2-23 nomad[24326]: 2021-11-09T01:08:27.605Z [ERROR] nomad.volumes_watcher: error releasing volume claims: namespace=default volume_id=auth-scylladb[1] error="1 error occurred:
Nov 09 01:08:27 ip-10-0-2-23 nomad[24326]: * Permission denied
Nov 09 01:08:27 ip-10-0-2-23 nomad[24326]: "
ubuntu@ip-10-0-2-23:~$ nomad --version
Nomad v1.1.4 (acd3d7889328ad1df2895eb714e2cbe3dd9c6d82)
Edit: I fixed this error by granting far too many permissions to the nodes (mount-volumes did not seem to be enough). Unfortunately I still see the parent issue happening, so this seems unrelated.
Hi all 👋 Was this issue ever fixed? I'm running Nomad 1.1.6 with the Hetzner CSI driver 1.6.0, and when it works it's great, but when it breaks it's pure chaos: a stuck controller allocation (even though the CSI plugin was successfully started by Docker and throws no errors), and Nomad throwing the 'exhausted claims' error once the CSI driver allocation succeeds. Today I had to force-deregister and re-register all volumes to get it working again.
Any way I can help with debugging this?
Another +1 from here. I've tried both EBS and Ceph and hit this issue. Is this related to https://github.com/hashicorp/nomad/issues/10833? @tgross @lgfa29
I don't want to be mean towards the Nomad team, but this is a really big issue in my opinion; it basically disqualifies CSI for any serious HA deployment (not that my personal website is HA). I'm purposefully avoiding CSI on a few key services to make sure they don't randomly die. Just my two cents. It sucks that this is an issue... Nomad is otherwise a stellar system.
+1. Nomad is truly an amazing alternative to Kubernetes. But at times I have considered switching just because of this issue. In the meantime I have stopped using CSI altogether.
My use case is simply to mount an AWS EBS volume for some databases. I can do that without CSI by running scripts as pre- and post-start lifecycle tasks. If anyone is interested, here are the scripts I use.
With CSI, tasks died every day and failed to restart. With this solution, not a single task has failed.
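The scripts themselves aren't reproduced here, but the shape of the approach, assuming a prestart task that attaches and mounts the volume and a poststop task that detaches it, is roughly this (script paths, device name, mount point, and volume ID are all placeholders):

```hcl
# Hypothetical sketch: manage the EBS attachment with lifecycle hook tasks
# instead of CSI.
task "attach-ebs" {
  lifecycle {
    hook    = "prestart"
    sidecar = false
  }
  driver = "exec"
  config {
    command = "/usr/local/bin/attach-ebs.sh"
    # volume ID, device name, mount point
    args = ["vol-0123456789abcdef0", "/dev/sdf", "/srv/db"]
  }
}

task "detach-ebs" {
  lifecycle {
    hook    = "poststop"
    sidecar = false
  }
  driver = "exec"
  config {
    command = "/usr/local/bin/detach-ebs.sh"
    args    = ["vol-0123456789abcdef0"]
  }
}
```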
+1 CSI at the moment can be viewed extremely unreliable because of this bug.
The Nomad team at HashiCorp hears you! Sorry we haven't shipped a fix yet, but it's on our roadmap. In the meantime I've pinned the issue to raise its visibility.
Meanwhile, can anyone confirm how viable the two workarounds mentioned are? I've seen:
I will say that, from my experience, trying to use Ceph with Nomad has been kind of bad.
One thing that has caused a lot of pain is following the guidance in https://github.com/hashicorp/nomad/tree/main/demo/csi/ceph-csi-plugin. For example, setting the `--instance-id` flag per alloc ID causes the controller to create an omap key per alloc, so whenever a controller alloc dies it creates a new omap ID and cannot find the previous volumes.
Coupled with the pain of forcefully deregistering volumes and re-registering them as a solution, a changed controller no longer knows where volumes live in Ceph: alloc XYZ:VOLUME1 is not the same as alloc ABC:VOLUME1.
@schmichael force de-registering the volume and re-creating it is how I've been dealing with it using the Linode CSI plugin
@schmichael multi-node-multi-writer access mode isn't supported by AWS EBS, so it cannot be applied. As for force detach, that's really a workaround for the consequences, not a way to avoid the whole situation in the first place.
Cheers for raising the visibility on this issue.
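For anyone relying on the force-deregister workaround in the meantime, here is a rough sketch of the commands involved. The volume ID, spec filename, and external ID below are placeholders, not values from this thread; you must capture your volume's real external ID before deregistering, or the re-registered volume won't point at the same backing disk.

```
# Note the "External ID" field before doing anything destructive.
nomad volume status mysql-data

# Force-deregister the stuck volume (Nomad still believes it is claimed).
nomad volume deregister -force mysql-data

# Re-register using a volume spec that pins the same external_id, e.g.:
#   id          = "mysql-data"
#   external_id = "vol-0abc123"   # placeholder - use the ID captured above
nomad volume register mysql-data.hcl
```

If `nomad volume detach <volume> <node>` completes, it is the less destructive option, but as noted above it can hang when the underlying umount hangs.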
using the `--instance-id` flag to the instance per alloc_id will cause the controller to create an omap key per alloc, such that whenever an alloc of controller dies it creates a new omap ID and cannot find previous volumes.
@trilom are you saying that the documentation is wrong and that flag shouldn't be used, or is this a bug that is preventing it from working the way it's intended to? I'm also running ceph and the information on the internet is very scarce.
I will say from my experience dealing with wanting to use ceph with nomad has been kinda bad.
One thing that has caused a lot of pain is following guidance in this area https://github.com/hashicorp/nomad/tree/main/demo/csi/ceph-csi-plugin, for example using the `--instance-id` flag to the instance per alloc_id will cause the controller to create an omap key per alloc, such that whenever an alloc of controller dies it creates a new omap ID and cannot find previous volumes. When coupled with the pain of "deregistering forcefully" volumes and re-registering them as a solution, and the controller has changed, ceph does not know where the volumes are: alloc XYZ:VOLUME1 is not the same as alloc ABC:VOLUME1.
I still use `${node.unique.name}-nodes` as the instance-id for the Ceph plugin node and `-controller` for the controller, and use job `type = "system"`.
Yet, I agree with you: using Ceph coupled with CSI and Nomad gave me a bad experience. On the bright side, the `multi-node-multi-writer` workaround currently works for me, but the system still needs manual intervention because it does not survive a reboot. My whole stack (Nomad + Consul + Ceph) is on the same host, for testing, and when the host is rebooted, Ceph and Nomad run normally but the plugin does not. The controller and node run, but the plugin is not registered in Nomad, or if it is registered, it shows no nodes in `nomad plugin status`, although all nodes are shown to be running in `nomad job status`. Probably it hangs because the plugin starts before Ceph, but even so, it stays like that forever, and I have to manually restart the job. I am still investigating this and so have not opened a new bug yet. I wonder if anyone sees the same issue.
So yes, I agree with you. It is quite unreliable.
using the `--instance-id` flag to the instance per alloc_id will cause the controller to create an omap key per alloc, such that whenever an alloc of controller dies it creates a new omap ID and cannot find previous volumes.
@trilom are you saying that the documentation is wrong and that flag shouldn't be used, or is this a bug that is preventing it from working the way it's intended to? I'm also running ceph and the information on the internet is very scarce.
@JohnKiller
Check the pool you are creating volumes against with this command: `rados ls -p RBD_POOL | grep csi.volume`. This will give you all the omaps created with the defaults. You should see one named `csi.volumes.default` if you don't assign an instance-id, but if you do assign it, this creates an omap key `csi.volumes.INSTANCE_ID`. If you ask the omap key for its value with `rados listomapvals -p RBD_POOL csi.volumes.default`, you can see that it maps the volume labels (in nomad) to volume labels (in ceph).
rados ls -p h2_nomad | grep csi.volume
csi.volumes.default
csi.volume.792614e1-4ea2-11ec-816b-0242ac110009
rados listomapvals -p h2_nomad csi.volumes.default
csi.volume.prometheus_dev
value (36 bytes) :
00000000  37 39 32 36 31 34 65 31  2d 34 65 61 32 2d 31 31  |792614e1-4ea2-11|
00000010  65 63 2d 38 31 36 62 2d  30 32 34 32 61 63 31 31  |ec-816b-0242ac11|
00000020  30 30 30 39                                        |0009|
00000024
Consider a scenario where the controller node changes: this omap key is recreated and doesn't know the IDs of the volumes. My best idea for using this is when you have a pool with multiple nomad clusters claiming volumes: rather than both clusters using `csi.volumes.default`, where all volume IDs would reside in the same omap K/V, you can have them as `csi.volumes.CLUSTER1` and `csi.volumes.CLUSTER2` by assigning `--instance-id=CLUSTER1`. This is my experience with the `canary` tag of ceph-csi.
For reference, here is what I am running for node and controller. As documented on the main branch linked before, the node doesn't require the secret nor a connection to the ceph monitors. However, running them and attempting to use a volume within a job will cause an error until you give the nodes the secret directory and monitor configuration.
job "plugin-ceph-csi-node" {
priority = 94
datacenters = ["h2"]
type = "system"
update {
max_parallel = 1
}
group "cephrbd" {
network {
port "prometheus" {}
}
restart {
attempts = 5
interval = "2m"
delay = "15s"
mode = "fail"
}
service {
name = "prometheus"
port = "prometheus"
tags = ["ceph-csi-node", "cephrbd"]
check {
type = "http"
path = "/metrics"
port = "prometheus"
interval = "30s"
timeout = "5s"
check_restart {
limit = 5
grace = "60s"
ignore_warnings = false
}
}
}
task "plugin" {
driver = "docker"
config {
image = "quay.io/cephcsi/cephcsi:canary"
privileged = true
ports = ["prometheus"]
args = [
"--drivername=rbd.csi.ceph.com",
"--v=5",
"--type=rbd",
"--nodeserver=true",
"--nodeid=${node.unique.name}-${NOMAD_ALLOC_ID}",
# this will define the omap key to which volumes are bound to
# it should be unique per cluster, if it changes then volume-id's change
#"--instanceid=h2",
"--metricsport=${NOMAD_PORT_prometheus}",
"--endpoint=unix://csi/csi.sock",
"--enableprofiling" # this enables prometheus
]
mount {
type = "bind"
source = "secrets"
target = "/tmp/csi/keys"
readonly = false
}
mount {
type = "bind"
source = "ceph-csi-config/config.json"
target = "/etc/ceph-csi-config/config.json"
readonly = false
}
}
template {
destination = "ceph-csi-config/config.json"
data = <<-EOF
[{
"clusterID": "01",
"monitors": [
{{range $index, $service := service "mon.ceph-h2z-in|any"}}{{if gt $index 0}}, {{end}}"{{.Address}}"{{end}}
]
},{
"clusterID": "02",
"monitors": [
{{range $index, $service := service "mon.ceph-369-wtf|any"}}{{if gt $index 0}}, {{end}}"{{.Address}}"{{end}}
]
}]
EOF
}
csi_plugin {
id = "cephrbd"
type = "node"
mount_dir = "/csi"
}
resources {
cpu = 75
memory = 128
memory_max = 256
}
}
}
}
job "plugin-ceph-csi-controller" {
priority = 95
datacenters = ["h2"]
group "cephrbd" {
count = 1
network {
port "prometheus" {}
}
restart {
attempts = 3
interval = "2m"
delay = "15s"
mode = "fail"
}
service {
name = "prometheus"
port = "prometheus"
tags = ["ceph-csi-controller", "cephrbd"]
check {
type = "http"
path = "/metrics"
port = "prometheus"
interval = "30s"
timeout = "5s"
check_restart {
limit = 2
grace = "60s"
ignore_warnings = false
}
}
}
task "plugin" {
driver = "docker"
config {
image = "quay.io/cephcsi/cephcsi:canary"
ports = ["prometheus"]
args = [
"--drivername=rbd.csi.ceph.com",
"--type=rbd",
"--v=5", # verbose 5
"--controllerserver=true",
"--nodeid=${node.unique.name}",
"--metricsport=${NOMAD_PORT_prometheus}",
# this will define the omap key to which volumes are bound to
# it should be unique per cluster, if it changes then volume-id's change
#"--instanceid=h2",
"--endpoint=unix://csi/csi.sock",
"--enableprofiling" # this enables prometheus
]
mount {
type = "bind"
source = "secrets"
target = "/tmp/csi/keys"
readonly = false
}
mount {
type = "bind"
source = "ceph-csi-config/config.json"
target = "/etc/ceph-csi-config/config.json"
readonly = false
}
}
template {
destination = "ceph-csi-config/config.json"
data = <<-EOF
[{
"clusterID": "01",
"monitors": [
{{range $index, $service := service "mon.ceph-h2z-in|any"}}{{if gt $index 0}}, {{end}}"{{.Address}}"{{end}}
]
},{
"clusterID": "02",
"monitors": [
{{range $index, $service := service "mon.ceph-369-wtf|any"}}{{if gt $index 0}}, {{end}}"{{.Address}}"{{end}}
]
}]
EOF
}
csi_plugin {
id = "cephrbd"
type = "controller"
mount_dir = "/csi"
}
resources {
cpu = 50
memory = 96
memory_max = 192
}
}
}
}
I think my main remaining issue is what happens when a volume is unable to connect: it isn't handled well. Eventually a container lingers that cannot be killed (docker cannot kill it). When you check the host, there are kernel errors complaining about the connection to mon and sometimes osd.
I have not found a way to kill that process without killing the node. Investigating the node, you can see that the rbd mount still exists, so it must not have been unmounted and unmapped successfully. After killing the node, and deregistering and recreating the volume, another node (or the same one after rebooting) can claim the volume.
There is no graceful way to handle this other than working against a known good pattern of volume claims, introducing a ton of process, or switching to another workload orchestrator to get easy volumes.
Hi folks, can we split off the Ceph demo discussion to a new GitHub issue? I suspect the problem is just that the demo is only a demo, and the job spec assumed a single node (ref):
For demonstration purposes only, you can run Ceph as a single container Nomad job on the Vagrant VM managed by the Vagrantfile at the top-level of this repo.
But let's open a new issue for that so that it gets the attention it deserves!
For folks who are interested in this issue, which is about the unfortunate writer claims bug, as @schmichael said it's on our roadmap and I'm back on the case to fix this and some of the other outstanding issues around CSI to finally push it out of beta. We can't give a timeline but I'll update here and on other open issues regularly as we squash the remaining bugs. Thanks for your patience, all!
Chiming in with an answer to this question
In the mean time can anyone confirm how viable the two workarounds mentioned are? I've seen:
1. Use multi-node-multi-writer access mode - may not work with all drivers or jobs
2. Force detach the volume: `volume status <vol>` shows "No allocations placed". The only way to recover is to deregister the volume manually.
I'm using the Ceph CSI and democratic-csi.
@tgross Point two as listed by @Ramblurr is extremely common. As an example, here is the cluster I'm using to test out csi for stuff in my home:
This is after a single cluster-wide reboot (I replaced and upgraded some switching gear). To shut down, I did the following:
Even with this care, two volumes (`syncthing-config` and `syncthing-data`) are now useless to me in a literal sense. I cannot deregister them because one thing thinks they're allocated, and I cannot detach them because another thing thinks they aren't.
I believe this problem is caused by a bunch of places only checking the length of the alloc map(s), but not the actual contents of said map(s).
Some examples of this:
This is a problem as there are several places in the codebase where the map is written to with a `nil` value:
I haven't had a chance to dig through the logic too much, but it looks like the gc routine is expected to clean these up; even so, it seems pretty easy for things to end up in a permanently broken state.
As you can see in the gif, I went ahead and created another csi volume to replace the syncthing_config volume (and switched to iscsi, because why not), and I'm going to have to do the same thing for the data volume to get things back into a functional state. It also doesn't always happen, as evidenced by the `unifi-data-iscsi` volume being usable. It seems like the gc routine can get "stuck"; perhaps it only checks the `Claims` map for state and not the `Alloc` map, assuming that something else is keeping them both in sync?
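The length-vs-contents distinction described above can be illustrated with a small standalone Go sketch. The `Claim` type and `hasLiveClaims` helper here are hypothetical stand-ins, not Nomad's actual types: the point is only that a map written with `nil` values still reports a non-zero length, so any gate of the form `len(m) > 0` will treat the volume as claimed forever.

```go
package main

import "fmt"

// Claim is a hypothetical stand-in for a volume claim record.
type Claim struct {
	AllocID string
}

// hasLiveClaims inspects the map's contents rather than its length,
// ignoring entries whose value is nil.
func hasLiveClaims(claims map[string]*Claim) bool {
	for _, c := range claims {
		if c != nil {
			return true
		}
	}
	return false
}

func main() {
	claims := map[string]*Claim{}
	claims["alloc-1"] = nil // a nil value left behind, as described above

	fmt.Println(len(claims) > 0)       // naive length check: looks claimed
	fmt.Println(hasLiveClaims(claims)) // contents check: no live claims
}
```

Running this prints `true` then `false`: the two checks disagree on exactly the kind of entry described above, which would explain a volume that shows "No allocations placed" yet refuses new claims.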
Hi @dcarbone yup, that's more-or-less the area I'm digging into. See also https://github.com/hashicorp/nomad/issues/8734 and https://github.com/hashicorp/nomad/issues/10052
Just as a bump: attempting to use terraform to update-in-place my syncthing job has resulted in my new `syncthing-config-iscsi` volume becoming stuck forever :(
Just a heads up for folks eagerly watching this one, I've broken this work down into a handful of fixes to investigate and implement and the first is https://github.com/hashicorp/nomad/pull/11776. I'll continue to update here as I go.
Hi folks, just wanted to give an update here. I've got an environment set up where I can reproduce this bug reasonably reliably using the `democratic-csi` plugin (thanks to the folks at @democratic-csi for having a nice Nomad setup there) with NFS. I've found two clues to where the bug may be:
When `(*csiHook) Prerun` runs, it "claims" the volume at the server. This claim kicks off the controller RPCs, and when it returns we mount the volume via the node RPCs. If the node RPCs fail, we return an error and then the `Postrun` hook is supposed to unpublish (disclaim) the volume. But the raft logs I have here show we never disclaim the volume!
Even in this scenario, I would absolutely expect the volume GC to pick up that the allocations associated with the volume are dead and mark the volume for reaping. But I don't see the unpublish RPC calls from the volume watcher either, which suggests either the GC job is not marking the volume for claim reaping correctly, or the volume watcher's logic is wrong. I have actual raft states here that failed this condition, so as long as it's not weirdly racy it's probably possible for me to feed that through some tests to figure out what's going wrong.
@tgross ~perhaps in the short term we could have a way to "force" detach volumes? I'm racking up quite a number of them :(~
FWIW, a very reliable way of triggering this problem is to try to update a job in-place. Update a `template` stanza, or one of the `service.tags` values. When I try to do this with Terraform, there is around a 1/3 chance that it'll cause at least one of the volumes to become "stuck".
Edited as @trilom's suggestion works perfectly :)
Nomad version
Nomad v1.1.2 (60638a086ef9630e2a9ba1e237e8426192a44244)
Operating system and Environment details
Ubuntu 20.04 LTS
Issue
Cannot re-plan jobs due to CSI volumes being claimed. I have seen many variations of this issue. I don't know how to debug it. I use the ceph-csi plugin, deployed as a system job on my two Nomad nodes. This results in two controllers and two ceph-csi nodes. I then create a few volumes using the `nomad volume create` command. I then create a job with three tasks that use three volumes. Sometimes, after a while, the job may fail, and I stop it. After that, when I try to replan the exact same job, I get that error.
What confuses me is the warning. It differs every time I run `job plan`. First I saw
Then, running `job plan` again a few seconds later, I got
Then again,
I have three groups: zookeeper1, zookeeper2, and zookeeper3, each using two volumes (data and datalog). I will just assume from this log that all volumes are non-reclaimable.
This is the output of `nomad volume status`. It says that they are schedulable.
This is the output of `nomad volume status zookeeper1-datalog`:
It says there are no allocations placed.
Reproduction steps
This is unfortunately flaky, but it most likely happens when a job fails, is stopped, and is then replanned. This persists even after I purge the job with `nomad job stop -purge`. No, doing `nomad system gc`, `nomad system reconcile summary`, or restarting Nomad does not work.
Expected Result
Should be able to reclaim the volume again without having to detach, or to deregister with -force and register again. I created the volumes using `nomad volume create`, so those volumes all have generated external IDs. There are 6 volumes and 2 nodes; I don't want to type detach 12 times every time this happens (and it happens frequently).
Actual Result
See error logs above.
Job file (if appropriate)
I have three groups (zookeeper1, zookeeper2, zookeeper3), each having a volume stanza like this (each with its own volumes; this one is for zookeeper2):
All groups have `count = 1`.