SacDin opened 3 years ago
This needs some help. I have not used Nomad yet. I suppose it should be fairly simple though to have it work with Nomad.
Hi, I tried to adopt this a couple of days ago, admittedly being quite new to the topic of CSI volumes. Here is what I found; maybe it helps:
The Seaweed Driver itself works and is recognized by Nomad with a configuration similar to this:
job "csi-plugin-seaweedfs" {
datacenters = ["dc1"]
# type = "system" means that the job is deployed on every Nomad client node
type = "system"
group "nodes" {
task "plugin" {
driver = "docker"
config {
image = "chrislusf/seaweedfs-csi-driver:latest"
args = [
"--endpoint=unix://csi/csi.sock",
# this address is resolved by Consul DNS to point to a filer service; probably pointing to localhost
# and running a filer on every node would be more adequate
"--filer=seaweedfs-filer.service.consul:8888",
"--nodeid=${node.unique.name}",
"--cacheCapacityMB=1000",
"--cacheDir=/tmp",
]
privileged = true
}
csi_plugin {
id = "seaweedfs"
type = "node"
mount_dir = "/csi"
}
}
}
}
There are three types of CSI plugins. Controller Plugins communicate with the storage provider's APIs. For example, for a job that needs an AWS EBS volume, Nomad will tell the controller plugin that it needs a volume to be "published" to the client node, and the controller will make the API calls to AWS to attach the EBS volume to the right EC2 instance. Node Plugins do the work on each client node, like creating mount points. Monolith Plugins are plugins that perform both the controller and node roles in the same instance. Not every plugin provider has or needs a controller; that's specific to the provider implementation.
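For context, the plugin role is declared in the csi_plugin stanza of the job spec. The system job above registers the "node" half on every client; if the driver also exposes a controller, that is typically registered by a small service job. A rough, untested sketch reusing the image and flags from above (everything else is a placeholder):
job "csi-seaweedfs-controller" {
  datacenters = ["dc1"]
  # controllers do not need to run on every node; one or two instances are enough
  type = "service"
  group "controller" {
    task "plugin" {
      driver = "docker"
      config {
        image = "chrislusf/seaweedfs-csi-driver:latest"
        args = [
          "--endpoint=unix://csi/csi.sock",
          "--filer=seaweedfs-filer.service.consul:8888",
          "--nodeid=${node.unique.name}",
        ]
        privileged = true
      }
      csi_plugin {
        id        = "seaweedfs"
        type      = "controller" # "monolith" would combine the controller and node roles in one task
        mount_dir = "/csi"
      }
    }
  }
}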
Correct me if I am wrong, but the way I see it, for this to work seaweedfs-csi-driver would need to implement the complete "controller" part of the CSI spec while avoiding explicit Kubernetes dependencies. However, AFAIK SeaweedFS could become the first driver to provide a fully functional CSI implementation for Nomad without introducing any external dependencies, providing the storage from the Nomad cluster itself.
Anyway, if you have any thoughts or suggestions, I will be happy to try them out, since I couldn't find a satisfying solution to this problem so far.
EDIT: I just had a look into the spec and found that the controller interface is already implemented? I will try this out right now...
Ok, so I think this is actually working. I have just been able to register a volume with the following job spec:
job "seaweedfs" {
datacenters = ["dc1"]
type = "system"
group "seaweed" {
network {
mode = "bridge"
port "master" {
host_network = "vpn"
static = 9333
to = 9333
}
port "master-grpc" {
host_network = "vpn"
static = 19333
to = 19333
}
port "volume" {
host_network = "vpn"
static = 8889
to = 8889
}
port "volume-grpc" {
host_network = "vpn"
static = 18889
to = 18889
}
port "filer" {
host_network = "vpn"
static = 8888
to = 8888
}
port "filer-grpc" {
host_network = "vpn"
static = 18888
to = 18888
}
port "s3" {
host_network = "vpn"
}
}
service {
name = "seaweedfs-filer"
port = "filer"
check {
type = "http"
path = "/status"
port = "filer"
interval = "10s"
timeout = "2s"
}
}
volume "seaweed" {
type = "host"
source = "seaweed"
}
task "master" {
driver = "docker"
config {
image = "chrislusf/seaweedfs"
args = [
"master",
"-port=${NOMAD_PORT_master}"
]
ports = ["master", "master-grpc"]
}
}
task "volume" {
driver = "docker"
config {
image = "chrislusf/seaweedfs"
args = [
"volume",
"-mserver=:${NOMAD_PORT_master}",
"-dir=/data",
"-port=${NOMAD_PORT_volume}"
]
ports = ["master", "master-grpc", "volume", "volume-grpc"]
# command: 'volume -mserver="master:9333" -port=8080 -metricsPort=9325'
}
volume_mount {
volume = "seaweed"
destination = "/data"
}
}
task "filer" {
driver = "docker"
config {
image = "chrislusf/seaweedfs"
args = [
"filer",
"-master=:${NOMAD_PORT_master}",
"-port=${NOMAD_PORT_filer}"
]
ports = ["master", "master-grpc", "filer", "filer-grpc"]
# command: 'filer -master="master:9333" -metricsPort=9326'
tty = true
interactive = true
}
}
task "s3" {
driver = "docker"
config {
image = "chrislusf/seaweedfs"
args = [
"s3",
"-filer=:${NOMAD_PORT_filer}",
"-port=${NOMAD_PORT_s3}"
]
ports = ["master", "master-grpc", "filer", "filer-grpc"]
}
lifecycle {
hook = "poststart"
sidecar = true
}
}
task "cronjob" {
driver = "docker"
config {
image = "chrislusf/seaweedfs"
args = [
"cronjob"
]
}
env {
# Run re-replication every 5 minutes
CRON_SCHEDULE="*/5 * * * * *"
WEED_MASTER=":${NOMAD_PORT_master}"
}
lifecycle {
hook = "poststart"
sidecar = true
}
}
task "plugin" {
driver = "docker"
config {
image = "chrislusf/seaweedfs-csi-driver:latest"
args = [
"--endpoint=unix://csi/csi.sock",
"--filer=:8888",
"--nodeid=${node.unique.name}",
"--cacheCapacityMB=1000",
"--cacheDir=/tmp",
]
privileged = true
}
csi_plugin {
id = "seaweedfs"
type = "monolith"
mount_dir = "/csi"
}
}
}
}
I am not completely sure about the whole port configuration. However, I managed to connect with the weed CLI, and I was able to create a volume with nomad volume create volume.hcl:
# volume.hcl
id = "test-volume"
name = "test-volume"
type = "csi"
plugin_id = "seaweedfs"
# don't try to set this to less than 1GiB
capacity_min = "5GiB"
capacity_max = "8GiB"
capability {
access_mode = "single-node-reader-only"
attachment_mode = "file-system"
}
capability {
access_mode = "single-node-writer"
attachment_mode = "file-system"
}
# Optional: for 'nomad volume create', specify mount options to validate for
# 'attachment_mode = "file-system"'. Registering an existing volume will record
# but ignore these fields.
mount_options {
mount_flags = ["rw"]
}
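As a sanity check, here is a minimal, untested sketch of a job that would consume the volume registered above (the job name and image are placeholders; depending on the Nomad version, access_mode and attachment_mode may be set here or only at registration):
job "test-volume-consumer" {
  datacenters = ["dc1"]
  group "app" {
    volume "data" {
      type            = "csi"
      source          = "test-volume"
      access_mode     = "single-node-writer"
      attachment_mode = "file-system"
    }
    task "app" {
      driver = "docker"
      config {
        image = "busybox:1.34"
        args  = ["sleep", "3600"]
      }
      # mounts the CSI volume into the container
      volume_mount {
        volume      = "data"
        destination = "/data"
      }
    }
  }
}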
Also, this doesn't join the different nodes into one SeaweedFS cluster yet, so it would need some more tweaking. However, I hope this helps.
@knthls Thanks a lot for the detailed investigation!
It seems this job spec does two things: creating a SeaweedFS cluster and setting up the CSI driver to connect to it. Separating it into two job specs would make things simpler.
For setting up the SeaweedFS cluster itself, the s3 and cronjob tasks should not be needed.
btw: we can create wiki pages for this.
@chrislusf I would be happy to share my configuration once it has proven to work well. Of course you're completely right, the s3 and cronjob tasks are not needed, although the S3 API will probably become part of my personal setup in the future. Also, you're right that separating the jobs is a good idea.
To give a recommendation for a working setup, I think I need to figure out some more details:
For a complete example, I think it would also be good to think about update strategies. Is there a health check for the CSI plugin? What happens if the plugin gets restarted while there are volumes mounted? Probably they should be unmounted first... Also, what happens if the filer and/or the volume server gets restarted while the CSI plugin is connected? Is it able to deal with such a situation?
The CSI driver is a wrapper for FUSE mount. If the filer is restarted, it will try to reconnect.
The other questions are related to setting up a SeaweedFS cluster on Nomad. I suppose it should be similar to how the Helm chart does it: https://github.com/chrislusf/seaweedfs/tree/master/k8s/seaweedfs
Data persistence: I noticed that when I restart the job, all the data becomes unavailable. I think this is because in this config, the metadata-stores are not persisted.
Yep, seeing this as well. Everything else works fine with just the CSI volume mapping, but if the seaweedfs-csi plugin job is killed the volumes become unresponsive.
Granted, it's probably a somewhat extreme situation, except maybe for CSI driver updates?
The CSI driver basically runs weed mount, which has some internal cache.
So if the SeaweedFS cluster is refreshed, the CSI driver should also restart.
@danlsgiga I noticed in https://github.com/seaweedfs/seaweedfs-csi-driver/pull/12 you already have it working with Nomad.
Do you want to share your configuration setup here or add it to the README?
What I'm seeing is that when the CSI driver pod is restarted, the mount goes bad (the cluster is not on the mount; it's on an external machine). You can see in the screenshot when I restarted the driver, as I was accessing the mount directly.
It's also entirely possible that Nomad itself is doing something and it's totally unrelated to the CSI driver, so I could just be contributing noise.
With that said, the job config in the comment by @knthls https://github.com/seaweedfs/seaweedfs-csi-driver/issues/31#issuecomment-898400526 works great for basic volumes on an external SeaweedFS cluster.
In the meantime I think I've got it to work: so far I haven't noticed any data loss, and the mounts have been working well. It also survived node drains and single-node reboots. This is the spec I use:
job "plugin-seaweedfs" {
datacenters = ["dc001"]
type = "system"
update {
max_parallel = 1
stagger = "60s"
}
group "nodes" {
network {
mode = "bridge"
port "csi" {
host_network = "vpn"
}
}
task "plugin" {
driver = "docker"
config {
image = "chrislusf/seaweedfs-csi-driver:latest"
args = [
"--endpoint=unix://csi/csi.sock",
"--filer=${NOMAD_IP_csi}:8888",
"--nodeid=${node.unique.name}",
"--cacheCapacityMB=2000",
"--cacheDir=/tmp",
]
privileged = true
}
csi_plugin {
id = "seaweedfs"
type = "monolith"
mount_dir = "/csi"
}
}
}
}
Notice a few details:
As my learning with Nomad goes on, I would suggest that it may be a good idea to point the cache dir to /alloc/data/tmp and add an ephemeral_disk stanza. This way, the cache would also survive updates.
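A rough, untested sketch of what that could look like in the plugin group above (only the relevant parts are shown):
group "nodes" {
  # keep the FUSE cache on the allocation's ephemeral disk so it survives in-place updates
  ephemeral_disk {
    size   = 2000 # MB, sized to cover --cacheCapacityMB
    sticky = true
  }
  task "plugin" {
    config {
      args = [
        "--endpoint=unix://csi/csi.sock",
        "--filer=${NOMAD_IP_csi}:8888",
        "--nodeid=${node.unique.name}",
        "--cacheCapacityMB=2000",
        "--cacheDir=/alloc/data/tmp", # instead of /tmp
      ]
      privileged = true
    }
    # csi_plugin stanza unchanged
  }
}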
If someone would be willing to contribute, I could share my entire SeaweedFS cluster setup, maybe employing Levant to customize some of the settings, in a separate repository. We could set up something analogous to a Helm chart for Nomad.
@knthls for the SeaweedFS cluster setup by Nomad, could you please share your setup in this folder? https://github.com/chrislusf/seaweedfs/tree/master/k8s
We can have two folders there, helmcharts and nomad. I can adjust the existing directory structure to move the existing Helm charts accordingly.
For the CSI-driver-specific Nomad setup, do you think adding it to the README.md is enough?
@knthls Any chance you can share the complete HCL file(s)? Do you use the S3 API as well?
Hi @chrislusf, sorry, I just saw your mention regarding this... due to unforeseen circumstances I'll only be able to revisit things in January, but I'll gladly help out here.
Also agreed, we should have some code and docs in place to help folks running on Nomad. One thing that I found problematic in my setup is that I was running multi-master, and service discovery was a challenge since I had to restart all masters whenever any of them changed IP (i.e. static IPs provided via flags are problematic in dynamic environments). If there's a better way to do that, it would be awesome for the Nomad use case.
Speaking specifically for Nomad, if you have a way to reload the master on the fly when the config file is updated, that should cover the multi-master-on-Nomad use case. I'll provide more info when I'm able to, by January.
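One untested idea along those lines: a Nomad template stanza can render the current set of masters from Consul and restart (or signal) the task whenever membership changes, so no static IPs have to be baked into flags. The service name and file path below are placeholders:
task "master" {
  # ...
  template {
    # render the currently registered masters from Consul
    data = <<-EOF
    {{- range service "seaweed-master" -}}
    {{ .Address }}:{{ .Port }} {{ end }}
    EOF
    destination = "local/masters.txt"
    change_mode = "restart" # or "signal", if weed master ever supports reloading its peer list
  }
}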
@knthls Any chance you can share the complete HCL file(s)? Do you use the S3 API as well?
Hi, sorry, my focus has shifted away from this at the moment. I have been a little lost there, so I can't guarantee that this is in a consistent state. Anyway, if it helps, my attempts went as far as this:
job "plugin-seaweedfs" {
datacenters = ["cwx001"]
type = "system"
update {
max_parallel = 1
stagger = "60s"
}
group "nodes" {
network {
mode = "host"
}
task "plugin" {
driver = "docker"
config {
image = "chrislusf/seaweedfs-csi-driver:latest"
args = [
"--endpoint=unix://csi/csi.sock",
"--filer=${attr.unique.hostname}.seaweedfs-filer.service.consul:8888",
"--nodeid=${attr.unique.hostname}",
"--cacheCapacityMB=1024",
"--cacheDir=/alloc/data"
]
privileged = true
}
csi_plugin {
id = "seaweedfs"
type = "monolith"
mount_dir = "/csi"
}
resources {
cpu = 100
memory = 256
memory_max = 2048
}
}
}
}
job "system-seaweedfs-master" {
datacenters = ["cwx001"]
type = "service"
group "seaweed-master" {
count = 3
constraint {
distinct_hosts = true
}
update {
max_parallel = 1
stagger = "5m"
canary = 1
}
network {
mode = "bridge"
port "master" {
host_network = "vpn"
static = 9333
}
port "master-grpc" {
host_network = "vpn"
static = 19333
}
}
ephemeral_disk {
migrate = true
size = 5000
sticky = true
}
task "find-peers" {
driver = "exec"
config {
command = "/bin/bash"
args = ["local/find-peers.sh"]
}
service {
name = "pending-seaweed-master"
port = "master"
}
lifecycle {
hook = "prestart"
sidecar = false
}
env {
N_PEERS = 3
}
template {
data = <<-EOF
# check if weed servers are already running
if wget -q -O status seaweed-master.service.consul:9333/cluster/status?pretty=y; then
# write the peer IPs space-separated, matching the format used below
grep -oE "\b([0-9]{1,3}\.){3}[0-9]{1,3}\b" status | \
tr "\n" " " > ${NOMAD_ALLOC_DIR}/data/peers
exit 0
fi
# else wait for pending peers
peers () {
dig +short pending-seaweed-master.service.${NOMAD_DC}.consul
}
weed_peers=$( peers )
while [ $( echo $weed_peers | wc -w) -lt $N_PEERS ]; do
echo "wait for peers ($weed_peers known)"
sleep 1
weed_peers=$( peers )
done
echo "found $weed_peers"
echo $weed_peers > ${NOMAD_ALLOC_DIR}/data/peers
sleep 5
EOF
destination = "local/find-peers.sh"
}
}
task "master" {
driver = "docker"
config {
image = "chrislusf/seaweedfs"
entrypoint = [
"/bin/ash",
"local/entrypoint.sh"
]
ports = ["master", "master-grpc"]
}
template {
data = <<-EOF
private_ip=$(cat /etc/hosts | tail -n 1| cut -d" " -f 1)
peers=$(printf "%s:9333\n" $(cat ${NOMAD_ALLOC_DIR}/data/peers) | \
grep -v $NOMAD_IP_master | paste -sd "," -)
weed -v=1 master \
-port=${NOMAD_PORT_master} \
-ip=${NOMAD_IP_master} \
-ip.bind=$private_ip \
-peers=$peers \
-defaultReplication=010 \
-mdir=/alloc/data
EOF
destination = "local/entrypoint.sh"
}
service {
name = "seaweed-master"
port = "master"
tags = ["${attr.unique.hostname}"]
check {
type = "http"
port = "master"
path = "/cluster/status"
interval = "10s"
timeout = "2s"
on_update = "ignore"
}
}
}
}
}
job "system-seaweedfs-filer" {
datacenters = ["cwx001"]
type = "system"
group "seaweed-filer" {
ephemeral_disk {
size = 2000
sticky = true
}
network {
mode = "bridge"
port "filer" {
host_network = "vpn"
static = 8888
to = 8888
}
port "filer-grpc" {
host_network = "vpn"
static = 18888
to = 18888
}
}
task "filer" {
driver = "docker"
config {
image = "chrislusf/seaweedfs"
entrypoint = [
"ash", "local/entrypoint"
]
ports = ["filer", "filer-grpc"]
tty = true
interactive = true
volumes = [
"local/filer.toml:/etc/seaweedfs/filer.toml"
]
}
template {
data = <<-EOF
[leveldb3]
enabled = true
dir = "{{ env "NOMAD_ALLOC_DIR" }}/data/filerldb3"
EOF
destination = "/local/filer.toml"
}
template {
# entrypoint
data = <<-EOF
# find internal ip to bind to
private_ip=$(cat /etc/hosts | tail -n 1| cut -d" " -f 1)
master_url="seaweed-master.service.${NOMAD_DC}.consul"
masters=$(
nslookup -query=SRV $master_url | \
grep $master_url | \
tr -s " " | \
cut -d " " -f 5,6 | \
sed -e 's/\([[:digit:]]\{1,4\}\)\s\(.*\)/\2:\1/g'
)
masters=$(echo $masters | tr " " ",")
echo "found the following seaweed master urls: $masters"
# find peers / volume servers
volume_service_url=seaweed-volume.service.${NOMAD_DC}.consul
filer_servers=$(nslookup -type=A $volume_service_url | grep -A1 "Name" | \
grep -oE "\b([0-9]{1,3}\.){3}[0-9]{1,3}\b")
filer_servers=$(echo "$filer_servers" | \
sed -e "s/\(.*\)/\1:$NOMAD_PORT_filer/g" | grep -v $NOMAD_IP_filer)
echo "found peers: "
echo "$filer_servers"
filer_servers=$(echo $filer_servers | tr " " ",")
weed filer \
-master $masters \
-ip $private_ip \
-port ${NOMAD_PORT_filer} \
-rack=${NOMAD_NODE} \
-dataCenter=${NOMAD_DC} \
-peers=$filer_servers \
-defaultReplicaPlacement=010 \
-maxMB=32
EOF
destination = "local/entrypoint"
}
env {
NOMAD_NODE = "${node.unique.name}"
}
service {
name = "seaweed-filer"
port = "filer"
tags = ["${attr.unique.hostname}"]
check {
type = "http"
port = "filer"
path = "/"
interval = "30s"
timeout = "2s"
on_update = "ignore"
}
}
resources {
cpu = 200
memory = 256
memory_max = 2048
}
}
}
}
job "system-seaweedfs-volume" {
datacenters = ["cwx001"]
type = "system"
group "seaweed-volume" {
network {
mode = "bridge"
port "volume" {
host_network = "vpn"
static = "8889"
}
port "volume-grpc" {
host_network = "vpn"
static = "18889"
}
}
update {
max_parallel = 1
stagger = "60s"
}
volume "seaweed" {
type = "host"
source = "seaweed"
}
task "volume" {
driver = "docker"
config {
image = "chrislusf/seaweedfs"
entrypoint = ["ash", "/local/entrypoint"]
ports = ["volume", "volume-grpc"]
}
template {
data = <<-EOF
master_url="seaweed-master.service.${NOMAD_DC}.consul"
master_servers=$(nslookup -type=A $master_url | grep -A1 "Name" | \
grep -oE "\b([0-9]{1,3}\.){3}[0-9]{1,3}\b")
master_servers=$(echo "$master_servers" | \
sed -e "s/\(.*\)/\1:9333/g")
master_servers=$(echo $master_servers | tr " " ",")
echo "found the following seaweed master urls: $master_servers"
weed -v 1 volume \
-mserver $master_servers \
-publicUrl ${NOMAD_IP_volume}:${NOMAD_PORT_volume} \
-ip ${NOMAD_NODE}.seaweed-volume.service.consul \
-ip.bind 0.0.0.0 \
-dir /data \
-port ${NOMAD_PORT_volume} \
-port.public ${NOMAD_PORT_volume} \
-minFreeSpace 10 \
-dataCenter=${NOMAD_DC} \
-rack=${NOMAD_NODE} \
-max 60 \
-compactionMBps=50 \
-readMode=proxy \
-index leveldb \
-fileSizeLimitMB=1024
EOF
destination = "local/entrypoint"
}
volume_mount {
volume = "seaweed"
destination = "/data"
}
env {
NOMAD_NODE = "${node.unique.name}"
}
service {
name = "seaweed-volume"
port = "volume"
tags = ["${attr.unique.hostname}"]
check {
type = "http"
port = "volume"
path = "/status"
interval = "10s"
timeout = "2s"
on_update = "ignore"
}
}
resources {
cpu = 1000
memory = 2048
}
}
}
}
A lot of this isn't exactly essential, and I have had problems with unreachable volumes, data loss, and general instability. I hope this helps, but I am out for now.
@chrislusf I got it all working on my homelab, but I had to build seaweedfs-csi-driver myself to get an arm64 build. I see you have the chrislusf/seaweedfs-csi-driver:dev build on arm64 already. Any chance to get chrislusf/seaweedfs-csi-driver:latest on arm64 too?
Once that is done I can share all the Nomad jobs here!
@danlsgiga thanks! I have adjusted the build process, and the latest tag now has an arm64 build.
Thanks @chrislusf... I'm going to run the latest tag here, and once I validate that it all works I can push a PR with the working Nomad job. What is the best path for that?
Thanks! Maybe create a nomad directory here? https://github.com/seaweedfs/seaweedfs-csi-driver/tree/master/deploy
I have created an MR with an example of how to deploy SeaweedFS on Nomad. It is based on @knthls's setup.
When I run multiple batch jobs, the CSI driver creates a new directory in cacheDir for every mount. The problem is that these cache directories are not removed after unmount, which eventually causes the node to run out of disk space.
I have created issue https://github.com/chrislusf/seaweedfs/issues/2923
Hi! It looks like the filer needs a Postgres instance (understandably, to store file metadata). In the Nomad deployment folder, it is deployed on a separate new node.
Is it possible to use the existing SeaweedFS store to back that instance, instead of maintaining yet another data storage node?
@blmhemu I would recommend not storing PostgreSQL on SeaweedFS.
It is a lot safer to use a host_volume with Patroni. This way you get HA with automatic failover and native disk performance.
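For reference, a rough sketch (names and paths are placeholders) of what that looks like: the host volume is declared in the Nomad client configuration and then requested by the Patroni/PostgreSQL job:
# Nomad client configuration, on every node that may run PostgreSQL
client {
  host_volume "postgres-data" {
    path      = "/opt/postgres"
    read_only = false
  }
}

# in the job file
group "postgres" {
  volume "pgdata" {
    type   = "host"
    source = "postgres-data"
  }
  task "postgres" {
    # mount the host volume at the PostgreSQL data directory
    volume_mount {
      volume      = "pgdata"
      destination = "/var/lib/postgresql/data"
    }
  }
}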
@blmhemu I would recommend not storing PostgreSQL on SeaweedFS.
@nahsi Is this a general recommendation or just for the filer? The reason I ask: after I set up the filer (with Postgres on a host volume), I am having issues running another Postgres instance (for another app) on SeaweedFS. See https://github.com/seaweedfs/seaweedfs-csi-driver/issues/64
@blmhemu a general recommendation. It is always better to run databases on host volumes if you can (or on volumes provided by AWS EBS or similar).
But with SeaweedFS especially, if you are running Postgres on a seaweedfs-csi volume, be prepared for data corruption. seaweedfs-csi uses FUSE; if anything happens to it (Nomad client restart, Docker restart, OOM), the mount will be lost and data corruption will follow.
Running on Ceph (since Ceph CSI uses a kernel driver, not FUSE) is acceptable if you are fine with low TPS.
Hi, is there any plan to provide support for HashiCorp Nomad? The Nomad docs mention that any Kubernetes CSI plugin should work out of the box with Nomad, but I am confused about how the SeaweedFS CSI driver will fit in with Nomad. Any suggestions will be helpful.