seaweedfs / seaweedfs-csi-driver

SeaweedFS CSI Driver https://github.com/seaweedfs/seaweedfs
Apache License 2.0

rpc error: code = Unavailable desc = transport is closing => Transport Endpoint Not Connected - using Seaweedfs with Nomad #147

Open Lukas8342 opened 9 months ago

Lukas8342 commented 9 months ago

Hello,

I'm encountering an issue, and I'm unsure whether it stems from SeaweedFS, the seaweedfs-csi-driver, or HashiCorp Nomad. I'm reaching out here as a starting point, hoping for guidance, as my troubleshooting options are running thin. In my current setup, I have one master, one filer, and one volume server, all running on the same machine with these configurations:

```bash
weed master -ip=85.215.193.71 -ip.bind=0.0.0.0 -mdir=/seatest/m -port=9333 -port.grpc=19333
weed volume -mserver=85.215.193.71:9333 -dir=/seatest/d -dataCenter=dc1 -ip=85.215.193.71 -max=30 -ip.bind=0.0.0.0 -port=8080 -port.grpc=18080
weed filer -ip=85.215.193.71 -master=85.215.193.71:9333 -dataCenter=dc1 -rack=rack1
```

I deploy the CSI plugin with the following Nomad job:

```hcl
job "seaweedfs-plugin" {
  datacenters = ["dc1"]
  type        = "system"

  constraint {
    operator = "distinct_hosts"
    value    = true
  }

  group "nodes" {
    task "plugin" {
      driver = "docker"

      config {
        image      = "chrislusf/seaweedfs-csi-driver"
        privileged = true

        args = [
          "--endpoint=unix://csi/csi.sock",
          "--filer=10.7.230.11:8888",
          "--nodeid=${node.unique.name}",
          "--cacheCapacityMB=1000",
          "--cacheDir=/tmp",
        ]
      }

      csi_plugin {
        id        = "seaweedfs"
        type      = "monolith"
        mount_dir = "/csi"
      }
    }
  }
}
```

It initially appears to work, but when running jobs with various images that mount a volume, I consistently encounter a "Transport endpoint is not connected" error.

The filer logs display the following when starting a job and mounting it to a volume:

```bash
I1208 15:45:26.862162 filer_grpc_server_sub_meta.go:268 => client mount@172.17.0.2:52516: rpc error: code = Unavailable desc = transport is closing
E1208 15:45:26.862195 filer_grpc_server_sub_meta.go:78 processed to 2023-12-08 15:45:26.861541202 +0000 UTC: rpc error: code = Unavailable desc = transport is closing
I1208 15:45:26.862584 filer_grpc_server_sub_meta.go:312 -  listener mount@172.17.0.2:52516 clientId -399912238 clientEpoch 2
I1208 15:45:26.862933 filer_grpc_server_sub_meta.go:296 +  listener mount@172.17.0.2:54900 clientId -1540680978 clientEpoch 2
I1208 15:45:26.862949 filer_grpc_server_sub_meta.go:36  mount@172.17.0.2:54900 starts to subscribe /buckets/dat from 2023-12-08 15:45:26.862037157 +0000 UTC
```

Nomad volume mounting is done as follows:

```hcl
job "sonatype-nexus" {
  datacenters = ["dc1"]

  group "nexus" {
    count = 1
    network {
      port "http" {
        static = 8081
      }
    }

    volume "vol" {
      type            = "csi"
      read_only       = false
      source          = "nexus-volume"
      access_mode     = "single-node-writer"
      attachment_mode = "file-system"
    }

    task "server" {
      driver = "docker"
      volume_mount {
        volume      = "vol"
        destination = "/nexus-data"
        read_only   = false
      }

      config {
        image = "sonatype/nexus3:latest"
        ports = ["http"]
      }

      resources {
        cpu    = 2000
        memory = 4000
      }
    }
  }
}
```

I appreciate any insights or guidance you can provide to help resolve this issue.

Thank you.

worotyns commented 8 months ago

Same here on Nomad; the csi-plugin also logs: panic: unable to load in-cluster configuration, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined

On image chrislusf/seaweedfs-csi-driver:v1.1.8 it works fine :) It looks like this change caused the problem: https://github.com/seaweedfs/seaweedfs-csi-driver/commit/785e69a08ef47eab94742b040870ec0716f20f13
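
(Note: that panic message is the exact text of client-go's `rest.ErrNotInCluster`, which `rest.InClusterConfig()` returns when those environment variables are unset, as they are under Nomad. A minimal sketch of the failure mode, assuming the driver builds an in-cluster Kubernetes client at startup:)

```go
package main

import (
	"log"

	"k8s.io/client-go/rest"
)

func main() {
	// Under Nomad, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT are
	// unset, so InClusterConfig returns rest.ErrNotInCluster with exactly
	// the message seen in the plugin logs.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		// A driver that treats this error as fatal dies here on Nomad.
		log.Fatalf("cannot build in-cluster client: %v", err)
	}
	_ = cfg
}
```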

qskousen-membersy commented 7 months ago

Can confirm that the latest image is broken for me on Nomad, with error messages referencing Kubernetes, and that using v1.1.8 as @worotyns suggested works.

chrislusf commented 6 months ago

@duanhongyi please take a look here.

duanhongyi commented 6 months ago

@chrislusf It seems to be incompatible with Nomad: the KUBERNETES_SERVICE_HOST environment variable does not exist under Nomad.

Let me take a look in the next few days.

nahsi commented 3 months ago

Still broken in the latest version.

I think this commit completely broke this CSI driver: https://github.com/seaweedfs/seaweedfs-csi-driver/commit/785e69a08ef47eab94742b040870ec0716f20f13#diff-d7f330f6d6efcabc25613925c10237045948e05bc020c7ecf16c3b331e371e62

chrislusf commented 3 months ago

Send a PR to revert this change?

duanhongyi commented 3 months ago

@chrislusf

I think we can degrade gracefully; that is, on Nomad the CSI driver simply would not support capacity limits.

Is this feasible? It is the simplest modification, but I currently do not have a Nomad cluster to experiment with.

The pseudocode is as follows; the key part is the maxVolumeSize fallback:

```go
import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// GetVolumeCapacity looks up the capacity of the PersistentVolume backing
// volumeId. Outside Kubernetes (e.g. under Nomad) the in-cluster client
// cannot be built, so we fall back to maxVolumeSize instead of failing.
func GetVolumeCapacity(volumeId string) (int64, error) {
	config, err := rest.InClusterConfig()
	if err != nil {
		// Not running inside Kubernetes: report the maximum capacity.
		return maxVolumeSize, nil
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		return maxVolumeSize, nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	volume, err := client.CoreV1().PersistentVolumes().Get(ctx, volumeId, metav1.GetOptions{})
	if err != nil {
		return 0, err
	}
	capacity, _ := volume.Spec.Capacity.Storage().AsInt64()
	return capacity, nil
}
```
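
With this fallback, the driver effectively treats every volume as having maxVolumeSize capacity whenever it cannot query a Kubernetes PersistentVolume, so capacity limits would only be enforced on Kubernetes.
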
duanhongyi commented 3 months ago

I have looked at Nomad's API and it is not the standard K8s API, so the simplest fix is to skip querying the capacity of a Nomad volume and directly return the maximum value.

https://developer.hashicorp.com/nomad/api-docs/volumes

If this is feasible, I will submit a PR tomorrow.
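
One way to express that runtime check, as a minimal sketch (`inKubernetes` is a hypothetical helper for illustration, not code from the driver):

```go
import "os"

// inKubernetes reports whether the plugin is running inside a Kubernetes
// pod; the kubelet injects KUBERNETES_SERVICE_HOST into every container.
// Hypothetical helper for illustration only.
func inKubernetes() bool {
	return os.Getenv("KUBERNETES_SERVICE_HOST") != ""
}
```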

duanhongyi commented 3 months ago

#168