kubernetes / minikube

Run Kubernetes locally
https://minikube.sigs.k8s.io/
Apache License 2.0
29.01k stars · 4.85k forks

nfs: Failed to resolve server nfs-server.default.svc.cluster.local: Name or service not known #3417

Open fhaifler opened 5 years ago

fhaifler commented 5 years ago

BUG REPORT

Environment:

Minikube version: v0.30.0

What happened: NFS volume fails to mount due to DNS error (Failed to resolve server nfs-server.default.svc.cluster.local: Name or service not known). This problem does not occur when deployed on GKE.

What you expected to happen: NFS volume is mounted without an error.

How to reproduce it (as minimally and precisely as possible):

  1. Start nfs-server:
    apiVersion: extensions/v1beta1
    kind: Deployment
    metadata:
      name: nfs-server
    spec:
      replicas: 1
      selector:
        matchLabels:
          role: nfs-server
      template:
        metadata:
          labels:
            role: nfs-server
        spec:
          containers:
          - name: nfs-server
            image: gcr.io/google_containers/volume-nfs:0.8
            ports:
            - name: nfs
              containerPort: 2049
            - name: mountd
              containerPort: 20048
            - name: rpcbind
              containerPort: 111
            securityContext:
              privileged: true
            volumeMounts:
            - mountPath: /exports
              name: exports
          volumes:
          - name: exports
            emptyDir: {}
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: nfs-server
    spec:
      ports:
      - name: nfs
        port: 2049
      - name: mountd
        port: 20048
      - name: rpcbind
        port: 111
      selector:
        role: nfs-server
  2. Start service consuming the nfs volume (e.g. busybox):
    apiVersion: v1
    kind: ReplicationController
    metadata:
      name: nfs-busybox
    spec:
      replicas: 1
      selector:
        name: nfs-busybox
      template:
        metadata:
          labels:
            name: nfs-busybox
        spec:
          containers:
          - image: busybox
            command:
              - sh
              - -c
              - 'while true; do date > /mnt/index.html; hostname >> /mnt/index.html; sleep $(($RANDOM % 5 + 5)); done'
            imagePullPolicy: IfNotPresent
            name: busybox
            volumeMounts:
              - name: nfs
                mountPath: "/mnt"
          volumes:
          - name: nfs
            nfs:
              server: nfs-server.default.svc.cluster.local
              path: "/"

Output of minikube logs (if applicable): kubectl describe pod nfs-busybox-... shows this error:

  Warning  FailedMount  4m    kubelet, minikube  MountVolume.SetUp failed for volume "nfs" : mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/ab2e9ad4-f88b-11e8-8a56-4004c9e1505b/volumes/kubernetes.io~nfs/nfs --scope -- mount -t nfs nfs-server.default.svc.cluster.local:/ /var/lib/kubelet/pods/ab2e9ad4-f88b-11e8-8a56-4004c9e1505b/volumes/kubernetes.io~nfs/nfs
Output: Running scope as unit: run-r23cae2998bf349df8046ac3c61bfe4e9.scope
mount.nfs: Failed to resolve server nfs-server.default.svc.cluster.local: Name or service not known

This indicates a problem with DNS resolution for nfs-server.default.svc.cluster.local.

Note: The NFS is mounted successfully when specified by ClusterIP instead of domain name.
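As a hedged illustration of that ClusterIP workaround, the volume spec changes as below. The IP shown is the Service ClusterIP from the nslookup output later in the report; it is assigned per cluster and changes whenever the Service is recreated, so treat it as an example value:

```yaml
# Workaround sketch: mount by the Service ClusterIP instead of the DNS name,
# since mount.nfs on the node cannot resolve *.svc.cluster.local names.
volumes:
- name: nfs
  nfs:
    server: 10.105.22.251   # ClusterIP of the nfs-server Service (example value)
    path: "/"
```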

Anything else we need to know: The same problem was already reported for a previous version in #2218, but that issue was closed due to inactivity of the author and no one seems to have really looked into it. There is a workaround, but it has to be repeated every time a minikube VM is created.

When running kubectl exec -ti nfs-busybox-... -- nslookup nfs-server.default.svc.cluster.local:

Server:         10.96.0.10
Address:        10.96.0.10:53

Name:   nfs-server.default.svc.cluster.local
Address: 10.105.22.251

*** Can't find nfs-server.default.svc.cluster.local: No answer

Strangely, the service ClusterIP is present in the answer (when using kube-dns, the service ClusterIP part is missing completely).

tamalsaha commented 5 years ago

Have you seen https://github.com/kubernetes/minikube/issues/2218#issuecomment-436821733 ?

fhaifler commented 5 years ago

@tamalsaha Yes, I have seen it, but only a workaround was posted there, not an actual fix.

remohoeppli commented 5 years ago

We have the same issue:

Having this error message from the pod:

Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/7940ceed-ffad-11e8-890b-005056010f5a/volumes/kubernetes.io~nfs/pv-nfs-10gi --scope -- mount -t nfs ext-nfs-svc.default.svc.cluster.local:/data/nfs/test /var/lib/kubelet/pods/7940ceed-ffad-11e8-890b-005056010f5a/volumes/kubernetes.io~nfs/pv-nfs-10gi
Output: Running scope as unit: run-r3a24d6989c5d4e0c99d4b0eb5429a210.scope
mount.nfs: Failed to resolve server ext-nfs-svc.default.svc.cluster.local: Name or service not known

Even though resolving works as expected: kubectl exec -it busybox -- nslookup ext-nfs-svc.default.svc.cluster.local

Answer is:

Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      ext-nfs-svc.default.svc.cluster.local
Address 1: 10.96.152.237 ext-nfs-svc.default.svc.cluster.local

Using the ip for nfs connection works as described above.

tstromberg commented 5 years ago

I suspect this is because NFS on the host system doesn't currently point to 10.96.0.10 within the guest VM - only within pods for what appears to be obsolete historical reasons. I could be completely wrong though.

remohoeppli commented 5 years ago

I guess you are right. Defining the IP for ext-nfs-svc.default.svc.cluster.local in the hosts file of the cluster workers does solve the problem. Somehow the NFS mounting does not use the cluster-internal DNS resolution, and it also does not use the external IP defined in the service. I'm not sure whether this is the expected behaviour, but to me it does not make much sense.

bondarewicz commented 5 years ago

👀

astleychen commented 5 years ago

well, I'm running into the same issue on EKS as well. By defining the nfs server IP directly, it just works. Is it a known issue on EKS as well? or probably should I go to EFS on AWS? :(

ikkerens commented 5 years ago

Apologies, I'm not a Minikube user but this is the most apt issue I've found for the problems that I'm having.

I'm experiencing exactly these problems.

Based on my googling efforts so far, this seems to be a Kubernetes issue where the NFS mount is attempted before the container can reach coredns. Perhaps an initialization-order problem?

remohoeppli commented 5 years ago

The problem is that the components responsible for NFS storage backends do not use the cluster-internal DNS but try to resolve the NFS server with the DNS information given on the worker node itself. One way to make this work is to add a hosts-file entry on the worker nodes mapping nfs-server.default.svc.cluster.local to the nfs-server's IP address. But this is just a quick and dirty hack-around.

But it's just odd that this component is not able to use the cluster internal DNS resolution. This would make much more sense and be more intuitive to use.
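For minikube specifically, that hack-around can be sketched roughly as follows. This assumes a single-node minikube VM and the nfs-server Service from the original report, and it must be rerun whenever the VM or the Service is recreated:

```shell
#!/bin/sh
# Look up the Service's current ClusterIP via kubectl (runs on the host),
# then pin the cluster DNS name in the node's /etc/hosts via minikube ssh.
NFS_HOST=nfs-server.default.svc.cluster.local
NFS_IP=$(kubectl get svc nfs-server -o jsonpath='{.spec.clusterIP}')
HOSTS_LINE="${NFS_IP} ${NFS_HOST}"
minikube ssh "echo '${HOSTS_LINE}' | sudo tee -a /etc/hosts"
```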

bmbferreira commented 5 years ago

well, I'm running into the same issue on EKS as well. By defining the nfs server IP directly, it just works. Is it a known issue on EKS as well? or probably should I go to EFS on AWS? :(

I'm also having this issue on EKS.

remohoeppli commented 5 years ago

I don't think it's an issue related to any specific kubernetes cloud solution, but a general one.

ikkerens commented 5 years ago

From what I can tell, the only solution to this would be to have the k8s node have access to k8s's coredns, which is responsible for resolving these names. However in my experience most k8s nodes use their own dns independent of k8s.

remohoeppli commented 5 years ago

@ikkerens I'm pretty sure that would work. Having an Ingress for the kube-dns service which is only reachable from the k8s-nodes itself could achieve this. But as you said, one would have to change the dns settings on the nodes.

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

rjohnson3 commented 4 years ago

/remove-lifecycle stale

zvonkok commented 4 years ago

I have the same issue on AWS with an NFS server backed by an EBS disk. Using the IP address works just fine; the nfs server name cannot be resolved.

dafrenchyman commented 4 years ago

I'm running into the same issue. I can get it to work fine in GKE, won't work locally.

raftAtGit commented 4 years ago

Same issue on Azure AKS too.

ramkrishnan8994 commented 4 years ago

BUG REPORT

Environment:

Minikube version: v0.30.0

  • OS: Fedora 29
  • VM Driver: virtualbox, kvm2
  • ISO version: v0.30.0
  • Others:

    • kubernetes version: tested on v1.10.0, v1.13.0
    • tested with coredns and kube-dns minikube addons

(The rest of the quoted report — reproduction steps, mount error, and nslookup output — is identical to @fhaifler's original report above.)

@fhaifler - With these configurations there is no data being shared between the pods. That is, anything inside the '/' is not visible inside the '/mnt' folder. Any idea why?

Also, I'm not able to mount the '/nfs-data-example-folder' into the '/mnt' folder. It throws a permission error. Any idea why?

fhaifler commented 4 years ago

@ramkrishnan8994 I am not sure I understand the question. Have you managed to make it work even with the domain name for nfs server (nfs-server.default.svc.cluster.local)? It is still not working for me even with updated minikube.

That is, anything inside the '/' is not visible inside the '/mnt' folder.

I am not sure what you mean. / corresponds to the root directory exported by the nfs server, i.e. the /exports directory inside the nfs-server pod. The same content should be visible inside nfs-busybox under the /mnt directory.

Also, I'm not able to mount the '/nfs-data-example-folder' into '/mnt' folder. It throws permission error.

I don't know what /nfs-data-example-folder should be. Can you elaborate please?

tstromberg commented 4 years ago

This would likely be addressed by resolving #2162 (help wanted)

pievalentin commented 4 years ago

I ran into the same issue with Azure AKS but not with Google GKE. How come Google has a fix and the other cloud providers don't?

SimonHeimberg commented 4 years ago

This is a known issue in Kubernetes: https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/#known-issues

Kubernetes installs do not configure the nodes’ resolv.conf files to use the cluster DNS by default, because that process is inherently distribution-specific. This should probably be implemented eventually.

seen in https://github.com/kubernetes/minikube/issues/2162#issuecomment-533696513

SimonHeimberg commented 4 years ago

Ideas for workarounds

Write /etc/hosts on all nodes (independent of distribution), or configure the nodes to use the cluster DNS.

/etc/hosts manually

Manually write name of service in /etc/hosts on all nodes

/etc/hosts partially automated

daemonset with an init container doing the update and rancher/pause as the app container. The init container gets a list of services to handle, looks up the IP address of each service, and writes name and IP to /to_edit/hosts (which is mounted from the node's /etc/hosts). On changes, restart the daemonset manually.

/etc/hosts fully automated

Write a controller which listens to all services (or only specially labeled services) and writes /etc/hosts on each host. See links in https://github.com/kubernetes/kubernetes/issues/64623#issuecomment-609875003

resolv.conf manually

Update resolv.conf manually on each node. Depending on the distribution (using systemd, ...), this may differ. Find the nameserver in /etc/resolv.conf of any pod.

resolv.conf automated

daemonset with an init container doing the update and rancher/pause as the app container. The init container updates /to_edit/resolv.conf, which is mounted from the host. No restart required.
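The "/etc/hosts partially automated" variant above could be sketched like this. All names (hosts-updater, the image tags, the handled service) are hypothetical, and the nslookup output parsing may need adjusting for a different image:

```yaml
# Sketch: DaemonSet whose init container resolves a Service via cluster DNS
# and appends the name/IP pair to the node's /etc/hosts, with rancher/pause
# as the long-running app container. Restart manually when the IP changes.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hosts-updater
spec:
  selector:
    matchLabels:
      app: hosts-updater
  template:
    metadata:
      labels:
        app: hosts-updater
    spec:
      initContainers:
      - name: update-hosts
        image: busybox
        command:
        - sh
        - -c
        - |
          # Resolve the service through the cluster DNS, then append the
          # entry to the node's hosts file if it is not there yet.
          SVC=nfs-server.default.svc.cluster.local
          IP=$(nslookup "$SVC" | awk '/^Address/ {a=$NF} END {print a}')
          grep -q "$SVC" /to_edit/hosts || echo "$IP $SVC" >> /to_edit/hosts
        volumeMounts:
        - name: hosts
          mountPath: /to_edit/hosts
      containers:
      - name: pause
        image: rancher/pause:3.1
      volumes:
      - name: hosts
        hostPath:
          path: /etc/hosts
          type: File
```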

Tristan971 commented 4 years ago

For anyone else running into this in general (not only with minikube), I've made a small image+daemonset that basically does the latter option mentioned above (a daemonset updating the host's /etc/systemd/resolved.conf).

Should work in most scenarios where the cloud provider isn't doing something too funky with their DNS config: https://github.com/Tristan971/kube-enable-coredns-on-node

(Bit dirty/ad-hoc in its current state, but could be made to support more host setups.)

EDIT: Brian's solution, right below, is the best current solution.

BrianHuf commented 4 years ago

I was able to solve this problem by creating a service with a static clusterIP and then mounting to the IP instead of the service name. No DNS required. This is working nicely on Azure; I haven't tried it elsewhere.

In my case, I'm using an HDFS NFS Gateway and chose 10.0.200.2 for the clusterIP

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hdfs
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: Service
metadata:
  name: hdfs-nfs
  labels:
    component: hdfs-nn
spec:
  type: ClusterIP
  clusterIP: 10.0.200.2
  ports:
    - name: portmapper
      port: 111
      protocol: TCP
    - name: nfs
      port: 2049
      protocol: TCP
    - name: mountd
      port: 4242
      protocol: TCP
  selector:
    component: hdfs-nn
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: hdfs
spec:
  storageClassName: hdfs
  capacity:
    storage: 3000Gi
  accessModes:
    - ReadWriteMany
  mountOptions:
    - vers=3
    - proto=tcp
    - nolock
    - noacl
    - sync    
  nfs:
    server: 10.0.200.2
    path: "/"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: hdfs
spec:
  storageClassName: hdfs
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 3000Gi

Karthik-Rangasamy commented 3 years ago

Would mounting it inside the container be an option? I.e. the traditional way of installing nfs-client in the container and using the mount command, instead of letting Kubernetes mount it?

atrbgithub commented 3 years ago

@BrianHuf thanks for sharing your solution. Using minikube this works for us.

Unfortunately without this method we just get the error as per the issue title.

sharifelgamal commented 2 years ago

I'll leave this open with the workaround for discoverability, and in case we do ever fix it permanently in minikube.

willzhang commented 2 years ago

Same issue when using csi-driver-nfs:

https://github.com/kubernetes-csi/csi-driver-nfs/blob/master/deploy/example/nfs-provisioner/README.md

root@ubuntu:/data/kubevirt# kubectl describe pods nginx-nfs-example
Name:         nginx-nfs-example
Namespace:    default
Priority:     0
Node:         node1/192.168.72.31
Start Time:   Fri, 20 May 2022 18:01:08 +0800
Labels:       <none>
Annotations:  <none>
Status:       Pending
IP:           
IPs:          <none>
Containers:
  nginx:
    Container ID:   
    Image:          nginx
    Image ID:       
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v7h85 (ro)
      /var/www from pvc-nginx (rw)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  pvc-nginx:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pvc-nginx
    ReadOnly:   false
  kube-api-access-v7h85:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  12m (x4 over 12m)    default-scheduler  0/3 nodes are available: 3 pod has unbound immediate PersistentVolumeClaims. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
  Normal   Scheduled         12m                  default-scheduler  Successfully assigned default/nginx-nfs-example to node1
  Warning  FailedMount       5m34s                kubelet            Unable to attach or mount volumes: unmounted volumes=[pvc-nginx], unattached volumes=[kube-api-access-v7h85 pvc-nginx]: timed out waiting for the condition
  Warning  FailedMount       110s (x13 over 12m)  kubelet            MountVolume.SetUp failed for volume "pv-nginx" : rpc error: code = Internal desc = mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t nfs -o nfsvers=4.1 nfs-server.default.svc.cluster.local:/ /var/lib/kubelet/pods/d534e8dc-6364-40c1-989e-4448d5e6ae3c/volumes/kubernetes.io~csi/pv-nginx/mount
Output: mount.nfs: Failed to resolve server nfs-server.default.svc.cluster.local: Name or service not known
  Warning  FailedMount  62s (x4 over 10m)  kubelet  Unable to attach or mount volumes: unmounted volumes=[pvc-nginx], unattached volumes=[pvc-nginx kube-api-access-v7h85]: timed out waiting for the condition
fosmjo commented 1 year ago

@willzhang If you are using NFS CSI driver v4.1.0 or v4.0.0, try changing the dnsPolicy of csi-nfs-controller and csi-nfs-node to ClusterFirstWithHostNet, it works for me.
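That dnsPolicy change might look like the sketch below. The workload names and namespace assume the default csi-driver-nfs manifests; with a Helm install, setting the controller.dnsPolicy value mentioned later in the thread would be the equivalent:

```shell
#!/bin/sh
# Sketch: set dnsPolicy so the driver pods (which use hostNetwork) still
# query the cluster DNS. Names assume the default csi-driver-nfs install.
PATCH='{"spec":{"template":{"spec":{"dnsPolicy":"ClusterFirstWithHostNet"}}}}'
kubectl patch deployment csi-nfs-controller -n kube-system --type merge -p "$PATCH"
kubectl patch daemonset  csi-nfs-node       -n kube-system --type merge -p "$PATCH"
```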

purplesoda commented 1 year ago

For anyone else finding themselves in the same situation who can't use the ClusterIP service: I was also able to get it to work using the NFS CSI Driver, as @fosmjo mentioned above. Apparently v4.4.0 defaults to the necessary dnsPolicy as well, so no configuration beyond their default helm chart is needed. Figured I'd drop a full example for copy pasta.

Installed the helm chart from their repo:

helm repo add csi-driver-nfs https://raw.githubusercontent.com/kubernetes-csi/csi-driver-nfs/master/charts
helm install csi-driver-nfs csi-driver-nfs/csi-driver-nfs --namespace kube-system --version v4.4.0

I'm running NFS inside my cluster using the gp2 StorageClass to create an EBS-backed volume for my deployment. Here's my template:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nfs-server
  namespace: storage
spec:
  replicas: 1
  selector:
    matchLabels:
      role: nfs-server
  template:
    metadata:
      labels:
        role: nfs-server
    spec:
      containers:
      - name: nfs-server
        image: itsthenetwork/nfs-server-alpine:latest
        ports:
          - name: nfs
            containerPort: 2049
        securityContext:
          privileged: true
        volumeMounts:
          - mountPath: /nfs
            name: nfs-volume
        env:
          - name: SHARED_DIRECTORY
            value: /nfs
      volumes:
        - name: nfs-volume
          persistentVolumeClaim:
            claimName: nfs-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: nfs-service
  namespace: storage
spec:
  ports:
    - name: nfs
      port: 2049
  selector:
    role: nfs-server

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc
  namespace: storage
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp2
  resources:
    requests:
      storage: 2Gi

Lastly, create the StorageClass, PVC, and Deployment that will mount your NFS share:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
        volumeMounts:
        - name: nfs
          mountPath: /usr/share/nginx/html
      volumes:
        - name: nfs
          persistentVolumeClaim:
            claimName: nfs-pvc-nginx
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-csi
provisioner: nfs.csi.k8s.io
parameters:
  server: nfs-service.storage.svc.cluster.local
  share: /

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-pvc-nginx
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: nfs-csi
  resources:
    requests:
      storage: 1Gi 
soroush commented 1 month ago

This has returned in csi-driver-nfs v4.7.0. The workaround of changing controller.dnsPolicy has no effect (it already is ClusterFirstWithHostNet).

Re-open?