smartxworks / virtink

Lightweight Virtualization Add-on for Kubernetes
Apache License 2.0
481 stars, 37 forks

Cannot access internet: DNS resolution via Kubernetes DNS times out #103

Closed: sharadregoti closed this issue 1 month ago

sharadregoti commented 1 month ago

This is my virtual machine config

apiVersion: virt.virtink.smartx.com/v1alpha1
kind: VirtualMachine
metadata:
  labels:
    virtlink.io/os: linux
    virtlink.io/vm: "{{ .Values.virtualMachine.name }}"
  name: "{{ .Values.virtualMachine.name }}"
spec:
  instance:
    # TODO: Set up cpu
    # https://github.com/smartxworks/virtink/blob/main/docs/dedicated_cpu_placement.md
    memory:
      size: "{{ .Values.memory }}"
    interfaces:
      - name: pod
    disks:
      - name: image
      - name: cloud-init
  volumes:
    - name: image
      containerRootfs:
        # CDI image for Ubuntu Jammy
        image: "{{ .Values.image }}"
        size: 10Gi
    - name: cloud-init
      cloudInit:
        userData: |-
          #cloud-config
          hostname: {{ .Values.virtualMachine.name }}
          password: ubuntu
          chpasswd: { expire: False }
          package_update: true
          package_upgrade: true
          ssh_authorized_keys:
            - {{ .Values.virtualMachine.sshKey }}
  networks:
    - name: pod
      pod: {}

Network Debugging Logs

ubuntu@node-01:~$ dig +short google.com
;; communications error to 127.0.0.53#53: timed out
;; communications error to 127.0.0.53#53: timed out
;; communications error to 127.0.0.53#53: timed out
;; no servers could be reached

ubuntu@node-01:~$ cat /etc/resolv.conf 
# This is /run/systemd/resolve/stub-resolv.conf managed by man:systemd-resolved(8).
# Do not edit.
#
# This file might be symlinked as /etc/resolv.conf. If you're looking at
# /etc/resolv.conf and seeing this text, you have followed the symlink.
#
# This is a dynamic resolv.conf file for connecting local clients to the
# internal DNS stub resolver of systemd-resolved. This file lists all
# configured search domains.
#
# Run "resolvectl status" to see details about the uplink DNS servers
# currently in use.
#
# Third party programs should typically not access this file directly, but only
# through the symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a
# different way, replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 127.0.0.53
options edns0 trust-ad
search lab-c7a83719-938d-4836-9071-7efa3a790d23--978000.svc.cluster.local svc.cluster.local cluster.local

# Checking DNS IP configured via DHCP
ubuntu@node-01:~$ cat /run/systemd/resolve/resolv.conf
# This is /run/systemd/resolve/resolv.conf managed by man:systemd-resolved(8).
# Do not edit.
#
# This file might be symlinked as /etc/resolv.conf. If you're looking at
# /etc/resolv.conf and seeing this text, you have followed the symlink.
#
# This is a dynamic resolv.conf file for connecting local clients directly to
# all known uplink DNS servers. This file lists all configured search domains.
#
# Third party programs should typically not access this file directly, but only
# through the symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a
# different way, replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 10.43.0.10
search lab-c7a83719-938d-4836-9071-7efa3a790d23--978000.svc.cluster.local svc.cluster.local cluster.local
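The two files above list different resolvers: /etc/resolv.conf sends local clients to the systemd-resolved stub at 127.0.0.53, which forwards queries to the uplink server listed in /run/systemd/resolve/resolv.conf (here 10.43.0.10, learned via DHCP). The helper below is a hypothetical convenience, not part of Virtink or systemd; it assumes a POSIX shell with awk and paste, and simply prints the nameserver entries of each file so the stub and the uplink can be compared at a glance.

```shell
#!/bin/sh
# Hypothetical helper: print the nameserver entries of a resolv.conf-style file.
# Comment lines start with "#", so only lines whose first field is
# "nameserver" are matched.
list_nameservers() {
  awk '$1 == "nameserver" { print $2 }' "$1"
}

# Print the stub file and the uplink file written by systemd-resolved
# side by side (files that are missing or unreadable are skipped).
compare_resolvers() {
  for f in /etc/resolv.conf /run/systemd/resolve/resolv.conf; do
    [ -r "$f" ] || continue
    printf '%s: %s\n' "$f" "$(list_nameservers "$f" | paste -s -d ' ' -)"
  done
}
```

Running `compare_resolvers` inside the VM would be expected to print 127.0.0.53 for the first file and 10.43.0.10 for the second, matching the output above; `resolvectl status` shows the same uplink information in more detail.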

Other pods can access the internet with the same CoreDNS configuration:

kubectl run netshoot --image=nicolaka/netshoot --rm -it --restart=Never --command -- dig +short google.com
216.58.211.238
pod "netshoot" deleted

With this configuration the VM is created, but it has no internet access because DNS resolution is not working. The DNS server obtained via DHCP is 10.43.0.10, the kube-dns service IP.

I get timeout errors whenever I try to resolve a domain name from the VM. Note that I am able to SSH into the VM, because it is connected to my pod network.

Note: if I add 8.8.8.8 to /etc/resolv.conf, I am able to access the internet.
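Because resolution works once 8.8.8.8 is added, one way to narrow the problem down is to send the same query to each candidate resolver directly instead of going through /etc/resolv.conf. The sketch below assumes `dig` is installed in the VM (it is used in the logs above) and relies only on its standard `@server`, `+short`, `+time` and `+tries` options; `probe_dns` and `compare_dns` are hypothetical names.

```shell
#!/bin/sh
# Hypothetical helper (not part of Virtink): send one query for NAME straight
# to SERVER, bypassing /etc/resolv.conf, and print OK or FAIL.
probe_dns() {
  server=$1
  name=$2
  # +time=2 +tries=1 keeps an unreachable resolver from blocking for long;
  # grep -v drops ";;" diagnostics such as "communications error ... timed out".
  answer=$(dig +short +time=2 +tries=1 "@$server" "$name" 2>/dev/null \
    | grep -v '^;;' | head -n 1)
  if [ -n "$answer" ]; then
    echo "OK   $server $answer"
  else
    echo "FAIL $server"
  fi
}

# Query the systemd-resolved stub, the kube-dns service IP handed out via
# DHCP, and the public resolver that worked when added to /etc/resolv.conf.
compare_dns() {
  for s in 127.0.0.53 10.43.0.10 8.8.8.8; do
    probe_dns "$s" "${1:-google.com}"
  done
}
```

If `compare_dns` prints FAIL for both 127.0.0.53 and 10.43.0.10 but OK for 8.8.8.8, the stub itself is probably fine and the problem lies on the path from the VM to kube-dns.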

fengye87 commented 1 month ago

Please confirm:

  1. You can access the 10.43.0.10 IP from the VM
  2. You can access the internet from canonical Pods
  3. You can access the internet from the VM Pod

sharadregoti commented 1 month ago

You can access the 10.43.0.10 IP from the VM: No.

ubuntu@node-01:~$ nc -zv 10.43.0.10 53
nc: connect to 10.43.0.10 port 53 (tcp) failed: Connection refused
ubuntu@node-01:~$
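`nc -zv` only tests TCP, while ordinary DNS lookups go over UDP port 53 and fall back to TCP mainly when a response is too large, so a refused TCP connection does not by itself show whether UDP queries reach kube-dns. A more representative check is to query 10.43.0.10 with `dig` over each transport. This is a sketch with a hypothetical function name; `+notcp` and `+tcp` are standard dig options, and the default name `kubernetes.default.svc.cluster.local` assumes the cluster domain is `cluster.local`, as the search lines above suggest.

```shell
#!/bin/sh
# Hypothetical helper: query SERVER for NAME over UDP, then over TCP.
# An in-cluster name is used by default so only kube-dns itself is tested,
# not its upstream forwarding.
probe_transports() {
  server=$1
  name=${2:-kubernetes.default.svc.cluster.local}
  for proto in udp tcp; do
    # +notcp forces UDP (dig's default); +tcp forces TCP, like the nc test.
    if [ "$proto" = tcp ]; then flag=+tcp; else flag=+notcp; fi
    answer=$(dig "$flag" +short +time=2 +tries=1 "@$server" "$name" 2>/dev/null \
      | grep -v '^;;' | head -n 1)
    if [ -n "$answer" ]; then
      echo "$proto OK $answer"
    else
      echo "$proto FAIL"
    fi
  done
}
```

Usage inside the VM: `probe_transports 10.43.0.10`. A UDP failure alongside the TCP refusal would point at the VM's path to the service IP rather than at the choice of test tool.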

You can access the internet from canonical Pods: Yes.

  1. Using netshoot
    kubectl run netshoot --image=nicolaka/netshoot --rm -it --restart=Never --command -- dig +short google.com
    216.58.211.238
    pod "netshoot" deleted
  2. Using the SSH example from the Virtink README
    export VM_NAME=ubuntu-container-rootfs
    export VM_POD_NAME=$(kubectl get vm $VM_NAME -o jsonpath='{.status.vmPodName}')
    export VM_IP=$(kubectl get pod $VM_POD_NAME -o jsonpath='{.status.podIP}')
    kubectl run ssh-$VM_NAME --rm --image=alpine --restart=Never -it -- /bin/sh -c "apk add openssh-client && ssh ubuntu@$VM_IP"
    ---
    (kubectl exec ... commands)
    apk add curl
    / # curl google.com
    <HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
    <TITLE>301 Moved</TITLE></HEAD><BODY>
    <H1>301 Moved</H1>
    The document has moved
    <A HREF="http://www.google.com/">here</A>.
    </BODY></HTML>
    / #

You can access the internet from the VM Pod: No.

  1. Tried with my custom images as well as the image from the example in README.md:
    / # vi /etc/resolv.conf
    / # cat /etc/resolv.conf
    nameserver 10.43.0.10
    search default.svc.cluster.local svc.cluster.local cluster.local
    options ndots:5
    / # dig
    sh: dig: not found
    / # curl google.com
    curl: (6) Could not resolve host: google.com
    / #

The strange thing is that with the README example, the VM Pod does not resolve google.com, but when I SSH into the VM (using the README commands), I do get internet access.

ubuntu@ubuntu-container-rootfs:~$ curl google.com
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>
ubuntu@ubuntu-container-rootfs:~$

Another strange behaviour: as stated above, the VM Pod for my custom image cannot resolve internet hostnames, which matches the README example. But unlike the README example, the VM booted from my custom CDI image (Ubuntu 22.04) cannot resolve them either.

I do observe that for just 15-20 seconds after startup the VM has internet access; then it suddenly stops working.

ubuntu@node-01:~$ curl google.com
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>
ubuntu@node-01:~$ curl google.com
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>
ubuntu@node-01:~$ curl google.com
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>
ubuntu@node-01:~$ curl google.com
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>
ubuntu@node-01:~$ curl google.com
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>
ubuntu@node-01:~$ curl google.com
<HTML><HEAD><meta http-equiv="content-type" content="text/html;charset=utf-8">
<TITLE>301 Moved</TITLE></HEAD><BODY>
<H1>301 Moved</H1>
The document has moved
<A HREF="http://www.google.com/">here</A>.
</BODY></HTML>
ubuntu@node-01:~$ curl google.com
curl: (6) Could not resolve host: google.com
ubuntu@node-01:~$ curl google.com
curl: (6) Could not resolve host: google.com
ubuntu@node-01:~$
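Rather than rerunning curl by hand, the moment resolution stops after boot can be captured with a timestamped loop. A minimal sketch, assuming the guest has `getent` (part of glibc on Ubuntu) and `date`; `watch_dns` is a hypothetical name, and using getent instead of curl is a substitution that tests name resolution only, not HTTP connectivity.

```shell
#!/bin/sh
# Hypothetical helper: try to resolve NAME COUNT times, INTERVAL seconds
# apart, printing a timestamped OK/FAIL line per attempt.
watch_dns() {
  name=${1:-google.com}
  count=${2:-30}
  interval=${3:-2}
  i=0
  while [ "$i" -lt "$count" ]; do
    # getent goes through the system resolver configuration (nsswitch.conf
    # and /etc/resolv.conf), much like getaddrinfo-based tools such as curl.
    if getent hosts "$name" >/dev/null 2>&1; then
      status=OK
    else
      status=FAIL
    fi
    printf '%s %s %s\n' "$(date +%H:%M:%S)" "$status" "$name"
    i=$((i + 1))
    # Skip the sleep after the last attempt.
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
  done
}
```

Starting `watch_dns google.com 60 2` right after boot gives the timestamp of the first FAIL line, which can then be compared with the boot log (`journalctl -b`), for example to see whether a network reconfiguration happens at the same moment.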

You can try reproducing the behaviour with the spec below:

apiVersion: virt.virtink.smartx.com/v1alpha1
kind: VirtualMachine
metadata:
  labels:
    app.kubernetes.io/managed-by: Helm
    virtlink.io/os: linux
    virtlink.io/vm: node-01
  name: node-01
spec:
  instance:
    cpu:
      coresPerSocket: 1
      sockets: 1
    disks:
    - name: image
    - name: cloud-init
    interfaces:
    - bridge: {}
      name: pod
    memory:
      size: 2Gi
  networks:
  - name: pod
    pod: {}
  resources: {}
  runPolicy: Once
  volumes:
  - containerRootfs:
      image: sharadregoti/vm-kubernetes:ubuntu-jammy-v1.29-v0.1.0
      size: 10Gi
    name: image
  - cloudInit:
      userData: |-
        #cloud-config
        hostname: node-01
        password: ubuntu
        chpasswd: { expire: False }
        package_update: true
        package_upgrade: true
    name: cloud-init
fengye87 commented 1 month ago

I'm getting a little bit confused here.

The strange thing is that with the README example, the VM Pod does not resolve google.com, but when I SSH into the VM (using the README commands), I do get internet access.

Are you saying that you can access the internet from the official example VM? If so, it could mean there's something wrong with your custom image, maybe something related to its network configuration or DHCP.

From the spec you shared, you are using the Pod network, right? If so, it's expected that you cannot access the internet from the VM Pod, since Virtink moves the VM Pod's MAC and IP addresses to the VM.
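If Virtink moves the VM Pod's IP and MAC to the VM, the IP reported in the Pod status should appear on an interface inside the guest rather than being usable from a shell in the Pod. One way to check, sketched under that assumption: read the Pod IP with the `kubectl get pod ... -o jsonpath='{.status.podIP}'` command from the README example, then look for it in `ip -4 -o addr show` output inside the VM. `has_ipv4` is a hypothetical helper that reads that output from stdin, so it can be tried without a VM.

```shell
#!/bin/sh
# Hypothetical helper: succeed if IP is assigned to some interface, reading
# one-line `ip -4 -o addr show` output from stdin, e.g.
#   2: enp1s0    inet 10.42.0.15/24 brd 10.42.0.255 scope global enp1s0 ...
has_ipv4() {
  want=$1
  # Field 4 is ADDRESS/PREFIX; strip the prefix length before comparing.
  awk -v want="$want" '
    $3 == "inet" { split($4, a, "/"); if (a[1] == want) found = 1 }
    END { exit (found ? 0 : 1) }'
}
```

For example, inside the VM: `ip -4 -o addr show | has_ipv4 10.42.0.15 && echo "VM owns the Pod IP"`, substituting the Pod IP printed by kubectl. 10.42.0.15 is an illustrative address, not one taken from this issue.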

sharadregoti commented 1 month ago

The issue was with my custom VM image. After trying a new approach, it worked.