skupperproject / skupper

Skupper is an implementation of a Virtual Application Network, enabling rich hybrid cloud communication.
http://skupper.io
Apache License 2.0

Statefulset DNS resolution fails when exposing a service #1772

Open albacanete opened 2 weeks ago

albacanete commented 2 weeks ago

Describe the bug I do not understand how the DNS resolution works between two clusters that execute statefulsets. When using a single cluster, I can access a Pod through its name (deploying a headless svc), but cannot do the same when using Skupper. Am I missing something?

How To Reproduce

  1. Two clusters were created using kubeadm v1.29.5, containerd as the CRI and Flannel as the CNI.

     *Edge cluster*

     ```
     acanete@rpi42:~$ kubectl get nodes
     NAME    STATUS   ROLES           AGE   VERSION
     agx14   Ready                    19d   v1.29.5
     agx15   Ready                    19d   v1.29.5
     rpi42   Ready    control-plane   19d   v1.29.5
     ```

     *HPC cluster*

     ```
     acanete@nano1:~$ kubectl get nodes
     NAME          STATUS   ROLES           AGE   VERSION
     nano1         Ready    control-plane   19d   v1.29.5
     workstation   Ready                    19d   v1.29.5
     ```
  2. Create a namespace with the same name in each cluster.

     *Edge cluster*

     ```
     kubectl create ns compss
     ```

     *HPC cluster*

     ```
     kubectl create ns compss
     ```
  3. Deploy Skupper in each namespace. Since the deployment is on-prem with private IP addresses, NodePort ingress is used.

     *Edge cluster* (192.168.50.15 is the IP address of the agx14 node):

     ```
     skupper init -n compss --ingress nodeport --ingress-host 192.168.50.15
     ```

     *HPC cluster* (192.168.50.61 is the IP address of the workstation node):

     ```
     skupper init -n compss --ingress nodeport --ingress-host 192.168.50.61
     ```
  4. Link the namespaces.

     *Edge cluster*

     ```
     skupper -n compss token create edge.token
     ```

     *HPC cluster* (the edge.token file was copied to a machine on the HPC cluster):

     ```
     skupper -n compss link create edge.token
     ```

     Output:

     ```
     acanete@rpi42:~$ skupper -n compss link status

     Links created from this site:
     There are no links configured or connected

     Current links from other sites that are connected:
     Incoming link from site 472fdc04-1406-4281-bbf9-81f5e5ad3737 on namespace compss
     ```

     ```
     acanete@nano1:~$ skupper -n compss link status

     Links created from this site:
     Link link1 is connected

     Current links from other sites that are connected:
     There are no connected links
     ```
  5. Deploy test applications in both clusters.

     The YAML file for the StatefulSet that runs in the edge cluster is:

     ```yaml
     apiVersion: v1
     kind: Service
     metadata:
       name: compss-matmul-4fc9d6
       namespace: compss
     spec:
       clusterIP: None  # This makes it a headless service
       selector:
         app: compss
         wf_id: compss-matmul-4fc9d6
       ports:
         - name: port-22
           protocol: TCP
           port: 22
           targetPort: ssh-port
     ---
     apiVersion: apps/v1
     kind: StatefulSet
     metadata:
       name: compss-matmul-4fc9d6-worker
       namespace: compss
     spec:
       selector:
         matchLabels:
           app: compss
           wf_id: compss-matmul-4fc9d6
           pod-hostname: worker
       serviceName: compss-matmul-4fc9d6
       replicas: 2
       ordinals:
         start: 2
       template:
         metadata:
           labels:
             app: compss
             wf_id: compss-matmul-4fc9d6
             pod-hostname: worker
         spec:
           subdomain: compss-matmul-4fc9d6
           dnsConfig:
             searches:
               - compss-matmul-4fc9d6.compss.svc.cluster.local
           containers:
             - name: worker
               image: albabsc/compss-matmul:verge-0.1.8
               command: [ "/usr/sbin/sshd", "-D" ]
               resources:
                 limits:
                   memory: 2G
                   cpu: 4
               ports:
                 - containerPort: 22
                   name: ssh-port
     ```

     The YAML file for the StatefulSet that runs in the HPC cluster is identical, except that its StatefulSet spec has no `ordinals` block.
  6. Ensure connection among pods of the same cluster.

     *Edge cluster*

     ```
     acanete@nano1:~$ kubectl -n compss get pods
     NAME                                          READY   STATUS    RESTARTS   AGE
     compss-matmul-4fc9d6-worker-0                 1/1     Running   0          4m19s
     compss-matmul-4fc9d6-worker-1                 1/1     Running   0          4m18s
     skupper-router-6895bb6f95-88hnj               2/2     Running   0          55m
     skupper-service-controller-559ddbdd56-wnsvh   1/1     Running   0          55m
     ```

     ```
     acanete@nano1:~$ kubectl -n compss exec -it compss-matmul-4fc9d6-worker-0 -- bash
     root@compss-matmul-4fc9d6-worker-0:/# ssh compss-matmul-4fc9d6-worker-1
     Welcome to Ubuntu 20.04.6 LTS (GNU/Linux 6.8.0-47-generic x86_64)

      * Documentation:  https://help.ubuntu.com
      * Management:     https://landscape.canonical.com
      * Support:        https://ubuntu.com/pro

     This system has been minimized by removing packages and content that are
     not required on a system that users do not log into.

     To restore this content, you can run the 'unminimize' command.
     Last login: Wed Nov 6 12:33:29 2024 from 10.244.1.160
     root@compss-matmul-4fc9d6-worker-1:~#
     ```

     *HPC cluster*

     ```
     acanete@rpi42:~$ kubectl -n compss get pods
     NAME                                          READY   STATUS    RESTARTS   AGE
     compss-matmul-4fc9d6-worker-2                 1/1     Running   0          4m48s
     compss-matmul-4fc9d6-worker-3                 1/1     Running   0          4m47s
     skupper-router-748c487879-gvpxg               2/2     Running   0          56m
     skupper-service-controller-6f69b974bd-grgzc   1/1     Running   0          56m
     ```

     ```
     acanete@rpi42:~$ kubectl -n compss exec -ti compss-matmul-4fc9d6-worker-2 -- bash
     root@compss-matmul-4fc9d6-worker-2:/# ssh compss-matmul-4fc9d6-worker-3
     Welcome to Ubuntu 20.04.6 LTS (GNU/Linux 5.10.192-tegra aarch64)

      * Documentation:  https://help.ubuntu.com
      * Management:     https://landscape.canonical.com
      * Support:        https://ubuntu.com/pro

     This system has been minimized by removing packages and content that are
     not required on a system that users do not log into.

     To restore this content, you can run the 'unminimize' command.
     Last login: Wed Nov 6 12:49:41 2024 from 10.244.2.97
     root@compss-matmul-4fc9d6-worker-3:~#
     ```
  7. Expose the service with Skupper. Command executed in the *edge cluster*:

     ```
     skupper -n compss expose service compss-matmul-4fc9d6 --port 22 --address compss-matmul-4fc9d6
     ```

     Check that the service is correctly created:

     ```
     acanete@rpi42:~$ skupper -n compss service status
     Services exposed through Skupper:
     ╰─ compss-matmul-4fc9d6:22 (tcp)
     ```

     ```
     acanete@nano1:~$ skupper -n compss service status
     Services exposed through Skupper:
     ╰─ compss-matmul-4fc9d6:22 (tcp)
     ```
  8. DNS resolution no longer works. When trying to ssh between two pods of the same cluster (e.g. edge):

     ```
     acanete@nano1:~$ kubectl -n compss get pods
     NAME                                          READY   STATUS    RESTARTS   AGE
     compss-matmul-4fc9d6-worker-0                 1/1     Running   0          4m19s
     compss-matmul-4fc9d6-worker-1                 1/1     Running   0          4m18s
     skupper-router-6895bb6f95-88hnj               2/2     Running   0          55m
     skupper-service-controller-559ddbdd56-wnsvh   1/1     Running   0          55m
     ```

     ```
     acanete@nano1:~$ kubectl -n compss exec -it compss-matmul-4fc9d6-worker-0 -- bash
     root@compss-matmul-4fc9d6-worker-0:/# ssh compss-matmul-4fc9d6-worker-1
     ssh: Could not resolve hostname compss-matmul-4fc9d6-worker-1: No address associated with hostname
     root@compss-matmul-4fc9d6-worker-0:/# ssh compss-matmul-4fc9d6-worker-2
     ssh: Could not resolve hostname compss-matmul-4fc9d6-worker-2: No address associated with hostname
     root@compss-matmul-4fc9d6-worker-0:/#
     ```
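For context, the per-pod names used in step 6 resolve only because a headless Service publishes one DNS A record per backing pod. The following plain-Python sketch is an illustration of that Kubernetes DNS behavior (not Skupper code; pod names and IPs are taken from the outputs above, the ClusterIP is made up):

```python
# Conceptual sketch: DNS records published for a Kubernetes Service,
# illustrating why per-pod names only exist for a headless Service
# (clusterIP: None). Not Skupper's implementation.

def service_dns_records(name, namespace, cluster_ip, pods):
    """Return a {fqdn: address} map roughly as CoreDNS would publish it."""
    base = f"{name}.{namespace}.svc.cluster.local"
    if cluster_ip is None:  # headless: one A record per backing pod
        records = {f"{pod}.{base}": ip for pod, ip in pods.items()}
        records[base] = list(pods.values())  # service name -> all pod IPs
    else:  # normal service: a single virtual IP, no per-pod records
        records = {base: cluster_ip}
    return records

pods = {"compss-matmul-4fc9d6-worker-0": "10.244.1.160",
        "compss-matmul-4fc9d6-worker-1": "10.244.1.161"}

headless = service_dns_records("compss-matmul-4fc9d6", "compss", None, pods)
assert ("compss-matmul-4fc9d6-worker-0."
        "compss-matmul-4fc9d6.compss.svc.cluster.local") in headless

# A hypothetical non-headless service: per-pod records disappear.
normal = service_dns_records("compss-matmul-4fc9d6", "compss", "10.96.1.2", pods)
assert all("worker" not in fqdn for fqdn in normal)
```

If the headless Service's per-pod records are no longer published after the expose, short pod names stop resolving exactly as shown in step 8.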

Expected behavior I would like every Pod of a StatefulSet to be reachable through its Pod name, or to know which name I have to use, and whether the name is different when a Pod in cluster 1 wants to access a Pod in cluster 2.

Environment details

  • Skupper CLI: 1.8.1
  • Skupper Operator (if applicable): none
  • Platform: kubernetes

Additional context Pods have the following `/etc/resolv.conf` file:

```
search compss.svc.cluster.local svc.cluster.local cluster.local lan compss-matmul-4fc9d6.compss.svc.cluster.local
nameserver 10.96.0.10
options ndots:5
```
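This resolv.conf is why the bare pod name worked before Skupper was involved: with `ndots:5`, any name with fewer than five dots is first tried with each search domain appended, including the headless-service domain added via `dnsConfig.searches`. A simplified Python model of the resolver's candidate-name expansion (an illustration, not glibc's exact algorithm):

```python
# Simplified model of resolv.conf "search" + "ndots" expansion, using the
# search list from the pod's /etc/resolv.conf shown above.

SEARCH = ["compss.svc.cluster.local", "svc.cluster.local",
          "cluster.local", "lan",
          "compss-matmul-4fc9d6.compss.svc.cluster.local"]
NDOTS = 5

def candidate_names(name, search=SEARCH, ndots=NDOTS):
    """Return the names the resolver would try, in order (simplified)."""
    if name.count(".") >= ndots or name.endswith("."):
        return [name.rstrip(".")]
    # Fewer than ndots dots: try each search domain first, then the bare name.
    return [f"{name}.{domain}" for domain in search] + [name]

cands = candidate_names("compss-matmul-4fc9d6-worker-1")
# One candidate is exactly the per-pod record that the headless Service
# publishes, which is why the short name resolved in a single cluster:
assert ("compss-matmul-4fc9d6-worker-1."
        "compss-matmul-4fc9d6.compss.svc.cluster.local") in cands
```

So the short name only resolves as long as some search-domain expansion matches a record that actually exists in cluster DNS.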
fgiorgetti commented 2 weeks ago

Hello Alba,

You're exposing a service, but if your intention is to have access to each pod by name directly, I would recommend adding the `--headless` flag to the `skupper expose` command and, instead of exposing the `service`, exposing the `statefulset` workload.

Thank you,


albacanete commented 2 weeks ago

Hello @fgiorgetti, thanks for the quick answer :)

I have also tried it and have not been able to make it work. What I have done is:

  1. Unexpose the service with `skupper -n compss unexpose service compss-matmul-4fc9d6 --address compss-matmul-4fc9d6` and check:

     ```
     acanete@rpi42:~$ skupper -n compss service status
     No services defined
     ```
  2. Exposed the statefulset in the edge cluster by executing `skupper -n compss expose statefulset compss-matmul-4fc9d6-worker --headless --port 22`. Now two new proxy pods and a svc are created:

     ```
     acanete@rpi42:~$ kubectl -n compss get pods
     NAME                                          READY   STATUS    RESTARTS   AGE
     compss-matmul-4fc9d6-proxy-0                  1/1     Running   0          7m49s
     compss-matmul-4fc9d6-proxy-1                  1/1     Running   0          7m46s
     compss-matmul-4fc9d6-worker-2                 1/1     Running   0          57m
     compss-matmul-4fc9d6-worker-3                 1/1     Running   0          57m
     skupper-router-748c487879-gvpxg               2/2     Running   0          108m
     skupper-service-controller-6f69b974bd-grgzc   1/1     Running   0          108m
     ```

     ```
     acanete@rpi42:~$ kubectl -n compss get svc
     NAME                         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                                          AGE
     compss-matmul-4fc9d6         ClusterIP   None            <none>        22/TCP                                           55m
     compss-matmul-4fc9d6-proxy   ClusterIP   None            <none>        22/TCP                                           5m23s
     skupper-router               NodePort    10.96.210.62    <none>        55671:30581/TCP,45671:30524/TCP,8081:32381/TCP   106m
     skupper-router-local         ClusterIP   10.98.178.162   <none>        5671/TCP                                         106m
     ```
  3. Tried to connect to a worker in the edge cluster from a worker in the HPC cluster:

     ```
     acanete@nano1:~$ kubectl -n compss exec -it compss-matmul-4fc9d6-worker-0 -- bash
     root@compss-matmul-4fc9d6-worker-0:/# ssh compss-matmul-4fc9d6-worker-2
     ssh: Could not resolve hostname compss-matmul-4fc9d6-worker-2: No address associated with hostname
     ```

     Also tried connecting to the newly created proxy pods:

     ```
     root@compss-matmul-4fc9d6-worker-0:/# ssh compss-matmul-4fc9d6-proxy-0
     ssh: Could not resolve hostname compss-matmul-4fc9d6-proxy-0: No address associated with hostname
     ```
fgiorgetti commented 2 weeks ago

Since you're deploying the same statefulset and service on both namespaces, could you try to modify their names in one of the clusters, possibly just changing the suffix in one of them?

This way, skupper will basically create different statefulset proxies and headless services, and will avoid name clashes with the generated resources on each cluster/namespace.

Suppose you modify the suffix in one of your clusters from 4fc9d6 to 4fc9d7; then you should be able to reach your distinct pods using the following names:

```
compss-matmul-4fc9d6-worker-0.compss-matmul-4fc9d6
compss-matmul-4fc9d6-worker-1.compss-matmul-4fc9d6
compss-matmul-4fc9d7-worker-0.compss-matmul-4fc9d7
compss-matmul-4fc9d7-worker-1.compss-matmul-4fc9d7
```

Basically on the remote namespaces, Skupper will create a statefulset and a headless service that have the same name (from the originally exposed statefulset) on the other cluster/namespace. So if your statefulsets and headless services have the same names on both sides, I believe it won't work as expected.
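The clash described above can be sketched in plain Python (a conceptual illustration, not Skupper's actual controller logic): the Skupper-generated headless Service on the remote side needs the same name as the originally exposed one, so a user-deployed Service already owning that name conflicts, while a distinct suffix avoids the collision:

```python
# Conceptual sketch of the name clash: a "cluster" is just a set of
# (kind, name) resources, and creating a duplicate name fails.

def create(cluster, kind, name):
    key = (kind, name)
    if key in cluster:
        raise ValueError(f"{kind} {name!r} already exists")
    cluster.add(key)

hpc = set()
create(hpc, "Service", "compss-matmul-4fc9d6")  # user-deployed headless svc

# Skupper exposing the edge statefulset would generate a Service with the
# same name on the HPC side -> clash with the user-deployed one:
try:
    create(hpc, "Service", "compss-matmul-4fc9d6")
    clashed = False
except ValueError:
    clashed = True
assert clashed

# After renaming the edge resources with a distinct suffix, no clash:
create(hpc, "Service", "compss-matmul-4fc9d61")
```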


albacanete commented 1 week ago

Hello @fgiorgetti, I have modified the YAML files; they are now:

Edge cluster

```yaml
apiVersion: v1
kind: Service
metadata:
  name: compss-matmul-4fc9d61
  namespace: compss
spec:
  clusterIP: None  # This makes it a headless service
  selector:
    app: compss
    wf_id: compss-matmul-4fc9d61
  ports:
    - name: port-22
      protocol: TCP
      port: 22
      targetPort: ssh-port
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: compss-matmul-4fc9d61-worker
  namespace: compss
spec:
  selector:
    matchLabels:
      app: compss
      wf_id: compss-matmul-4fc9d61
      pod-hostname: worker
  serviceName: compss-matmul-4fc9d61
  replicas: 2
  ordinals:
    start: 2
  template:
    metadata:
      labels:
        app: compss
        wf_id: compss-matmul-4fc9d61
        pod-hostname: worker
    spec:
      subdomain: compss-matmul-4fc9d61
      dnsConfig:
        searches:
          - compss-matmul-4fc9d61.compss.svc.cluster.local
      containers:
        - name: worker
          image: albabsc/compss-matmul:verge-0.1.8
          command: [ "/usr/sbin/sshd", "-D" ]
          resources:
            limits:
              memory: 2G
              cpu: 4
          ports:
            - containerPort: 22
              name: ssh-port
```
HPC cluster

```yaml
apiVersion: v1
kind: Service
metadata:
  name: compss-matmul-4fc9d6
  namespace: compss
spec:
  clusterIP: None  # This makes it a headless service
  selector:
    app: compss
    wf_id: compss-matmul-4fc9d6
  ports:
    - name: port-22
      protocol: TCP
      port: 22
      targetPort: ssh-port
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: compss-matmul-4fc9d6-worker
  namespace: compss
spec:
  selector:
    matchLabels:
      app: compss
      wf_id: compss-matmul-4fc9d6
      pod-hostname: worker
  serviceName: compss-matmul-4fc9d6
  replicas: 2
  template:
    metadata:
      labels:
        app: compss
        wf_id: compss-matmul-4fc9d6
        pod-hostname: worker
    spec:
      subdomain: compss-matmul-4fc9d6
      dnsConfig:
        searches:
          - compss-matmul-4fc9d6.compss.svc.cluster.local
      containers:
        - name: worker
          image: albabsc/compss-matmul:verge-0.1.8
          command: [ "/usr/sbin/sshd", "-D" ]
          resources:
            limits:
              memory: 2G
              cpu: 4
          ports:
            - containerPort: 22
              name: ssh-port
```

I have deployed both YAMLs and executed the following command on the edge cluster:

```
acanete@rpi42:~$ skupper -n compss expose statefulset compss-matmul-4fc9d61-worker --headless --port 22
```

When the statefulset in the edge cluster gets exposed, the new pods appear in the HPC cluster

```
acanete@nano1:~$ kubectl -n compss get pods
NAME                                          READY   STATUS    RESTARTS   AGE
compss-matmul-4fc9d6-worker-0                 1/1     Running   0          30s
compss-matmul-4fc9d6-worker-1                 1/1     Running   0          29s
compss-matmul-4fc9d61-worker-0                1/1     Running   0          9s
compss-matmul-4fc9d61-worker-1                1/1     Running   0          7s
skupper-router-f88bff6f9-4mskr                2/2     Running   0          98s
skupper-service-controller-655bf9fbf8-8gdln   1/1     Running   0          98s
```

Now the DNS resolution is OK, but ssh fails to create the connection. Do you know if it has to do with the implementation of Skupper's security? The docker image has the ssh keys inside and I can ssh to pods in the same cluster.

Pod in a different cluster:

```
acanete@nano1:~$ kubectl -n compss exec -ti compss-matmul-4fc9d6-worker-0 -- bash
root@compss-matmul-4fc9d6-worker-0:/# nslookup compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61
;; Got recursion not available from 10.96.0.10
Server:     10.96.0.10
Address:    10.96.0.10#53

Name:   compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61.compss.svc.cluster.local
Address: 10.244.1.184
;; Got recursion not available from 10.96.0.10

root@compss-matmul-4fc9d6-worker-0:/# ssh compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61
ssh: connect to host compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61 port 22: Connection refused
```

Pod in the same cluster:

```
acanete@nano1:~$ kubectl -n compss exec -ti compss-matmul-4fc9d6-worker-0 -- bash
root@compss-matmul-4fc9d6-worker-0:/# ssh compss-matmul-4fc9d6-worker-1
Welcome to Ubuntu 20.04.6 LTS (GNU/Linux 6.8.0-47-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/pro

This system has been minimized by removing packages and content that are
not required on a system that users do not log into.

To restore this content, you can run the 'unminimize' command.
Last login: Sat Nov  9 16:39:10 2024 from 10.244.1.182
```
fgiorgetti commented 1 week ago

Hello Alba,

Looking at your statefulset, I noticed it has the following specification:

```yaml
ordinals:
  start: 2
```

Do you really need to set the start index for your worker pods?

If you remove it, I believe it should work for you, as the remote proxy pods created by Skupper will have the appropriate names and the local proxy pods (on the same cluster and namespace of your exposed statefulset) will target the correct local pods as well.

Otherwise, the worker pods are created as compss-matmul-4fc9d61-worker-2 and compss-matmul-4fc9d61-worker-3, which is currently not supported as proxy pods won't work properly.

In case you can remove the `ordinals.start` definition, then you should be able to access your pods as `<pod-name>.<service-name>`, with `service-name` being the value of `spec.serviceName` from your statefulset.
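The `<pod-name>.<service-name>` form above is a short name that the pod's search domains expand to the cluster-local FQDN. A small, hypothetical helper illustrating the naming (names taken from this thread):

```python
# Hypothetical helper: build the short name and the fully-qualified name
# used to reach a statefulset pod through its headless service.

def pod_address(pod, service, namespace="compss"):
    """Return (<pod>.<service>, full cluster-local FQDN)."""
    short = f"{pod}.{service}"
    fqdn = f"{short}.{namespace}.svc.cluster.local"
    return short, fqdn

short, fqdn = pod_address("compss-matmul-4fc9d61-worker-0",
                          "compss-matmul-4fc9d61")
assert short == "compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61"
```

From a worker pod one would then run, e.g., `ssh` against the short name, as the later comments in this thread do.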


albacanete commented 2 days ago

Hello @fgiorgetti, I deployed it as you mentioned, but I still get connection refused:

```
acanete@nano1:~$ kubectl -n compss exec -ti compss-matmul-4fc9d6-worker-0 -- bash
root@compss-matmul-4fc9d6-worker-0:/# ssh compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61
ssh: connect to host compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61 port 22: Connection refused
root@compss-matmul-4fc9d6-worker-0:/# ssh compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61 -vvv
OpenSSH_8.9p1 Ubuntu-3ubuntu0.10, OpenSSL 3.0.2 15 Mar 2022
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: include /etc/ssh/ssh_config.d/*.conf matched no files
debug1: /etc/ssh/ssh_config line 21: Applying options for *
debug3: expanded UserKnownHostsFile '~/.ssh/known_hosts' -> '/root/.ssh/known_hosts'
debug3: expanded UserKnownHostsFile '~/.ssh/known_hosts2' -> '/root/.ssh/known_hosts2'
debug2: resolving "compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61" port 22
debug3: resolve_host: lookup compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61:22
debug3: ssh_connect_direct: entering
debug1: Connecting to compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61 [10.244.1.221] port 22.
debug3: set_sock_tos: set socket 3 IP_TOS 0x10
debug1: connect to address 10.244.1.221 port 22: Connection refused
ssh: connect to host compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61 port 22: Connection refused
```

Pods

```
acanete@nano1:~$ kubectl -n compss get pods 
NAME                                          READY   STATUS    RESTARTS   AGE
compss-matmul-4fc9d6-worker-0                 1/1     Running   0          17m
compss-matmul-4fc9d6-worker-1                 1/1     Running   0          15m
compss-matmul-4fc9d61-worker-0                1/1     Running   0          7m47s
compss-matmul-4fc9d61-worker-1                1/1     Running   0          7m45s
skupper-router-6d4f86ff78-mws6g               2/2     Running   0          18m
skupper-service-controller-797f97b858-dg4rp   1/1     Running   0          18m
```
albacanete commented 2 days ago

Hello again @fgiorgetti :)

With further debugging I have realized that the IP of the Pod and the IP resolved by the DNS with Skupper are different. IP of the Pod: 10.244.1.131:

```
root@compss-matmul-4fc9d61-worker-0:/# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0@if95: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default 
    link/ether da:7d:6c:7d:46:e2 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.244.1.131/24 brd 10.244.1.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::d87d:6cff:fe7d:46e2/64 scope link 
       valid_lft forever preferred_lft forever
```

IP resolved by DNS: 10.244.1.227. With ping:

```
root@compss-matmul-4fc9d6-worker-0:/# ping compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61
PING compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61.compss.svc.cluster.local (10.244.1.227) 56(84) bytes of data.
64 bytes from compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61.compss.svc.cluster.local (10.244.1.227): icmp_seq=1 ttl=64 time=0.020 ms
64 bytes from compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61.compss.svc.cluster.local (10.244.1.227): icmp_seq=2 ttl=64 time=0.029 ms
64 bytes from compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61.compss.svc.cluster.local (10.244.1.227): icmp_seq=3 ttl=64 time=0.028 ms
64 bytes from compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61.compss.svc.cluster.local (10.244.1.227): icmp_seq=4 ttl=64 time=0.055 ms
64 bytes from compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61.compss.svc.cluster.local (10.244.1.227): icmp_seq=5 ttl=64 time=0.030 ms
```

with ssh:

```
root@compss-matmul-4fc9d6-worker-0:/# ssh -vvv compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61
OpenSSH_8.9p1 Ubuntu-3ubuntu0.10, OpenSSL 3.0.2 15 Mar 2022
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: include /etc/ssh/ssh_config.d/*.conf matched no files
debug1: /etc/ssh/ssh_config line 21: Applying options for *
debug3: expanded UserKnownHostsFile '~/.ssh/known_hosts' -> '/root/.ssh/known_hosts'
debug3: expanded UserKnownHostsFile '~/.ssh/known_hosts2' -> '/root/.ssh/known_hosts2'
debug2: resolving "compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61" port 22
debug3: resolve_host: lookup compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61:22
debug3: ssh_connect_direct: entering
debug1: Connecting to compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61 [10.244.1.227] port 22.
debug3: set_sock_tos: set socket 3 IP_TOS 0x10
debug1: connect to address 10.244.1.227 port 22: Connection refused
ssh: connect to host compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61 port 22: Connection refused
```
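One way to pin down the mismatch is to map the address the resolver returned back to a pod. This is only a sketch: the pod names and IPs below are illustrative stand-ins (the real list would come from `kubectl -n compss get pods -o wide`, where the IP is the sixth column), and the awk one-liner is the part doing the work:

```shell
# Hypothetical pod list; on a live cluster replace this with:
#   kubectl -n compss get pods -o wide --no-headers | awk '{print $1, $6}'
pod_list='compss-matmul-4fc9d61-worker-0 10.244.1.131
some-other-pod 10.244.1.227'
resolved_ip=10.244.1.227

# Print the pod that actually owns the address DNS handed back.
printf '%s\n' "$pod_list" | awk -v ip="$resolved_ip" '$2 == ip {print $1}'
```

If the owner turns out not to be the worker pod itself, the DNS record and the workload have drifted apart, which would match the connection-refused symptom.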
albacanete commented 2 days ago

Debugging further: even though I executed the command `skupper -n compss expose statefulset compss-matmul-4fc9d61-worker --headless --port 22`, when I try to list the exposed services I get nothing...

```
acanete@rpi42:~$ skupper -n compss service status
No services defined
```
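If I remember the v1 internals correctly, Skupper records exposed services in a `skupper-services` ConfigMap in the site namespace, so an empty `skupper service status` can be cross-checked against that object directly. A hedged sketch; the JSON sample below is hypothetical, shaped roughly like a service definition entry, and is parsed locally only to show what to look for:

```shell
# On the cluster you would inspect the ConfigMap itself:
#   kubectl -n compss get cm skupper-services -o jsonpath='{.data}'
# Hypothetical sample of one entry, filtered locally:
sample='{"address":"compss-matmul-4fc9d61","protocol":"tcp","ports":[22]}'
printf '%s\n' "$sample" | grep -o '"address":"[^"]*"'
```

If the ConfigMap has no entry for the service, the expose command never registered on that site, which would explain the empty status output.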
fgiorgetti commented 1 day ago

Some clusters might have SecurityContextConstraints preventing pods from running as root, in which case they won't be able to bind system ports (<1024). I am not sure if that is what you're facing, but make sure the worker pods created by Skupper on the remote cluster do not have any issue binding port 22. For example:

```
$ kubectl logs compss-matmul-4fc9d61-worker-0 | grep denied | tail -1
2024-11-20 14:12:02.035365 +0000 FLOW_LOG (info) LOG [hlGYm:11628922] BEGIN END parent=hlGYm:0 logSeverity=3 logText=LOG_ROUTER: Listener ingress:22: proactor listener error on 0.0.0.0:22: proton:io (Permission denied - listen on 0.0.0.0:22) sourceFile=/build/src/adaptors/adaptor_listener.c sourceLine=172
```

This could indicate that the pods created by Skupper are unable to bind port 22.
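As a self-contained illustration of that check, here is the same kind of `grep` applied to the log text quoted earlier in this comment (the line is copied from that output, so no cluster is needed to see what a failed bind looks like):

```shell
# Log text taken verbatim from the kubectl output above.
log='LOG_ROUTER: Listener ingress:22: proactor listener error on 0.0.0.0:22: proton:io (Permission denied - listen on 0.0.0.0:22)'

# Extract just the failure reason from the router log line.
printf '%s\n' "$log" | grep -o 'Permission denied[^)]*'
```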

Anyway, I have made some small modifications to your original statefulset to use port 2222 instead, as a way to ensure the system ports are not the root cause.

https://gist.github.com/fgiorgetti/953722df46088a98b2f5f49d6a22ec93

I have deployed the Statefulset above (basically yours with a custom image) to a local cluster named west. Then I linked the west cluster to a remote cluster I am calling east.

At this point, the statefulset is running on the west cluster and I have not yet exposed it to the Skupper network.

Here is how it looks from the west cluster:

```
west $ kubectl get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE     IP             NODE       NOMINATED NODE   READINESS GATES
compss-matmul-4fc9d61-worker-0   1/1     Running   0          8m47s   10.244.5.213   minikube   <none>           <none>
compss-matmul-4fc9d61-worker-1   1/1     Running   0          8m23s   10.244.5.214   minikube   <none>           <none>
```

```
west $ kubectl get service compss-matmul-4fc9d61 -o wide
NAME                    TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE    SELECTOR
compss-matmul-4fc9d61   ClusterIP   None         <none>        2222/TCP   9m6s   app=compss,wf_id=compss-matmul-4fc9d61
```

Running an SSH client pod on the "west" cluster, where the SSHD worker pods are actually running, I can establish a connection (note that the IP returned is the pod IP, and that Skupper does not manipulate IPs or DNS):

```
west $ kubectl run ssh-client -it --image quay.io/fgiorgetti/rhel9-sshd -- bash
```

```
@.*** /]# ping compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61
PING compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61.fg1.svc.cluster.local (10.244.5.213) 56(84) bytes of data.
64 bytes from compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61.fg1.svc.cluster.local (10.244.5.213): icmp_seq=1 ttl=64 time=0.050 ms
```

```
@. /]# ssh -p 2222 @.
The authenticity of host '[compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61]:2222 ([10.244.5.213]:2222)' can't be established.
ED25519 key fingerprint is SHA256:lyyTCcGkE2kYBaaIFUzPVYD1vmT4Si/S7mTUPiNTJAs.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '[compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61]:2222' (ED25519) to the list of known hosts.
@.*** ~]#
```

Skupper has not been involved so far. Now, let's expose the statefulset running on the west cluster and try to access its worker pods from the remote cluster (east).

```
west $ skupper expose statefulset compss-matmul-4fc9d61-worker --port 2222 --headless
statefulset compss-matmul-4fc9d61-worker exposed as compss-matmul-4fc9d61
```

Looking at the "east" cluster now:

```
east $ kubectl get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE   IP              NODE          NOMINATED NODE   READINESS GATES
compss-matmul-4fc9d61-worker-0   1/1     Running   0          24s   172.17.44.224   10.240.0.16   <none>           <none>
compss-matmul-4fc9d61-worker-1   1/1     Running   0          21s   172.17.59.174   10.240.0.4    <none>           <none>
```

```
east $ kubectl get service compss-matmul-4fc9d61 -o wide
NAME                    TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE   SELECTOR
compss-matmul-4fc9d61   ClusterIP   None         <none>        2222/TCP   39s   internal.skupper.io/service=compss-matmul-4fc9d61
```

Now that everything is ready, let me run the ssh-client there. Observe that the IP is correct and that I am able to access the SSH server:

```
east $ kubectl run ssh-client -it --image quay.io/fgiorgetti/rhel9-sshd -- bash
```

```
@.*** /]# ping compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61
PING compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61.fg1.svc.cluster.local (172.17.44.224) 56(84) bytes of data.
64 bytes from compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61.fg1.svc.cluster.local (172.17.44.224): icmp_seq=1 ttl=63 time=0.110 ms
```

```
@. /]# ssh -p 2222 @.
The authenticity of host '[compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61]:2222 ([172.17.44.224]:2222)' can't be established.
ED25519 key fingerprint is SHA256:lyyTCcGkE2kYBaaIFUzPVYD1vmT4Si/S7mTUPiNTJAs.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '[compss-matmul-4fc9d61-worker-0.compss-matmul-4fc9d61]:2222' (ED25519) to the list of known hosts.
Last login: Wed Nov 20 14:53:37 2024 from 10.244.5.215
@.*** ~]#
```

Would you be able to try again using the modified YAMLs (with port 2222 instead)?
