rootless-containers / usernetes

Kubernetes without the root privileges
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2033-kubelet-in-userns-aka-rootless
Apache License 2.0

Exposing pod service with Ingress #318

Closed vsoch closed 6 months ago

vsoch commented 9 months ago

I know that I need to add additional ports to the docker-compose.yaml for them to be exposed (e.g., for a service running in a pod). I did this for several ports, and tried both with and without https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/kind/deploy.yaml, but I can't seem to access the service. For detail: the service (confirmed working with this setup on my local machine, and from inside the pod via a curl to localhost) should be exposed with this Service and Ingress:

apiVersion: v1
kind: Service
metadata:
  name: ml-service
spec:
  selector:
    run: ml-service
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ml-ingress
spec:
  rules:
  - host: localhost
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ml-service
            port: 
              number: 8080

And the selector for the pod:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-server
spec:
  selector:
    matchLabels:
      run: ml-service

But it can't seem to see it:

$ curl  localhost/api/
curl: (7) Failed to connect to localhost port 80 after 0 ms: Connection refused

$ curl -k localhost:8080/api/
curl: (7) Failed to connect to localhost port 8080 after 0 ms: Connection refused
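For context, the docker-compose.yaml port additions I mentioned above look roughly like this (a sketch; the service name `node` and the exact file layout in usernetes may differ):

```yaml
# Sketch of the extra host->container port mappings in docker-compose.yaml
# (service name "node" is assumed from the usernetes compose file)
services:
  node:
    ports:
      - "80:80"      # for the ingress controller
      - "443:443"
      - "8080:8080"  # for the ml-service
```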

From inside the pod, this is what we should see (but via the service, without the :8080):

[screenshot: curl output from inside the pod showing the API response]

I'm thinking about this more, and I think I need to force a recreate of the container (so far I had only restarted it), so I'll try that and report back! I hadn't done that because I was afraid I'd lose the join-command, but that's not a big deal to redo. E.g.,

docker compose up -d --force-recreate
vsoch commented 9 months ago

okay tried this on the VM, restarting everything:

echo net.ipv4.ip_unprivileged_port_start=443 >> /etc/sysctl.conf 
echo net.ipv4.ip_unprivileged_port_start=80 >> /etc/sysctl.conf 
echo net.ipv4.ip_unprivileged_port_start=8080 >> /etc/sysctl.conf 
sysctl -p
systemctl daemon-reload
systemctl restart docker

And then tried to recreate the docker compose setup, but I'm still getting this error:

[+] Running 2/2
 ✔ Network usernetes_default   Created                                                                            0.1s 
 ✔ Container usernetes-node-1  Created                                                                            0.1s 
Error response from daemon: driver failed programming external connectivity on endpoint usernetes-node-1 (8e49fdcac74805e5a05c53aea638bfde6e1abeac6e59f673c3e91787000425ad): Error starting userland proxy: error while calling PortManager.AddPort(): cannot expose privileged port 6443, you can add 'net.ipv4.ip_unprivileged_port_start=6443' to /etc/sysctl.conf (currently 8080), or set CAP_NET_BIND_SERVICE on rootlesskit binary, or choose a larger port number (>= 8080): listen tcp4 0.0.0.0:6443: bind: permission denied
make: *** [Makefile:64: up] Error 1

That was a port that was previously working, so likely I need to start fresh: add these ports to the initial docker-compose and the system setup, then try the creation from scratch.

vsoch commented 9 months ago

oh I see the issue - net.ipv4.ip_unprivileged_port_start is a single threshold, so my last line (8080) snuffed out the lower values! Going to try again and dangerously set it to 0 (don't worry, this cluster is extremely isolated).
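For anyone following along, the corrected form is a single line: the kernel keeps only one threshold, so repeated lines in /etc/sysctl.conf just overwrite each other, and the right value is the lowest port you need (or 0 to disable the restriction entirely):

```shell
# net.ipv4.ip_unprivileged_port_start is a single threshold, not a list:
# the last line in /etc/sysctl.conf wins. Use the lowest port you need
# (here 0, to allow binding any port unprivileged; isolated hosts only).
echo 'net.ipv4.ip_unprivileged_port_start=0' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
```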

vsoch commented 9 months ago

okay, update - no issue with the usernetes setup (I've added the ports and started with no error messages), and now instead of failing to connect I'm just seeing an empty response:

$ curl -v http://localhost/api
*   Trying 127.0.0.1:80...
* Connected to localhost (127.0.0.1) port 80 (#0)
> GET /api HTTP/1.1
> Host: localhost
> User-Agent: curl/7.81.0
> Accept: */*
> 
* Empty reply from server
* Closing connection 0
curl: (52) Empty reply from server

My service, and ingress:

$ kubectl  describe ingress 
Name:             ml-ingress
Labels:           <none>
Namespace:        default
Address:          localhost
Ingress Class:    <none>
Default backend:  <default>
Rules:
  Host        Path  Backends
  ----        ----  --------
  localhost   
              /   ml-service:8080 (10.244.1.2:8080)
Annotations:  <none>
Events:
  Type    Reason  Age                    From                      Message
  ----    ------  ----                   ----                      -------
  Normal  Sync    3m43s (x2 over 3m43s)  nginx-ingress-controller  Scheduled for sync
$ kubectl  describe svc ml-service 
Name:              ml-service
Namespace:         default
Labels:            <none>
Annotations:       <none>
Selector:          run=ml-service
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.96.121.83
IPs:               10.96.121.83
Port:              <unset>  8080/TCP
TargetPort:        8080/TCP
Endpoints:         10.244.1.2:8080
Session Affinity:  None
Events:            <none>

I'll keep thinking about what the empty response might mean. If you have ideas let me know what we might try!
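One detail that stands out in the `describe ingress` output above is `Ingress Class: <none>`. I'm not certain this is the cause, but the kind deploy of ingress-nginx installs an IngressClass named `nginx`, and an Ingress without a class set may be ignored unless that class is marked as default. A sketch of the change:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ml-ingress
spec:
  ingressClassName: nginx  # matches the IngressClass created by ingress-nginx
  rules:
  - host: localhost
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ml-service
            port:
              number: 8080
```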

vsoch commented 9 months ago

okay I had another idea, just for debugging! I looked up the node where the pod is running with our service:

$ kubectl get pods -o wide
NAME                         READY   STATUS    RESTARTS   AGE   IP           NODE           NOMINATED NODE   READINESS GATES
ml-server-55d6f7b4c5-s6ffw   1/1     Running   0          96s   10.244.4.2   u7s-u2204-05   <none>           <none>

Then I shelled in and tested that I could reach the service via the pod IP (I could):

$ make shell
docker compose exec -e U7S_HOST_IP=192.168.65.125 -e U7S_NODE_NAME=u7s-u2204-05 -e U7S_NODE_SUBNET=10.100.177.0/24 node bash

And then explicitly curled the pod address, which worked:

# curl -k 10.244.4.2:8080/api/ | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   257  100   257    0     0   125k      0 --:--:-- --:--:-- --:--:--  250k
{
  "id": "django_river_ml",
  "status": "running",
  "name": "Django River ML Endpoint",
  "description": "This service provides an api for models",
  "documentationUrl": "https://vsoch.github.io/django-river-ml",
  "storage": "shelve",
  "river_version": "0.21.0",
  "version": "0.0.21"
}
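An intermediate check that would narrow this down further (a sketch using the ClusterIP from the `describe svc` output above) is to curl the Service address from the node, which exercises kube-proxy routing rather than hitting the pod directly:

```shell
# From inside the node container: hit the Service's ClusterIP
# (10.96.121.83 taken from `kubectl describe svc ml-service` above).
# If the pod IP answers but the ClusterIP does not, the problem is in
# service routing (kube-proxy/iptables), not in the application itself.
curl -k 10.96.121.83:8080/api/
```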

So that tells us that everything is running OK in the pod, but there is an issue with the service exposure. Exiting from there, let's look at what docker compose is mapping:

$ docker compose ps
WARN[0000] The "U7S_HOST_IP" variable is not set. Defaulting to a blank string. 
WARN[0000] The "U7S_NODE_NAME" variable is not set. Defaulting to a blank string. 
WARN[0000] The "U7S_NODE_SUBNET" variable is not set. Defaulting to a blank string. 
NAME               IMAGE            COMMAND                                                     SERVICE   CREATED          STATUS          PORTS
usernetes-node-1   usernetes-node   "/u7s-entrypoint.sh /usr/local/bin/entrypoint /sbin/init"   node      32 minutes ago   Up 32 minutes   0.0.0.0:80->80/tcp, :::80->80/tcp, 0.0.0.0:443->443/tcp, :::443->443/tcp, 0.0.0.0:2379->2379/tcp, :::2379->2379/tcp, 0.0.0.0:6443->6443/tcp, :::6443->6443/tcp, 0.0.0.0:8080->8080/tcp, :::8080->8080/tcp, 0.0.0.0:10250->10250/tcp, :::10250->10250/tcp, 0.0.0.0:8472->8472/udp, :::8472->8472/udp

We can see it's mapping port 8080, so we should be able to access the service from outside of that container?

$ curl -k localhost:8080/api/
curl: (52) Empty reply from server

That didn't work, along with all the derivatives I tried. So the issue seems to be the service inside docker-compose not being exposed to the VM running the container. Something does seem to be listening (it's not that it doesn't exist), but the reply is empty. And just to clarify (because this is a common bug): the server is bound to 0.0.0.0, not localhost or 127.0.0.1.
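On that last point, a quick generic way to sanity-check bind addresses (not specific to this cluster): a server bound to 127.0.0.1 only answers on loopback, while one bound to 0.0.0.0 answers on every interface. Inside the real container, `ss -ltnp` shows which address each listener is actually bound to.

```shell
# Demo: a throwaway server bound to loopback only (hypothetical port 9013).
python3 -m http.server 9013 --bind 127.0.0.1 >/dev/null 2>&1 &
SRV=$!
sleep 1
# Loopback answers; the same request against the machine's external
# IP would be refused, since the listener is not bound to it.
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:9013/
kill $SRV
```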

vsoch commented 9 months ago

@AkihiroSuda my colleague had an insight that gave us (at least for now) a solution that allows us to reach the host running the pod directly! The missing piece was defining the hostPort; here is the diff for the relevant section.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-server
spec:
  selector:
    matchLabels:
      run: ml-service
  replicas: 1
  template:
    metadata:
      labels:
        run: ml-service
    spec:
      containers:
      - name: ml-service
        image: ghcr.io/converged-computing/lammps-stream-ml:test-server
        # These should be secrets, but OK to test
+        # EXTREMELY IMPORTANT: we need to set the host port so it's mapped to the same as usernetes
        ports:
        - containerPort: 8080
+         hostPort: 8080
        - containerPort: 80
+         hostPort: 80

And then we can hit that endpoint from any node, explicitly targeting the host and port:

[screenshot: successful curl to the node's hostname and port, returning the API response]

It would be good to eventually figure out a solution so this just works across localhost, but this should work for us for now since we are prototyping the setup for basic experiments. Thanks to @milroy for figuring this out - we are unblocked with this fix! :tada:

Also @AkihiroSuda this work with usernetes is super cool and coming along quite nicely, and we have you to thank for that! I'm going to be sharing a tiny bit of it at FOSDEM in early February if you are interested. It's a big open source conference (and this is a DevRoom) so likely you've heard of it, but I wanted to share so we can have good collaboration across our HPC and cloud communities.

AkihiroSuda commented 9 months ago

Nice 👍 , I'm not likely going to FOSDEM this year, but I'll check the slides online

vsoch commented 6 months ago

Wow, time flies - thanks again for your help on this @AkihiroSuda! I thought of it because I'm running this again, just on slightly larger / better infrastructure (network- and scale-wise). To return to our last correspondence, for those interested in the talk, it's the Bare Metal Bros and was really fun to do - we are hoping to extend this to a reproducible setup for others to use (actually I'm mostly done with that as of this week, just tidying it up for our own experiments). AWS uses the Elastic Fabric Adapter (EFA), and getting that working with usernetes took some elbow grease!

I can confirm this strategy of exposing the service via the docker-compose and hostPort (still) works like a charm, but now on AWS with EFA. I'm good to close - thank you again! And thank you for everything you do for our communities; it's really admirable and inspiring.