projectatomic / atomic-system-containers

Collection of system container images

kube-proxy fails to expose NodePort because of r/o filesystem #155

Open neuhalje opened 6 years ago

neuhalje commented 6 years ago

The containerised kube-proxy fails to expose services with NodePort because it cannot lock /run/xtables.lock (open /run/xtables.lock: read-only file system).

Version used

sudo atomic images list
...
>  registry.fedoraproject.org/f27/kubernetes-proxy     latest   68406693c322   2017-12-10 17:26   237.16 MB      ostree

Service definition

Given the following yaml:

apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  selector:
    matchLabels:
      app: nginx
  replicas: 2
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.8
        ports:
        - containerPort: 80

---
kind: Service
apiVersion: v1
metadata:
  name: my-nginx
spec:
  type: NodePort
  selector:
    app: nginx
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
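
The manifests above can be applied in one step (the file name here is assumed):

# apply both the Deployment and the Service
kubectl apply -f nginx-nodeport.yaml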

kubectl

kubectl get pods

NAME                                READY     STATUS    RESTARTS   AGE
nginx-deployment-3718365652-1ht6q   1/1       Running   1          1h
nginx-deployment-3718365652-9g2jq   1/1       Running   1          1h
kubectl get service my-nginx

NAME       TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
my-nginx   NodePort   10.254.106.83   <none>        80:32315/TCP   1h

Expected behaviour

The service is reachable via the NodePort (32315) on the node's IP.

Observed behaviour

The port is not exposed.

sudo netstat -tulpen | grep proxy

tcp        0      0 127.0.0.1:10249         0.0.0.0:*               LISTEN      994        25072      797/kube-proxy
tcp6       0      0 :::10256                :::*                    LISTEN      994        24025      797/kube-proxy

Although I can connect to the ports kube-proxy itself is listening on:

curl http://127.0.0.1:10249/

404 page not found

journalctl -xe -u kube-proxy.service returns the following errors:

Failed to start in resource-only container "/kube-proxy": mkdir /sys/fs/cgroup/cpuset/kube-proxy: read-only file system
...
Failed to execute iptables-restore: failed to open iptables lock /run/xtables.lock: open /run/xtables.lock: read-only file system

# the last line is repeated every 30 seconds
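
One way to confirm the read-only /run mount from the host is to read the mount table of the kube-proxy process itself (a sketch; assumes a single process named kube-proxy):

# a read-only bind shows "ro" in the mount options field
sudo grep ' /run ' /proc/$(pidof kube-proxy)/mounts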
ashcrow commented 6 years ago

Thanks for the report @neuhalje! It looks like this is due to the latest version not being available in Fedora, as /run should be mounted in from the host system: https://github.com/projectatomic/atomic-system-containers/blob/master/kubernetes-proxy/config.json.template#L324-L334
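
For reference, a writable host /run bind in an OCI config.json typically looks like this (a sketch; the exact entry in the linked template may differ):

{
  "destination": "/run",
  "type": "bind",
  "source": "/run",
  "options": ["rbind", "rw"]
}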

@jasonbrooks can you push through an update?

ashcrow commented 6 years ago

Related: https://pagure.io/releng/issue/7217

neuhalje commented 6 years ago

Update

I upgraded the system & containers:

Status

Installed Versions

sudo atomic images list | grep proxy
>  registry.fedoraproject.org/f27/kubernetes-proxy     latest   4660f3d3b9a3   2018-01-13 10:13   262.53 MB      ostree

Log

journalctl -xe -u kube-proxy.service
...
-- Unit kube-proxy.service has finished starting up.
--
-- The start-up result is done.
Jan 13 10:11:54 node-1.[redacted] runc[772]: 2018-01-13 10:11:54.524258 I | proto: duplicate proto type registered: google.protobuf.Any
Jan 13 10:11:54 node-1.[redacted] runc[772]: 2018-01-13 10:11:54.537896 I | proto: duplicate proto type registered: google.protobuf.Duration
Jan 13 10:11:54 node-1.[redacted] runc[772]: 2018-01-13 10:11:54.538171 I | proto: duplicate proto type registered: google.protobuf.Timestamp
Jan 13 10:11:54 node-1.[redacted] runc[772]: W0113 10:11:54.934872       1 server.go:190] WARNING: all flags other than --config, --write-config-to, and --cleanup-iptables are deprecated. Please begin using a config file ASAP.
Jan 13 10:11:55 node-1.[redacted] runc[772]: I0113 10:11:55.133819       1 server.go:478] Using iptables Proxier.
Jan 13 10:11:55 node-1.[redacted] runc[772]: W0113 10:11:55.155968       1 proxier.go:488] clusterCIDR not specified, unable to distinguish between internal an
Jan 13 10:11:55 node-1.[redacted] runc[772]: I0113 10:11:55.156343       1 server.go:513] Tearing down userspace rules.
Jan 13 10:11:55 node-1.[redacted] runc[772]: W0113 10:11:55.475478       1 server.go:628] Failed to start in resource-only container "/kube-proxy": mkdir /sys/fs/cgroup/cpuset/kube-proxy: read-only file system
Jan 13 10:11:55 node-1.[redacted] runc[772]: I0113 10:11:55.476775       1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
Jan 13 10:11:55 node-1.[redacted] runc[772]: I0113 10:11:55.476989       1 conntrack.go:52] Setting nf_conntrack_max to 131072
Jan 13 10:11:55 node-1.[redacted] runc[772]: I0113 10:11:55.477159       1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to
Jan 13 10:11:55 node-1.[redacted] runc[772]: I0113 10:11:55.477307       1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3
Jan 13 10:11:55 node-1.[redacted] runc[772]: I0113 10:11:55.478839       1 config.go:202] Starting service config controller
Jan 13 10:11:55 node-1.[redacted] runc[772]: I0113 10:11:55.478880       1 config.go:102] Starting endpoints config controller
Jan 13 10:11:55 node-1.[redacted] runc[772]: I0113 10:11:55.524651       1 controller_utils.go:994] Waiting for caches to sync for service config controller

Nodes

 kubectl get nodes
NAME                               STATUS    ROLES     AGE       VERSION
node-1.[redacted       ]   Ready     <none>    33d       v1.7.3

node-1 has the ip address 172.20.61.51.

Services

The service is running:

 kubectl get service
NAME         TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)        AGE
kubernetes   ClusterIP   10.254.0.1     <none>        443/TCP        33d
my-nginx    NodePort    10.254.17.99   <none>        80:30849/TCP   3h
kubectl describe service my-nginx

Name:                     my-nginx
Namespace:                default
Labels:                   <none>
Annotations:              <none>
Selector:                 app=nginx
Type:                     NodePort
IP:                       10.254.17.99
Port:                     <unset>  80/TCP
TargetPort:               80/TCP
NodePort:                 <unset>  30849/TCP
Endpoints:                172.17.0.2:80,172.17.0.3:80
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>
 kubectl get pods
NAME                                READY     STATUS    RESTARTS   AGE
nginx-deployment2-540558622-9zxwt   1/1       Running   1          3h
nginx-deployment2-540558622-jzjv0   1/1       Running   1          3h

Analysis

Log

Compared to the old output, the first message is still logged, but Failed to execute iptables-restore: failed to open iptables lock /run/xtables.lock: open /run/xtables.lock: read-only file system is no longer logged:

Failed to start in resource-only container "/kube-proxy": mkdir /sys/fs/cgroup/cpuset/kube-proxy: read-only file system
...

Accessing the service from the node works

On the node the service can be accessed (curl http://172.20.61.51:30849 succeeds).

Accessing the service from other systems does not work

From my laptop, the service cannot be accessed (curl http://172.20.61.51:30849 hangs).

tcpdump on the node shows that the laptop's initial SYN packets get no reply:

# on the host node-1
sudo tcpdump -nn port 30849
...
13:55:15.300884 IP 172.20.10.50.54187 > 172.20.61.51.30849: Flags [S], seq 1814234020, win 65535, options [mss 1460,nop,wscale 5,nop,nop,TS val 1060856956 ecr 0,sackOK,eol], length 0
13:55:16.303943 IP 172.20.10.50.54187 > 172.20.61.51.30849: Flags [S], seq 1814234020, win 65535, options [mss 1460,nop,wscale 5,nop,nop,TS val 1060857956 ecr 0,sackOK,eol], length 0
...

Firewall

iptables has rules for the service:

# on the host node-1
sudo iptables -n -L -t nat
....
Chain KUBE-NODEPORTS (1 references)
target     prot opt source               destination
KUBE-MARK-MASQ  tcp  --  0.0.0.0/0            0.0.0.0/0            /* default/my-nginx: */ tcp dpt:32474
KUBE-SVC-BEPXDJBUHFCSYIC3  tcp  --  0.0.0.0/0            0.0.0.0/0            /* default/my-nginx: */ tcp dpt:32474

...

Chain KUBE-SEP-BLX3X6UTIG6UGCA2 (1 references)
target     prot opt source               destination
KUBE-MARK-MASQ  all  --  172.17.0.5           0.0.0.0/0            /* default/my-nginx: */
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            /* default/my-nginx: */ tcp to:172.17.0.5:80

...

Chain KUBE-SERVICES (2 references)
target     prot opt source               destination
KUBE-MARK-MASQ  tcp  -- !172.17.0.0/16        10.254.0.1           /* default/kubernetes:https cluster IP */ tcp dpt:443
KUBE-SVC-NPX46M4PTMTKRN6Y  tcp  --  0.0.0.0/0            10.254.0.1           /* default/kubernetes:https cluster IP */ tcp dpt:443
KUBE-MARK-MASQ  tcp  -- !172.17.0.0/16        10.254.168.195       /* default/my-nginx: cluster IP */ tcp dpt:80
KUBE-SVC-BEPXDJBUHFCSYIC3  tcp  --  0.0.0.0/0            10.254.168.195       /* default/my-nginx: cluster IP */ tcp dpt:80
KUBE-MARK-MASQ  tcp  -- !172.17.0.0/16        10.254.210.142       /* ingress-nginx/default-http-backend: cluster IP */ tcp dpt:80
KUBE-SVC-J4PGGZ6AUXZWNA2B  tcp  --  0.0.0.0/0            10.254.210.142       /* ingress-nginx/default-http-backend: cluster IP */ tcp dpt:80
KUBE-NODEPORTS  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL

Chain KUBE-SVC-BEPXDJBUHFCSYIC3 (2 references)
target     prot opt source               destination
KUBE-SEP-BLX3X6UTIG6UGCA2  all  --  0.0.0.0/0            0.0.0.0/0            /* default/my-nginx: */ statistic mode random probability 0.50000000000
KUBE-SEP-J5WBW7HEOGAHN6ZG  all  --  0.0.0.0/0            0.0.0.0/0            /* default/my-nginx: */

...
# Outbound

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
MASQUERADE  all  --  172.17.0.0/16        0.0.0.0/0
KUBE-POSTROUTING  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */

...

Chain KUBE-POSTROUTING (1 references)
target     prot opt source               destination
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ mark match 0x4000/0x4000
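
For reference, the NodePort path these rules implement is PREROUTING → KUBE-SERVICES → KUBE-NODEPORTS → KUBE-SVC-* (random endpoint selection) → KUBE-SEP-* (DNAT to a pod IP), with KUBE-POSTROUTING masquerading marked traffic on the way out. A packet can be watched walking that path with the TRACE target (a sketch; the port number and logging module are assumptions for this setup):

# on the host node-1: trace inbound TCP to the NodePort through the ruleset
sudo modprobe nf_log_ipv4
sudo iptables -t raw -I PREROUTING -p tcp --dport 30849 -j TRACE
# TRACE hits are written to the kernel log
sudo journalctl -k -f | grep 'TRACE:'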
neuhalje commented 6 years ago

@ashcrow Is this a bug or a setup problem on my side?

ashcrow commented 6 years ago

@jasonbrooks ^^

jasonbrooks commented 6 years ago

I'm looking into this

jasonbrooks commented 6 years ago

@neuhalje It might be a setup problem on your side. I'm testing this on a three-node cluster with system containers installed, and my nodeport is exposed on each of my nodes, and I'm able to curl the nginx server.

I am getting the /sys/fs/cgroup/cpuset/kube-proxy: read-only file system error as well. The system containers for the openshift origin node (https://github.com/openshift/origin/blob/release-3.7/images/node/system-container/config.json.template), which cover the kubelet and the proxy components, bind /sys rw. We could take that approach, or we could change our ro bind of /sys/fs/cgroup to rw.
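
For illustration, the second option would amount to flipping the /sys/fs/cgroup mount options in the proxy container's config.json.template from ro to rw, roughly like this (a sketch, not the exact diff):

{
  "destination": "/sys/fs/cgroup",
  "type": "bind",
  "source": "/sys/fs/cgroup",
  "options": ["rbind", "rw"]
}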

A wider issue is that we need to update and refine our suggested kubernetes setup process. I've always used https://github.com/kubernetes/contrib/tree/master/ansible, but those scripts have been deprecated in favor of a different ansible-based approach that doesn't use these system containers at all.

I think it might make sense to try and work out upstream kube master and node roles that work with https://github.com/openshift/openshift-ansible.

neuhalje commented 6 years ago

@jasonbrooks Aligning installation and configuration with other projects is a good idea.

I will close the issue because with the updated containers it very likely is a layer 8 problem on my side. Thank you for looking into this!

deuscapturus commented 6 years ago

I've hit this same issue.

I'm able to connect to the tutor-proxy nodePort locally, but not remotely. I'm running the latest available version of the kube-proxy system container from registry.fedoraproject.org/f27/kubernetes-proxy.

kube-proxy output:

Feb 13 18:40:45 ip-10-107-20-177.us-west-2.compute.internal systemd[1]: Started kubernetes-proxy.
Feb 13 18:40:46 ip-10-107-20-177.us-west-2.compute.internal runc[6281]: 2018-02-13 18:40:46.089456 I | proto: duplicate proto type registered: google.protobuf.Any
Feb 13 18:40:46 ip-10-107-20-177.us-west-2.compute.internal runc[6281]: 2018-02-13 18:40:46.089550 I | proto: duplicate proto type registered: google.protobuf.Duration
Feb 13 18:40:46 ip-10-107-20-177.us-west-2.compute.internal runc[6281]: 2018-02-13 18:40:46.089570 I | proto: duplicate proto type registered: google.protobuf.Timestamp
Feb 13 18:40:46 ip-10-107-20-177.us-west-2.compute.internal runc[6281]: W0213 18:40:46.133750       1 server.go:190] WARNING: all flags other than --config, --write-config-to, and --cleanup-iptables are deprecated. Please begin using a config file ASAP.
Feb 13 18:40:46 ip-10-107-20-177.us-west-2.compute.internal runc[6281]: I0213 18:40:46.140159       1 server.go:478] Using iptables Proxier.
Feb 13 18:40:46 ip-10-107-20-177.us-west-2.compute.internal runc[6281]: W0213 18:40:46.145704       1 proxier.go:488] clusterCIDR not specified, unable to distinguish between internal and external traffic
Feb 13 18:40:46 ip-10-107-20-177.us-west-2.compute.internal runc[6281]: I0213 18:40:46.146164       1 server.go:513] Tearing down userspace rules.
Feb 13 18:40:46 ip-10-107-20-177.us-west-2.compute.internal runc[6281]: W0213 18:40:46.156672       1 server.go:628] Failed to start in resource-only container "/kube-proxy": mkdir /sys/fs/cgroup/cpuset/kube-proxy: read-only file system
Feb 13 18:40:46 ip-10-107-20-177.us-west-2.compute.internal runc[6281]: I0213 18:40:46.157028       1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
Feb 13 18:40:46 ip-10-107-20-177.us-west-2.compute.internal runc[6281]: I0213 18:40:46.157116       1 conntrack.go:52] Setting nf_conntrack_max to 131072
Feb 13 18:40:46 ip-10-107-20-177.us-west-2.compute.internal runc[6281]: I0213 18:40:46.157264       1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
Feb 13 18:40:46 ip-10-107-20-177.us-west-2.compute.internal runc[6281]: I0213 18:40:46.157307       1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
Feb 13 18:40:46 ip-10-107-20-177.us-west-2.compute.internal runc[6281]: I0213 18:40:46.157646       1 config.go:202] Starting service config controller
Feb 13 18:40:46 ip-10-107-20-177.us-west-2.compute.internal runc[6281]: I0213 18:40:46.157662       1 controller_utils.go:994] Waiting for caches to sync for service config controller
Feb 13 18:40:46 ip-10-107-20-177.us-west-2.compute.internal runc[6281]: I0213 18:40:46.157701       1 config.go:102] Starting endpoints config controller
Feb 13 18:40:46 ip-10-107-20-177.us-west-2.compute.internal runc[6281]: I0213 18:40:46.157708       1 controller_utils.go:994] Waiting for caches to sync for endpoints config controller
Feb 13 18:40:46 ip-10-107-20-177.us-west-2.compute.internal runc[6281]: I0213 18:40:46.257846       1 controller_utils.go:1001] Caches are synced for endpoints config controller
Feb 13 18:40:46 ip-10-107-20-177.us-west-2.compute.internal runc[6281]: I0213 18:40:46.257870       1 controller_utils.go:1001] Caches are synced for service config controller
ashcrow commented 6 years ago

Reopening. @jasonbrooks can you reproduce?

deuscapturus commented 6 years ago

It has been stated that this issue will be resolved with https://github.com/projectatomic/atomic-system-containers/commit/2d50826014e28d496252060b4739643e1c5ce425

But I doubt that the above fix applies to the kubernetes-proxy system container; it looks like it only applies to the kubelet container.

jasonbrooks commented 6 years ago

@deuscapturus Right, I'm going to test adding a similar fix in the kube-proxy container

jasonbrooks commented 6 years ago

@deuscapturus So, I tested the change, and it got rid of the error, but I'm able to access my nodeport from a separate system with or without the change.

I can try to reproduce what you're seeing. Do you have a test manifest or something I can try?

deuscapturus commented 6 years ago

My problem is somewhere in iptables. I'm able to connect to my service externally on the nodePort when I change kube-proxy to --proxy-mode=userspace.
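
For reference, that switch amounts to adding the flag to the proxy's arguments and restarting it (a sketch; the config file location is an assumption based on Fedora's packaging):

# append the flag to the proxy arguments (assumed file: /etc/kubernetes/proxy)
KUBE_PROXY_ARGS="--proxy-mode=userspace"
# then restart the proxy
sudo systemctl restart kube-proxy.service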

@jasonbrooks as your test suggests, the ro filesystem error/warning is an entirely different issue. Would you prefer a new issue or to change the title on this one?

jasonbrooks commented 6 years ago

@deuscapturus we can keep this issue. I'm curious whether, if you install and run the proxy from the rpm, you will still have this issue. The following command will do it. I'm including a download of the particular package because the current latest kube in f27 is 1.9.1, but a system container w/ that version hasn't been released yet.

atomic uninstall kube-proxy &&  curl -O https://kojipkgs.fedoraproject.org//packages/kubernetes/1.7.3/1.fc27/x86_64/kubernetes-node-1.7.3-1.fc27.x86_64.rpm && rpm-ostree install kubernetes-node-1.7.3-1.fc27.x86_64.rpm -r
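
After the reboot triggered by -r, the rpm's proxy can then be enabled in place of the system container (a sketch; the unit name is assumed from the Fedora packaging):

sudo systemctl enable --now kube-proxy.service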