Ph0tonic opened this issue 1 year ago
Did you run into this in a GKE based cluster using Cilium via GCP's dataplane v2, or was this a cluster setup in another way?
Ok, so I do not think that it is a GKE-based cluster. Sorry, I am not familiar with the cluster setup, but what I found is that the runtime engine is `containerd://1.6.15-k3s1` and Cilium is configured.
Ah, it's a k3s-based cluster. Then I think the main issue is that network policies are enforced at all (Cilium, Calico), and that access to the k8s internals is restricted there but not in other clusters.
@Ph0tonic The existing core network policy takes care of kube-apiserver egress on GKE. I have been testing JupyterHub on GKE Autopilot for a few weeks now and do not see any other issues so far. You can check the details in my post; note the `K8sAPIServer`.
I have not installed and tested k3s, but I think changing the API server port to 443 should resolve this issue without any additional policy. I am including the reference links below:
[1] https://kubernetes.io/docs/concepts/security/controlling-access/#transport-security
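A minimal sketch of that port change for k3s, assuming the k3s server's `https-listen-port` option and its config file at `/etc/rancher/k3s/config.yaml` (both the option name and the file path should be verified against the k3s docs):

```yaml
# /etc/rancher/k3s/config.yaml -- assumed k3s server config file location
# Serve the Kubernetes API on 443 instead of the default 6443
https-listen-port: 443
```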
Thanks @vizeit, I will have a look at these configurations and see if it fixes my problem.
So I looked at your link, but the difference between 443 and 6443 was not really clear to me. I found https://github.com/kubernetes/website/issues/30725, which clarifies this. So from my understanding, 443 should be used as the exposed external port.
I see 2 possibilities:
@Ph0tonic Were you able to test with port 443 to confirm that it works with the existing core network policy?
I can reproduce this problem with Cilium on a bare-metal cluster. Disabling the `hub` NetPol in the Helm chart is my workaround so far.
Access to the API server from pods inside the cluster goes through `https://kubernetes.default:443`, and I can only curl that from within the `jupyterhub` container if the NetPol is disabled (and only then is JupyterHub working properly). The `kubernetes.default` service has a ClusterIP of `10.233.0.1`. The NetPol is quite hard to read since there are many overlapping rules. However, looking at it in https://editor.networkpolicy.io/, I cannot find a rule that would allow traffic to this IP (unfortunately, I can't post the image).
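For reference, a minimal sketch of the values override for the workaround mentioned above, assuming the chart exposes a `hub.networkPolicy.enabled` flag:

```yaml
# Helm values override: turn off the hub's NetworkPolicy entirely (workaround, not a targeted fix)
hub:
  networkPolicy:
    enabled: false
```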
Hi, sorry @vizeit for the late reply. I did not have the possibility and rights to change the cluster config from 6443 to 443, so I could not test it.
The solution which works for me is the following config:
```yaml
hub:
  networkPolicy:
    egress:
      - ports:
          - port: 6443
```
@Ph0tonic no problem
Add some documentation to clarify the need for this egress rule.
Trying to clarify this:
```yaml
egress:
  - to:
      - ipBlock:
          cidr: 0.0.0.0/0
          except:
            - 10.0.0.0/8
            - 172.16.0.0/12
            - 192.168.0.0/16
            - 169.254.169.254/32
```
The `kubernetes.default` domain name used by the hub always resolves to an IP from the private range, e.g. `10.96.0.1`. The public IP of the Kubernetes API endpoint may also be from one of the private IP ranges (see e.g. Anatomy of the kubernetes.default). Since these private ranges are excluded above, the default rule does not allow the `hub` to connect to the Kubernetes API.
The following egress rule mentioned by @Ph0tonic works, but it allows connections to any host on port 6443, not only the Kubernetes API:
```yaml
hub:
  networkPolicy:
    egress:
      - ports:
          - port: 6443
```
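If you want to stay with a plain NetworkPolicy, the rule can in principle be narrowed by also pinning the destination with an `ipBlock`; a sketch, assuming the API server endpoint address is looked up per cluster (e.g. with `kubectl get endpoints kubernetes`) and substituted for the placeholder below:

```yaml
hub:
  networkPolicy:
    egress:
      - to:
          - ipBlock:
              cidr: 192.0.2.10/32   # placeholder: replace with your cluster's API server endpoint address
        ports:
          - port: 6443
```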
Alternatively, a `CiliumNetworkPolicy` can be used to filter traffic specifically from the `hub` pod to the Kubernetes API:
```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-access-to-api-server
  namespace: jupyterhub-test
spec:
  egress:
    - toEntities:
        - kube-apiserver
  endpointSelector:
    matchLabels:
      app: jupyterhub
      component: hub
```
Also note that the same policy should be added for the `image-puller` and `user-scheduler` components, for which the chart does not specify any network policy. This is especially important when you want to add a default deny-all policy for the namespace.
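For context, a namespace-wide default deny-all policy of the kind mentioned above typically looks like this (the namespace name is illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: jupyterhub-test   # illustrative namespace
spec:
  podSelector: {}              # selects every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```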
Bug description
The default KubeSpawner is not able to spawn any user pod; it fails while attempting to create the PVC with a `TimeoutError`.
Expected behaviour
Should be able to spawn pods.
Analysis
After some research, I identified that my problem was linked to the `netpol` egress config of the hub. Here are a few Cilium logs of dropped packets:
The destination addresses belonged to `kube-apiserver`, `kube-proxy` and `kube-controller-manager`. To fix the issue, I identified that the problem lay in the egress and not in the ingress part, and managed to find a fix.
The issue is that the hub tries to access the `kube-apiserver` to generate a PVC, but the request is blocked by the egress configuration. I am surprised that @vizeit did not have this issue in #3167.
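The fix referred to above is presumably the same `hub.networkPolicy.egress` override shown earlier in this thread:

```yaml
hub:
  networkPolicy:
    egress:
      - ports:
          - port: 6443
```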
Your personal set up
I am using the latest v3.0.0 version of this Helm chart with Cilium.