gitpod-io / gitpod-eks-guide

This repo is being deprecated in favor of the single cluster reference architecture and the corresponding Terraform config.
https://www.gitpod.io/docs/configure/self-hosted/latest/reference-architecture/single-cluster-ref-arch

Used the instructions from this repo but can't run docker in workspace #25

Closed · hanfi closed this issue 2 years ago

hanfi commented 3 years ago

Bug description

Hello, I followed the instructions for installing on EKS using this repo.

Everything works as it's supposed to, but when I try a custom image in .gitpod.yml or simply run `docker run -ti ubuntu` in the workspace terminal, I get this error:

```
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: rootfs_linux.go:76: mounting "proc" to rootfs at "/proc" caused: mount through procfd: operation not permitted: unknown.
ERRO[0004] error waiting for container: context canceled
```
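For reference, the custom image is declared in .gitpod.yml roughly like this (the image reference below is just a placeholder, not my real one):

```yaml
# Minimal sketch of the .gitpod.yml; the image reference is a placeholder.
image: ghcr.io/example/custom-workspace:latest
```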

I tried rebuilding the builder image with different tags but can't make it work. :'(

Steps to reproduce

Run `make install` from this repository, then create a workspace and try to run a container.

Expected behavior

The container starts, and custom images can be used in the workspace.

Example repository

I have the issue with this repo: https://github.com/gitpod-io/template-docker-compose

Anything else?

No response

szab100 commented 3 years ago

I am having similar issues when trying to run anything in Docker within workspaces on a new deployment (installed two days ago at 442119fca4c2ee7cefe57e2057be5ffee6256fdf, using the west-2 region's AMI "ami-0a9aa973650d0c831"). I tried both the latest workspace-full base image with Docker v20.x and older tags with Docker v19.x.

Here is my comment with the exact outputs on the Docker v20 GH issue: https://github.com/gitpod-io/gitpod/issues/3051#issuecomment-932460520

I also tried Podman (via the install script), but it runs into CNI issues:

```
root@ws-b259b910-a141-4b87-b7fe-7ecfe48da696# export STORAGE_DRIVER=vfs
root@ws-b259b910-a141-4b87-b7fe-7ecfe48da696# podman run hello-world
Resolved "hello-world" as an alias (/etc/containers/registries.conf.d/000-shortnames.conf)
Trying to pull docker.io/library/hello-world:latest...
Getting image source signatures
Copying blob 2db29710123e done
Copying config feb5d9fea6 done
Writing manifest to image destination
Storing signatures
WARN[0002] unable to reset namespace: "Error switching to ns /proc/15214/task/15221/ns/net: operation not permitted"
ERRO[0002] error loading cached network config: network "podman" not found in CNI cache
WARN[0002] falling back to loading from existing plugins on disk
Error: error configuring network namespace for container 3f4d9d026d003dbcb9e0291838717292e4ebcaea7bf1b8151833635d9716fa51: error adding pod wonderful_bassi_wonderful_bassi to CNI network "podman": failed to create bridge "cni-podman0": could not add "cni-podman0": operation not permitted
```

@aledbf @ghuntley any thoughts? Has Docker ever worked on AWS?

aledbf commented 3 years ago

@szab100 I need to release a new version of the AMI and Gitpod images. The procfd mount issue is already fixed in main. I hope I can release an update next week (I am swamped right now).

szab100 commented 3 years ago

@aledbf Thank you for the quick reply! Next week would be awesome, but sure, I understand you are super busy. One additional note for this is that I have also tried to run minikube / k3s / kind inside the workspace (on SaaS gitpod.io, as Docker doesn't work on my AWS install), and those are failing with mount issues similar to the procfd one with Podman above:

mounting "cgroup" to rootfs at "/sys/fs/cgroup" caused: operation not permitted: unknown.

Can this be because of some missing securityContext capabilities on the ws-<..> pod, e.g. SYS_ADMIN? I know it is a whole different story from what this bug is about. One of our projects has Argo Workflows as a dependency, and my colleague is going to give a talk about it at KubeCon. It would be nice to give the project Gitpod support (preferably on gitpod.io) so interested developers could quickly have a dev environment up and running to test and contribute, but most probably this cannot happen, as KubeCon is less than two weeks away.
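Purely as an illustration of what I mean (hypothetical; I have not checked how ws-manager actually configures the workspace pod):

```yaml
# Hypothetical only: the kind of capability grant I am wondering about for the
# workspace container; not verified against what ws-manager really generates.
securityContext:
  capabilities:
    add:
      - SYS_ADMIN
```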

Either way, we need this fixed regardless of KubeCon, and even if local k8s support won't be possible, this Docker issue is blocking us from proposing Gitpod as a (self-hosted) cloud IDE solution at the company.

aledbf commented 3 years ago

One additional note for this is that I have also tried to run minikube / k3s / kind inside the workspace

This is still not possible. Please follow https://github.com/gitpod-io/gitpod/issues/4889. Some of the restrictions are related to the user namespace in which the workspace runs. Please check the feature description.

and those are failing with mount issues similar to the procfd with podman above:

Right now containerd is a hard requirement, and there is no support for Podman.

szab100 commented 3 years ago

Thank you @aledbf, Lorenzo's workaround from this issue helped me a lot! I made a fork where the initial workspace build time is greatly reduced by using a pre-baked middle-layer workspace image that has QEMU installed and the initial VM root filesystem fully baked. This middle-layer image could possibly be replaced by enabling prebuilds with this Dockerfile.

It is certainly not perfect, mostly because QEMU is slow to start and has quite a big overhead (it can only run in full software-emulation mode; it would need /dev/kvm passthrough from the node), but it seems to be a great workaround for running a local k8s cluster until we get full support for 'kind'.
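The middle layer itself is not much more than this (a rough sketch; the exact base tag and package list in my fork may differ):

```dockerfile
# Sketch of the middle-layer image idea: gitpod/workspace-full plus QEMU pre-installed,
# so the emulation tooling does not have to be set up on every workspace start.
# Baking the VM root filesystem would be an additional COPY/RUN step on top of this.
FROM gitpod/workspace-full:latest

USER root
RUN apt-get update \
 && apt-get install -y --no-install-recommends qemu-system-x86 qemu-utils \
 && rm -rf /var/lib/apt/lists/*

USER gitpod
```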

Hope to see your fixes for Docker on the AWS deployment soon. Thanks!

szab100 commented 3 years ago

Hi @aledbf, just a friendly reminder on this. Is this AWS/EKS installation method planned to be supported at the same level as the head / Helm-based releases? As I see it, the AMI images need to be maintained, and it installs pinned versions:

```
// TODO: switch to official gitpod.io build.
const version = "aledbf-mk3.68";
```

I am just trying to figure out which installation we should choose for the long term. The pure Helm-based install might work better for us (despite some initial ingress/DNS issues), however I think features like Docker support have lower-level requirements that simply wouldn't work on our own managed EKS clusters without using the custom AMI images.

aledbf commented 3 years ago

Is this AWS/EKS installation method planned to be supported at the same level as the head / Helm-based releases?

This repository contains two things: provisioning of cloud resources and installation of Gitpod. The first part will not change, i.e., we will continue using CDK. For the Gitpod part, we are working on a new installer: https://github.com/gitpod-io/gitpod/tree/main/installer. Unfortunately, it is not Helm-based. The main reason is that the Helm chart is too hard to maintain for the many different environments and options it contains.

without using the custom AMI images.

Now that containerd is an option in EKS, we could try to use a standard AMI image. That said, it is not on the roadmap right now. Our focus is the installer, to enable shorter release cycles of Gitpod self-hosted.

szab100 commented 3 years ago

@aledbf Thank you for the reply. Does this mean that, in order to get a new release in this repo (for direct AWS/EKS installs) that supports Docker-in-Docker, we need to wait until the installer is completed? If so, what is the ETA for the new installer?

About the installer, I see it is written in Go and still uses Helm charts under the hood for third_party components. Overall, I am a bit worried that dropping Helm or hard-coding the values used for third-party components will limit users' ability to customize.

Another very important area where I think the overall architecture and the installer should be improved is internal (intra-component) communication versus ingress / DNS names. I have seen many folks run into these issues already. The main problem is that Gitpod relies on NodePorts and a single vhost-based 'proxy' federation / reverse-proxy service and, even worse, it uses the main external DNS names for internal communication. Using NodePort services is restricted in most managed enterprise Kubernetes environments, like here at Intuit, so we must use Ingresses or ALB/ELB load balancers to expose services externally. The problem with these is that, in most cases, internal components cannot directly reach their external IPs from inside the cluster.

I found a workaround for this, but it is very hacky: I created both an internal and an external ingress/LB; both are assigned an FQDN hostname by AWS. The external DNS names (<dom>, *.<dom>, *.ws.<dom>) are CNAME records pointing to the external LB's FQDN hostname. For internal access, I needed to add rewrite rule(s) to the cluster's CoreDNS Corefile so that, when resolved from inside the cluster, the external DNS records' CNAMEs point to the internal ingress' FQDN hostname.
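The CoreDNS part is just a stanza like this in the Corefile ConfigMap (the domain and load balancer hostname are placeholders for my real ones; `answer auto` needs a reasonably recent CoreDNS, older versions need explicit `answer name` rules instead):

```
# Sketch only: rewrite in-cluster queries for the Gitpod domain to the internal ALB
# hostname; "answer auto" rewrites the response name back to the original query.
rewrite stop {
    name regex (.*\.)?gitpod\.example\.com internal-alb-1234567890.us-west-2.elb.amazonaws.com
    answer auto
}
```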

I believe all these hacks could be avoided, and Gitpod self-hosted would be much more compatible with various k8s environments, if it used the standard in-cluster service DNS names (e.g. proxy.<namespace>.svc.cluster.local) to communicate internally between components, dedicating the external DNS names to external user requests. I understand that, for simplicity, Gitpod uses the same nginx vhost reverse proxy internally as for external requests. Perhaps the simplest option is to add a tiny forwarding DNS server pod that forwards everything except the main (external) DNS entries, which it simply overrides and returns as a CNAME pointing to proxy.<ns>.svc.cluster.local, and to make all internal components use this forwarder as their DNS server. This way, any internal pod can talk to the others using the external DNS name, but users do not need to make custom, cluster-wide tweaks such as adding CoreDNS rewrite entries.

(Note: pointing the external DNS to proxy.<namespace>.svc.cluster.local only works when HTTPS is terminated by the 'proxy' service itself, e.g. with a Let's Encrypt cert. If HTTPS termination is done by a load balancer / ingress, internal communication should either use plain HTTP, or there should be an internal ALB/ELB ingress as well, with the external CNAMEs pointing to that ingress' FQDN hostname instead. Another way would be to create an additional service, e.g. 'proxy-int', with a self-signed cert generated during installation and trusted by all components, and point the wildcard CNAMEs to that.)

Hint: this is something where Eclipse Che is having major, blocking issues as well. Although their approach is now to use the internal k8s DNS names for internal communication, not all of their components support this yet. Please do not let Gitpod fall behind and become unavailable for enterprise k8s users who can't use the overly simplified NodePort + external DNS approach, when a convenient fix is relatively easy to implement on your side. Please route this request internally to whoever is responsible for this area, as I believe this is currently a major blocker for many potential users of Gitpod. Thanks.

aledbf commented 3 years ago

About the installer, I see it is written in Go and still uses Helm charts under the hood for third_party components. Overall, I am a bit worried that dropping Helm or hard-coding the values used for third-party components will limit users' ability to customize.

We can't continue using Helm. The amount of logic in the templates is one of the reasons we don't have more frequent releases.

The main problem is that Gitpod relies on NodePorts and

The NodePorts you see now will be removed once the installer is available, in particular the registry-facade one.

a single vhost-based 'proxy' federation / reverse-proxy service and, even worse

This will not change. Because of the dynamic Gitpod behavior, we cannot use an Ingress controller.

it uses the main external DNS names for internal communication.

This will be solved after https://github.com/gitpod-io/gitpod/pull/4901

Gitpod is using the same nginx vhost reverse proxy internally as for external requests.

We don't use nginx anymore. The entry point for a cluster is the proxy component, which uses Caddy instead. From there we send the traffic to ws-proxy, which is where all the dynamic routing behavior is implemented.

which it simply overrides and returns as a CNAME pointing to proxy.<ns>.svc.cluster.local, and to make all internal components use this forwarder as their DNS server.

We want to remove the use of services and DNS - https://github.com/gitpod-io/gitpod/issues/6325

About the installer, please follow https://github.com/gitpod-io/gitpod/milestone/16 for the MVP

szab100 commented 3 years ago

Sorry for the lengthy problem description / suggestion. I see (and hope) things are going in the right direction.

a single vhost-based 'proxy' federation / reverse-proxy service and, even worse

This will not change. Because of the dynamic Gitpod behavior, we cannot use an Ingress controller.

^^ I don't fully understand this though. Why would the dynamic DNS entries prevent us from using an Ingress controller? My AWS installation is working flawlessly with this setup (except the Docker-in-Docker, but that is due to the outdated AMI):

The three external DNS entries (gitpod.<domain>, *.gitpod.<domain>, *.ws.gitpod.<domain>) are CNAMEs, not A records, pointing to the single external DNS FQDN hostname provided by the Ingress controller. So any dynamic Gitpod hostnames work perfectly fine, since they all point to the single ingress hostname and its IP addresses. But for internal access, I needed to create an internal Ingress too (alb.ingress.kubernetes.io/scheme=internal), which provides a separate, externally resolvable DNS FQDN hostname; unlike the external (scheme=internet-facing) ingress' hostname, this one points to in-cluster IP addresses and is therefore accessible to the pods. You just need to make sure that, from inside the pods, the external DNS names resolve to CNAMEs pointing to this ingress hostname; that is what I needed the CoreDNS rewrite module for (but this can also be done locally in the namespace with a small forwarding DNS server pod that the other pods use as their DNS server).
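For reference, the internal Ingress is roughly the following (a sketch; the hostnames, namespace, certificate ARN, and backend port are placeholders for my real values):

```yaml
# Sketch of the internal ALB Ingress; the annotations are those of the AWS Load Balancer
# Controller, everything else is a placeholder for my actual setup.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: gitpod-proxy-internal
  namespace: gitpod
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internal
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS": 443}]'
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-west-2:111111111111:certificate/REPLACE-ME
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: proxy
                port:
                  number: 443
```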

So my currently working AWS setup looks like this:

  1. External requests: Users ==> DNS resolution of whatever.gitpod.<domain> ==> CNAME to the Ingress hostname ==> Ingress (external, ALB) ==> 'proxy' service

  2. Internal requests: Internal component ==> DNS resolution of whatever.gitpod.<domain> ==> CoreDNS spoofs the DNS response, rewriting the CNAME to point to the internal Ingress hostname ==> Ingress (internal, ALB) ==> 'proxy' service

Note: both the external and internal Ingresses terminate HTTPS using the same certificate ARN. If HTTPS termination is done by the 'proxy' service itself, as in the installation how-to, there is no need for this internal ingress, and CNAMEs resolved from inside the cluster can simply point to proxy.<namespace>.svc.cluster.local.

aledbf commented 3 years ago

Why would the dynamic DNS entries prevent us from using an Ingress controller? My AWS installation is working flawlessly with this setup (except the Docker-in-Docker, but that is due to the outdated AMI):

Most of the Gitpod traffic uses WebSockets, and some ingress controllers do not handle that well, meaning the connection is terminated on Ingress changes. Keep in mind we need to think about several scenarios, not only EKS.

About the DNS part, we want to avoid DNS and services in order to reduce the number of variables involved, remove potential DNS timeouts, and cut the additional load on the kube-proxy side (iptables).

But for internal access, I needed to create an internal Ingress too (alb.ingress.kubernetes.io/scheme=internal), which provides a separate, externally resolvable DNS FQDN hostname; unlike the external (scheme=internet-facing) ingress' hostname, this one points to in-cluster IP addresses and is therefore accessible to the pods

Do you mean traffic from inside the VPC?

szab100 commented 3 years ago

Do you mean traffic from inside the VPC?

Yes, the internal LB / ingress has IPs assigned from the cluster's VPC subnets in each AZ. I needed to add the VPC's overall CIDR (192.168.0.0/16 in my case) to LoadBalancerSourceRanges, or directly to its SecurityGroup, to let the pods reach it.
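Concretely, it is just this kind of entry (the CIDR is specific to my VPC; for an ALB Ingress the equivalent would be the alb.ingress.kubernetes.io/inbound-cidrs annotation):

```yaml
# Sketch: allow in-VPC clients to reach an internal load balancer fronted by a
# Service of type LoadBalancer; 192.168.0.0/16 is my VPC's CIDR, adjust to yours.
spec:
  type: LoadBalancer
  loadBalancerSourceRanges:
    - 192.168.0.0/16
```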

OK, sure, I understand if WebSockets have issues through LBs, although I personally had no such issues. If the connection drops are limited to Ingress changes, I think that's normal and should be expected; providing reasonable client-side resiliency should be enough to handle those. But I understand there could be various issues with other providers' exotic ingress controllers that could cause bigger disruptions.

Still, even just mentioning this alternative deployment option as a less-supported or unsupported one might help increase adoption, in case people are not allowed to expose NodePorts on their clusters due to internal policies. So far, everyone else I have seen from the community experimenting with Ingresses and Gitpod eventually gave up, and it took me several days to come up with this workaround. So let me know if you need anything; I am happy to share more details or contribute a how-to (even the CoreDNS part is very simple, just editing a ConfigMap).

szab100 commented 3 years ago

Hi @aledbf, going back to the original topic of this issue, can we please get a fix for this? I am tracking the new installer's milestone, but it seems to be progressing a bit slowly, which is fine for something completely new, but this is an important existing feature that is currently broken on AWS installs. If I understood correctly, it has nothing to do with the installer itself, but with the AMI images used and possibly the ws pods' configuration. I was planning to showcase Gitpod self-hosted at our bi-annual hackathon week (it starts next week) and push for company-wide adoption, but this seems to be a major blocker. I see others are waiting for this as well.

aledbf commented 3 years ago

@szab100 Last week we merged the refactoring I mentioned in a previous comment: https://github.com/gitpod-io/gitpod/pull/6462. With this change, we reduced the median startup time by half, got instant port exposure (no more waiting because "port was not found"), and massively reduced the load on the API server.

About the original issue, we need to release a new version. That's it. The problem is that a self-hosted release implies applying too many changes, which is why we need to wait for the installer. Right now, the installer is running for GKE with internal dependencies. By the end of the week it will be ready for external ones (registry and DB). With that done, the idea is to update the EKS and AKS guides.

I am sorry for the delay. We cannot tackle all the fronts simultaneously.

szab100 commented 3 years ago

@aledbf Awesome, no worries, thank you for the quick update! I love that there is so much progress overall; sorry I just could not connect these pieces like you just did. It makes complete sense now, and I can't wait to see these performance improvements as well. So you are saying GCP might be completed this week; I hope it's relatively quick to adapt it to EKS thereafter.