gitpod-io / gitpod

The developer platform for on-demand cloud development environments to create software faster and more securely.
https://www.gitpod.io
GNU Affero General Public License v3.0

`docker-compose up` often fails with "operation not permitted" #15660

Open Watercycle opened 1 year ago

Watercycle commented 1 year ago

Bug description

When running docker-compose up (i.e. using Docker), random services fail to start about 5-10% of the time, with errors along the lines of the following:

ERROR: for s56 Cannot start service s56: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "proc" to rootfs at "/proc": mount proc:/proc (via /proc/self/fd/6), flags: 0xe: operation not permitted: unknown

ERROR: for s50 Cannot start service s50: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "sysfs" to rootfs at "/sys": mount sysfs:/sys (via /proc/self/fd/6), flags: 0xf: operation not permitted: unknown
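
To see which mounts fail and how often, the compose output can be tallied. The following is a minimal sketch, not part of Gitpod or Docker: the function name `countMountFailures` and the regular expression are assumptions written against the two error lines quoted above, so other error shapes would not be counted.

```go
package main

import (
	"fmt"
	"regexp"
)

// mountErr matches lines shaped like the errors quoted in this issue and
// captures the service name and the mount source ("proc", "sysfs", ...).
// The pattern is an assumption derived from those two sample lines only.
var mountErr = regexp.MustCompile(
	`ERROR: for (\S+) .*error mounting "([^"]+)" to rootfs at "[^"]+".*operation not permitted`)

// countMountFailures returns how many matching errors were seen per mount source.
func countMountFailures(output string) map[string]int {
	counts := make(map[string]int)
	for _, m := range mountErr.FindAllStringSubmatch(output, -1) {
		counts[m[2]]++
	}
	return counts
}

func main() {
	// Shortened versions of the two errors from the bug description.
	output := `ERROR: for s56 Cannot start service s56: error during container init: error mounting "proc" to rootfs at "/proc": mount proc:/proc (via /proc/self/fd/6), flags: 0xe: operation not permitted: unknown
ERROR: for s50 Cannot start service s50: error during container init: error mounting "sysfs" to rootfs at "/sys": mount sysfs:/sys (via /proc/self/fd/6), flags: 0xf: operation not permitted: unknown`
	for source, n := range countMountFailures(output) {
		fmt.Printf("%s: %d\n", source, n)
	}
}
```

Feeding it the captured output of several docker-compose up runs would give a rough failure count per mount source.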

Next Actions

  1. [ ] Verify that runc-facade's retry mechanism works. We already have a retry mechanism to work around the seccomp notify issue, but it may not be working properly: https://github.com/gitpod-io/gitpod/blob/6a852abc01b1ff2d45e032ddf6805390219c50b9/components/docker-up/runc-facade/main.go#L93-L101 For example, count how many runc or seccomp notify related errors occur on the workspace when this error appears. If the retry is working properly, there should be more than one such error.
  2. [ ] Address any problem found in the first step
  3. [ ] Try increasing the number of retries if the retry mechanism is found to be working

Steps to reproduce

  1. Create a GitPod workspace running the gitpod/workspace-full image.
  2. Create a docker-compose.yml file with the file contents below.
    docker-compose.yml
version: '3.9'

services:
  s1:
    image: nginx:latest
  s2:
    image: nginx:latest
  s3:
    image: nginx:latest
  s4:
    image: nginx:latest
  s5:
    image: nginx:latest
  s6:
    image: nginx:latest
  s7:
    image: nginx:latest
  s8:
    image: nginx:latest
  s9:
    image: nginx:latest
  s10:
    image: nginx:latest
  s11:
    image: nginx:latest
  s12:
    image: nginx:latest
  s13:
    image: nginx:latest
  s14:
    image: nginx:latest
  s15:
    image: nginx:latest
  s16:
    image: nginx:latest
  s17:
    image: nginx:latest
  s18:
    image: nginx:latest
  s19:
    image: nginx:latest
  s20:
    image: nginx:latest
  s21:
    image: nginx:latest
  s22:
    image: nginx:latest
  s23:
    image: nginx:latest
  s24:
    image: nginx:latest
  s25:
    image: nginx:latest
  s26:
    image: nginx:latest
  s27:
    image: nginx:latest
  s28:
    image: nginx:latest
  s29:
    image: nginx:latest
  s30:
    image: nginx:latest
  s31:
    image: nginx:latest
  s32:
    image: nginx:latest
  s33:
    image: nginx:latest
  s34:
    image: nginx:latest
  s35:
    image: nginx:latest
  s36:
    image: nginx:latest
  s37:
    image: nginx:latest
  s38:
    image: nginx:latest
  s39:
    image: nginx:latest
  s40:
    image: nginx:latest
  s41:
    image: nginx:latest
  s42:
    image: nginx:latest
  s43:
    image: nginx:latest
  s44:
    image: nginx:latest
  s45:
    image: nginx:latest
  s46:
    image: nginx:latest
  s47:
    image: nginx:latest
  s48:
    image: nginx:latest
  s49:
    image: nginx:latest
  s50:
    image: nginx:latest

  3. Run the following command in the same directory.
    docker-compose up -d; docker-compose down -v --timeout=0
  4. Observe the errors that appear in the terminal (screenshot included for reference).
Errors Example Screenshot

While this isn't what a typical compose file looks like, this has helped consistently mimic what members of my team frequently report since we use GitPod to quickly spin up our platform on feature branches. It's quite obnoxious having to restart the platform when core startup services fail.

Workspace affected

No response

Expected behavior

There should be no "operation not permitted" errors. The Docker services should successfully start and enter into a healthy state. Running the "Steps to reproduce" locally on Ubuntu 22, I'm unable to reproduce these startup failures. It only ever happens in GitPod, which is why I'm inclined to file the issue here instead of the docker/compose repo.

Example repository

To be clear, this seems to impact all workspaces. Here's a snapshot with the compose file above using the Haskell sample workspace: https://gitpod.io#snapshot/7b230be0-0aa0-4242-bcff-d9a2229afddd

Anything else?

Worth noting:

  1. When the nginx image in the example compose file is switched with alpine, these errors still happen, albeit seemingly less frequently.
  2. I tried installing the latest version of both Docker and Docker Compose, but that didn't seem to make a difference.
  3. I also tried running docker-compose up as the root user in a sudo -i session.

axonasif commented 1 year ago

Hi @Watercycle, thank you for writing down such a detailed bug report 🙏 I'm adding it to the appropriate team's inbox 👋

utam0k commented 1 year ago

FYI: https://github.com/gitpod-io/gitpod/issues/12365 We mitigated it, but we have not been able to solve it 100% :sob:

utam0k commented 1 year ago

It may be possible to control the number of retries with environment variables.

kylos101 commented 1 year ago

@utam0k @Furisto aside from https://github.com/gitpod-io/gitpod/issues/12365, how else might we be able to help? I'm going to add this to breakdown for now, so that it's socialized during refinement next week.

Edit: the only thing I can think of is moving to Kata to simplify the runtime, but I know that's far out.

Furisto commented 1 year ago

Likely caused by issues with seccomp notify, which are very hard to debug. Apart from @utam0k's suggestion to increase the number of retries, or switching to Kata where we would not need seccomp anymore, I do not see another good solution at the moment.

utam0k commented 1 year ago

@kylos101 @Furisto I wonder whether runc-facade's retry mechanism is working at all. Unfortunately, I didn't find these error messages when I reproduced this error on the preview-env. Realizing that, I created the preview env with this branch. https://github.com/gitpod-io/gitpod/blob/6a852abc01b1ff2d45e032ddf6805390219c50b9/components/docker-up/runc-facade/main.go#L98-L101

So how about making sure whether or not it works fine as a first step?

Furisto commented 1 year ago

> So how about making sure whether or not it works fine as a first step?

Sounds reasonable

kylos101 commented 1 year ago

Perfect! @utam0k please update the issue description accordingly? :pray:

utam0k commented 1 year ago

I have added a Next Actions section to the description.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.