gitpod-io / gitpod

The developer platform for on-demand cloud development environments to create software faster and more securely.
https://www.gitpod.io
GNU Affero General Public License v3.0

sometimes some docker services crash: OCI runtime failed: container_linux.go #5945

Closed: konne closed this issue 2 years ago

konne commented 3 years ago

Bug description

We sometimes have the issue that when starting multiple containers with docker-compose up, some of them crash with the following error, even in a freshly started workspace. In most cases, though, it works perfectly without any change.

[screenshot of the error output]

Maybe you have the chance to find more in your logs: it happened on 30 Sept at 8:17 CET in workspace salmon-planarian-k3xf4ba4.ws-eu18.

Steps to reproduce

unclear

Expected behavior

No response

Example repository

No response

Anything else?

No response

iQQBot commented 3 years ago

It does happen occasionally; re-executing the command is enough to resolve it.

csweichel commented 3 years ago

/schedule

roboquat commented 3 years ago

@csweichel: Issue scheduled in the workspace team (WIP: 0)

In response to [this](https://github.com/gitpod-io/gitpod/issues/5945#issuecomment-931147290):

> /schedule

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
princerachit commented 3 years ago

I looked around to see if I could find something relevant and stumbled upon this issue. I found this comment relatable. That issue is closed, and a related issue exists here.

Can you share the Docker Compose file and the Dockerfiles you use in this project? If your issue looks similar to the one mentioned in the previous paragraph, let us know.

konne commented 3 years ago

@princerachit Thanks for linking the issue. I had already read it before filing my issue:

  1. We have a closed-source project, so I will share the docker-compose file in slightly anonymized form.
  2. For me this is a different issue, because we have never seen it on any other Docker cluster, and it only happens sometimes; in most cases it works perfectly.

We run it with the following command and the two files added below. Important: they are anonymized, so they cannot be tested 1:1 on your side; you can just look through them.

```sh
DOCKER_BUILDKIT=1 COMPOSE_DOCKER_CLI_BUILD=1 docker-compose -f docker-compose.yml -f docker-compose.build.yml up
```

docker-compose.yml

```yaml
version: '3.7'
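# The x-* blocks below are Compose extension fields: YAML anchors (&name)
# that are merged into services later via "<<: *name" to avoid repetition.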

x-core-external-services: &core-external-services
  postgres:
    container_name: postgres
    image: postgres:13.4-alpine
    ports:
      - 5432:5432
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -U postgres']
      interval: 2s
      timeout: 2s
      retries: 5
    volumes:
      - product-postgres-data:/var/lib/postgresql/data

  redis:
    container_name: redis
    image: redis
    command: redis-server --save ''
    ports:
      - 6379:6379
    tmpfs:
      - /data

  nats:
    container_name: nats
    image: nats:2.4.0
    ports:
      - 4222:4222
      - 8222:8222

  kes:
    container_name: kes
    image: minio/kes
    healthcheck:
      interval: 2s
      timeout: 2s
      retries: 5
    volumes:
      - ... KES MAPPING
    ports:
      - 7373:7373

x-service-depends-on-external-names: &default-service-depends-on-external-names
  postgres:
    condition: service_healthy
  nats:
    condition: service_started
  redis:
    condition: service_started

x-service-depends-on-external: &default-service-depends-on-external
  depends_on:
    <<: *default-service-depends-on-external-names

x-service: &default-service
  <<: *default-service-depends-on-external
  mem_limit: 1024m
  mem_reservation: 128M
  pull_policy: always
  env_file:
    - ./.product.local.env
    - ./.env

x-service-api-name: &default-service-api-name 'api'
x-service-api: &default-service-api
  <<: *default-service
  container_name: *default-service-api-name
  image: company/product-api
  expose:
    - 80
  ports:
    - 3333:80
    - 30227:30227

x-service-auth-name: &default-service-auth-name 'auth'
x-service-auth: &default-service-auth
  <<: *default-service
  container_name: *default-service-auth-name
  image: company/product-auth
  ports:
    - 30233:30233

x-service-object-name: &default-service-object-name 'object'
x-service-object: &default-service-object
  <<: *default-service
  container_name: *default-service-object-name
  image: company/product-object
  ports:
    - 30234:30234

x-service-web-name: &default-service-web-name 'web'
x-service-web: &default-service-web
  <<: *default-service
  container_name: *default-service-web-name
  image: company/product-web

x-service-auth-login-name: &default-service-auth-login-name 'auth-login'
x-service-auth-login: &default-service-auth-login
  <<: *default-service
  container_name: *default-service-auth-login-name
  image: company/product-auth-login

x-service-da-engine-name: &default-service-engine-name 'engine'
x-service-da-engine: &default-service-engine
  <<: *default-service
  container_name: *default-service-da-engine-name
  image: company/product-engine
  volumes:
    - ./tools/kes/certs/client.cert:/certs/client.cert
    - ./tools/kes/certs/client.key:/certs/client.key
  ports:
    - 8333:8333
  depends_on:
    <<: *default-service-depends-on-external-names
    kes:
      condition: service_healthy
  environment:
    - ... ENV SETTINGS

services:
  <<: *core-external-services

  nginx:
    container_name: nginx
    image: nginx:latest
    volumes:
      - ./tools/nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      web:
        condition: service_started
      api:
        condition: service_started
    ports:
      - ${product_NGINX_PORT:-80}:80

  web:
    <<: *default-service-web

  api:
    <<: *default-service-api

  auth:
    <<: *default-service-auth

  object:
    <<: *default-service-object

  engine:
    <<: *default-service-engine

networks:
  default:
    name: product-network

volumes:
  product-volume:
    name: product-volume
    driver: local
    driver_opts:
      type: none
      o: bind
      device: '${PWD}'
  product-postgres-data:
    name: product-postgres-data
```

docker-compose.build.yml

```yaml
version: '3.7'
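# This override file swaps each service's prebuilt image for a plain
# node:14-alpine container that runs the app from the shared source volume.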

x-service-node-build: &default-service-node-build
  image: 'node:14-alpine'
  working_dir: /usr/src/app
  volumes:
    - product-volume:/usr/src/app
  env_file:
    - ./.product.local.env
    - ./.env

services:
  api:
    <<: *default-service-node-build
    command: npm run start api

  auth:
    <<: *default-service-node-build
    command: npm run start auth

  object:
    <<: *default-service-node-build
    command: npm run start object

  # ... just more services in the same way

  engine:
    build:
      context: .
      dockerfile: ./apps/engine/Dockerfile

  web:
    <<: *default-service-node-build
    command: npm start -- web --port 80 --host 0.0.0.0 --disableHostCheck
    mem_limit: 4096m
    environment:
      - product_API_HOST=api
    ports:
      - 3380:80
    expose:
      - 80
```

princerachit commented 3 years ago

@konne Thanks for sharing the files. Do you have the logs of the failure stored somewhere? Can you redact sensitive info and share the rest of the log with us? If you don't have the logs and you see this issue again, please take a dump of the log and share it with us.

konne commented 3 years ago

@princerachit I don't have the logs, and unfortunately I have already deleted the workspace. Which logs do you need, and where can I find them?

It happens at least once a week, so I just need this info and then I will share the logs.

princerachit commented 3 years ago

@konne I have added more logging to debug this issue further from our side. Once this PR is merged and deployed, we will have more visibility into what is happening.

princerachit commented 3 years ago

I am closing this issue. Let me know if you see this again. We have appropriate logs to investigate.

konne commented 3 years ago

@princerachit we see this issue nearly every day now that we have expanded the user base step by step. Please leave it open and keep working on this topic. If needed, I can also ask the team to comment here every time it occurs.

princerachit commented 3 years ago

Thanks @konne. Whenever you see the issue, we need the following information: the affected workspace name, the time it happened (with timezone), and the command you ran.

konne commented 3 years ago

26.10.2021 9:30 (CET)

azure-alligator-ftal0uzb.ws-eu17

docker-compose up

[screenshot of the error]

konne commented 3 years ago

happened again:

crimson-silverfish-87dr8961

[screenshot of the error]

konne commented 3 years ago

happened again:

blush-vole-efcfmm4f.ws-eu17

[screenshot of the error]

jmls commented 2 years ago

we're having this happen on an increasing basis. If we do `docker-compose down` followed by `docker-compose up`, it often solves the issue, so it can't be related to the containers or the configuration.

kylos101 commented 2 years ago

Hi @konne, I hope you're well! May I ask: if you're able to share the name of a workspace and a time when you experienced this issue, that would be great. Also, if you're able to share a public repo where I can reliably recreate the problem, that would be super. I made a brief attempt here to recreate the problem, but was not successful.

konne commented 2 years ago

@kylos101 I always added the workspace IDs from my side, and each comment was posted about 2 minutes after it happened, if you look at the comments. No, we have no public repo, but I can set up a TeamViewer, Zoom, or whatever meeting with you and show it to you. We cannot reliably reproduce it; during normal usage it happens around 2-3 times a week per developer.

kylos101 commented 2 years ago

Hey @konne :wave:, I'm sorry, I should have explained why I asked for the data again.

We set up a new tracing system in mid-December 2021 :bulb:. Old data was not migrated to the new tracing system, therefore I cannot search for older workspaces. :disappointed:

If the issue happens again and you get a moment, please share the related workspace? :pray: I apologize, it is frustrating to share the same thing repeatedly and not get a desirable outcome. However, I am certain the new data will be helpful. Let us know?

kylos101 commented 2 years ago

Hi @konne, I just sent you an email asking for more information. This is intentional: I do not want you to share a workspace snapshot URL in this issue. Let us know if that's possible?

konne commented 2 years ago

@kylos101 sorry, I don't work with Gitpod day to day, so I rely a bit on developer feedback. Here is the first entry:

purple-dodo-gqnfpk03; time: 11:15 CET, 17.01.2022

[screenshot of the error]

csweichel commented 2 years ago

We just rolled out what we hope is a fix for this issue (#7657). I'll close this issue for now. Please re-open/file a new one if this problem persists.

The major contributing factor was a timing/race issue in the libseccomp golang bindings we're using. This caused the mount syscall interception via seccomp-notify to fail in highly concurrent/heavy load scenarios.
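
For context, here is a minimal sketch of what mount-syscall interception via seccomp user notifications ("seccomp-notify") looks like with the libseccomp golang bindings. This is an illustration of the mechanism only, not Gitpod's actual handler from #7657; it assumes github.com/seccomp/libseccomp-golang with the user-notification API (ActNotify, NotifReceive, NotifRespond), and error handling is simplified.

```go
// Minimal sketch: intercept mount(2) via seccomp user notification.
// Illustration only; not Gitpod's actual implementation. Assumes
// github.com/seccomp/libseccomp-golang with user-notification support.
package main

import (
	"log"

	seccomp "github.com/seccomp/libseccomp-golang"
)

func main() {
	// Allow everything by default; only mount(2) triggers a notification.
	filter, err := seccomp.NewFilter(seccomp.ActAllow)
	if err != nil {
		log.Fatal(err)
	}
	mount, err := seccomp.GetSyscallFromName("mount")
	if err != nil {
		log.Fatal(err)
	}
	if err := filter.AddRule(mount, seccomp.ActNotify); err != nil {
		log.Fatal(err)
	}
	if err := filter.Load(); err != nil {
		log.Fatal(err)
	}

	// The notification fd is only available after Load(). Handing it to a
	// supervisor and servicing it under heavy concurrency is where timing
	// issues like the one described above can creep in.
	fd, err := filter.GetNotifFd()
	if err != nil {
		log.Fatal(err)
	}

	for {
		req, err := seccomp.NotifReceive(fd)
		if err != nil {
			log.Printf("receive: %v", err)
			continue
		}
		// Re-validate the request ID: the calling thread may have died
		// between NotifReceive and our response (a classic race).
		if err := seccomp.NotifIDValid(fd, req.ID); err != nil {
			continue
		}
		// Tell the kernel to let the original mount(2) proceed unchanged.
		resp := &seccomp.ScmpNotifResp{
			ID:    req.ID,
			Error: 0,
			Val:   0,
			Flags: seccomp.NotifRespFlagContinue,
		}
		if err := seccomp.NotifRespond(fd, resp); err != nil {
			log.Printf("respond: %v", err)
		}
	}
}
```

The NotifIDValid check hints at the kind of race involved: between receiving a notification and responding to it, the supervised task can disappear, and any unsynchronized handling of the notification fd can fail under load, which matches the "highly concurrent/heavy load" failure mode described above.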