k3d-io / k3d

Little helper to run CNCF's k3s in Docker
https://k3d.io/

[BUG] k3d cluster create fails under the GitLab CI Docker executor when memory arguments are set #1087

Open jonathanyuechun opened 2 years ago

jonathanyuechun commented 2 years ago

What did you do

CI image: docker:20.10.2, with docker:20.10.2-dind as a service

Error with k3d v5.4.3

# gitlab command
gitlab-runner exec docker \
  --docker-privileged \
  --docker-network-mode host \
  --docker-volumes /var/run/docker.sock:/var/run/docker.sock \
  <mystage>

# stage command
# k3d version: v5.4.3
# note: the --servers-memory and --agents-memory flags cause the error
k3d cluster create mycluster \
  --image rancher/k3s:latest \
  --agents 1 \
  --servers 1 \
  --servers-memory 1g \
  --agents-memory 1g \
  --wait

Error with k3d v4.4.8

The same error happens with v4.4.8; it only goes away when the memory arguments are left unset:

## OK once these two flags are removed
--servers-memory 1g
--agents-memory 1g

What did you expect to happen

The cluster should be created with the requested memory limits applied, as it is when running k3d on a native Linux host.

Screenshots or terminal output

See the terminal output in the comments below.

Which OS & Architecture

Ubuntu 18.04 LTS

Which version of k3d

v5.4.3

v4.4.8

Which version of docker

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.7.1-docker)
  scan: Docker Scan (Docker Inc., v0.12.0)

Server:
 Containers: 32
  Running: 4
  Paused: 0
  Stopped: 28
 Images: 854
 Server Version: 20.10.12
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7b11cfaabd73bb80907dd23182b9347b4245eb5d
 runc version: v1.0.2-0-g52b36a2
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.4.0-117-generic
 Operating System: Ubuntu 18.04.6 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 16
 Total Memory: 31GiB
 Name: dev.server.io
 ID: LDEW:LQJR:BASJ:REX6:5RZA:JLLH:RELZ:KGU6:HRXE:JDIN:UDHW:EBFP
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support
  1. Is k3d v5+ supported when running in GitLab CI DinD?
  2. Are the cluster-create memory-limit flags (--servers-memory / --agents-memory) supported in GitLab CI DinD?

Note: Running k3d on a native Linux platform works as expected...

jonathanyuechun commented 2 years ago

This issue can easily be replicated using the .gitlab-ci.yml below:

stages:
  - test

variables:
  DOCKER_TLS_CERTDIR: ""
  DOCKER_DRIVER: overlay2

test-cluster:
  image: docker:stable
  variables:
    KUBECTL: v1.21.3
  stage: test
  services:
    - docker:stable-dind
  before_script:
    - apk add -U wget bash curl
    # install kubectl
    - wget -O /usr/local/bin/kubectl https://storage.googleapis.com/kubernetes-release/release/${KUBECTL}/bin/linux/amd64/kubectl
    - chmod +x /usr/local/bin/kubectl
    # install k3d
    - wget -q -O - https://raw.githubusercontent.com/rancher/k3d/main/install.sh | bash
    - k3d help
    - k3d version
    - k3d cluster create testgitlabci --agents 1 --servers 1 --servers-memory 1g --agents-memory 1g --wait
    - kubectl cluster-info
  script:
    # Display initial pods, etc.
    - kubectl get nodes -o wide
    - kubectl get pods --all-namespaces -o wide
    - kubectl get services --all-namespaces -o wide
  after_script: ['k3d cluster delete testgitlabci']

iwilltry42 commented 2 years ago

Hi @jonathanyuechun thanks for creating this issue!

I may have missed it, but can you please share the actual error you're seeing? Any logs would be helpful 👍

jonathanyuechun commented 2 years ago

@iwilltry42 no problem. Below are the logs:

INFO[0000] Prep: Network                                
INFO[0000] Created network 'k3d-metascheduler-1585478' (18a6fca94d647768ec6e24cf2aa72b99e3fce173d06960a6b3876411aaf1fc71) 
INFO[0000] Created volume 'k3d-metascheduler-1585478-images' 
INFO[0001] Creating node 'k3d-metascheduler-1585478-server-0' 
INFO[0010] Deleted 57fd0275f6374f899d2b61473b364cd792f80349d8abef83e0862851e4e10585 
INFO[0010] Creating node 'k3d-metascheduler-1585478-agent-0' 
INFO[0011] Deleted 1b487364c7e3ce551755bbfa9c99db67118326f93824ef22b32befc472bc3607 
INFO[0011] Creating LoadBalancer 'k3d-metascheduler-1585478-serverlb' 
INFO[0011] Starting cluster 'metascheduler-1585478'     
INFO[0011] Starting servers...                          
INFO[0011] Starting Node 'k3d-metascheduler-1585478-server-0' 
ERRO[0012] Failed to start node 'k3d-metascheduler-1585478-server-0' 
ERRO[0012] Failed Cluster Start: Failed to start server k3d-metascheduler-1585478-server-0: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: rootfs_linux.go:59: mounting "/root/.k3d/.k3d-metascheduler-1585478-server-0/meminfo" to rootfs at "/var/lib/docker/overlay2/536b6a4245ca21f398d17f9778a2f348f7d129d3e6e7fb89ab1de8aa9ee740ba/merged/proc/meminfo" caused: not a directory: unknown: Are you trying to mount a directory onto a file (or vice-versa)? Check if the specified host path exists and is the expected type 
ERRO[0012] Failed to create cluster >>> Rolling Back    
INFO[0012] Deleting cluster 'metascheduler-1585478'     
INFO[0012] Deleted k3d-metascheduler-1585478-server-0   
INFO[0012] Deleted k3d-metascheduler-1585478-agent-0    
INFO[0012] Deleted k3d-metascheduler-1585478-serverlb   
INFO[0012] Deleting cluster network 'k3d-metascheduler-1585478' 
INFO[0012] Deleting image volume 'k3d-metascheduler-1585478-images' 
FATA[0012] Cluster creation FAILED, all changes have been rolled back!

jonathanyuechun commented 2 years ago

@iwilltry42 any news on this, or a workaround?

iwilltry42 commented 2 years ago

Hey @jonathanyuechun, I just got back to this and did a quick Google search, which turned up this answer: https://stackoverflow.com/a/66470542/6450189. Can you please give the suggested change (related to DinD in GitLab CI) a try?

jonathanyuechun commented 2 years ago

@iwilltry42 the proposed solution is to not use DinD, which is not possible on my end...

iwilltry42 commented 2 years ago

Googling around, this issue comes up everywhere for GitLab CI DinD runners :thinking: I guess it is because k3d runs in one container (c1) but talks to the Docker daemon in another (dind) container (c2). In that case, k3d creates the faked meminfo file in c1's filesystem, which is then not available to the daemon in c2 when it tries to bind-mount it into the node. I don't see many ways around that, except for approaches that do not rely on mounting a file (i.e. changing the k3d code). You could however do the following ugly trick, writing the fake meminfo somewhere the daemon can see and mounting it yourself:
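A minimal sketch of such a trick, pieced together from the cat << EOF and --volume hints in the comments below; the path, the meminfo values, and the assumption that the path is visible to the dind service container are all illustrative, not a tested recipe:

# write a fake meminfo to a location the Docker daemon (the dind service) can resolve
cat << 'EOF' > /shared/fake-meminfo
MemTotal:       1048576 kB
MemFree:        1048576 kB
MemAvailable:   1048576 kB
EOF

# bind-mount it over /proc/meminfo in each node yourself, instead of passing
# --servers-memory / --agents-memory and letting k3d create the file in c1
k3d cluster create mycluster \
  --servers 1 --agents 1 \
  --volume /shared/fake-meminfo:/proc/meminfo@server:0 \
  --volume /shared/fake-meminfo:/proc/meminfo@agent:0 \
  --wait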

jonathanyuechun commented 2 years ago

Hi, thanks for the hack; that is what I thought of at the beginning, but sadly I don't have direct access to the runner itself. How about k3d itself integrating a new command:

k3d fake --agents 1 --servers 1 --servers-memory 1g --agents-memory 1g > myFakeMemInfo
k3d cluster create --agents 1 --servers 1 --servers-memory 1g --agents-memory 1g --meminfo-path myFakeMemInfo

Would this make sense?

iwilltry42 commented 2 years ago

I guess, as a feature request, we could change the code to not use a file mount, but rather write the file directly into the container filesystem using a pre-start hook. I'll add it to a future milestone.
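Roughly sketched in Docker CLI terms (k3d talks to the Docker API directly, so this is only an illustration of the idea; the container name and staging path are made up). Since procfs is mounted over /proc when the container starts, the file would be staged elsewhere and bind-mounted from inside the privileged node container:

# create the node container without starting it
docker create --privileged --name k3d-mycluster-server-0 rancher/k3s:latest
# pre-start hook: copy the fake meminfo into the container's filesystem, so no
# host path is involved (works even when the daemon is a remote dind service)
docker cp fake-meminfo k3d-mycluster-server-0:/tmp/meminfo
docker start k3d-mycluster-server-0
# then, inside the node's entrypoint (k3d nodes run privileged):
mount --bind /tmp/meminfo /proc/meminfo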

How would your proposal help you, though? You'd still have to get the fake meminfo file onto the Docker host (the GitLab service container), right? If you can do that, then you could just as well do cat << EOF > /file/on/servicecontainer ... EOF in your GitLab CI pipeline and use the --volume flag as proposed above :thinking:

jonathanyuechun commented 2 years ago

Hi, sorry for the late reply!

Hmm, to answer your question: I probably messed up my thinking :(

jonathanyuechun commented 2 years ago

Hi again,

For reference, the same bug can easily be reproduced using the Dev Containers tooling in VS Code:

https://code.visualstudio.com/docs/remote/containers