nayuta-ai / cloud_storage


CrashLoopBackOff #1

Closed nayuta-ai closed 2 years ago

nayuta-ai commented 2 years ago

My Pod is not working: it keeps restarting with CrashLoopBackOff. I think the problem is that the container's process terminates right after startup instead of staying up. If you know anything about this, I would appreciate you sharing your knowledge.

Dockerfile

FROM nvidia/cuda:10.2-cudnn8-devel-ubuntu18.04

RUN apt-get update
RUN apt-get install -y python3 python3-pip
RUN python3 -m pip install --upgrade pip
RUN python3 -m pip install --upgrade Pillow
RUN pip3 install torch torchvision torchaudio

WORKDIR /work
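
Note that the Dockerfile above sets no CMD or ENTRYPOINT, so the container runs the base image's default command, which exits immediately when no TTY is attached. That matches the crash loop shown below. One way to make the image self-sufficient is to bake a long-running command into the image itself (a sketch; combine or replace the keep-alive with a real entry point such as a training script once one exists):

```dockerfile
FROM nvidia/cuda:10.2-cudnn8-devel-ubuntu18.04

RUN apt-get update && apt-get install -y python3 python3-pip
RUN python3 -m pip install --upgrade pip Pillow
RUN pip3 install torch torchvision torchaudio

WORKDIR /work

# Keep a foreground process running so Kubernetes does not treat
# the container as exited. Replace with the real entry point
# (e.g. a training or serving script) when one exists.
CMD ["tail", "-f", "/dev/null"]
```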

YAML file

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: app
  name: app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - image: yuta42173/pytorch-app:latest
          name: pytorch
          ports:
            - containerPort: 8080
          resources:
            limits:
              cpu: 50m
          volumeMounts:
            - name: data-volume
              mountPath: /work/data
      volumes:
        - name: data-volume
          hostPath:
            path: /home/yuta/cloud_storage/data

cmd

$ kubectl get pods
NAME                   READY   STATUS             RESTARTS      AGE
app-7c65b9bf9b-h6wcb   0/1     CrashLoopBackOff   5 (28s ago)   3m37s

$ kubectl describe pods app-7c65b9bf9b-h6wcb
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  3m38s                  default-scheduler  Successfully assigned default/app-7c65b9bf9b-h6wcb to minikube
  Normal   Pulled     3m35s                  kubelet            Successfully pulled image "yuta42173/pytorch-app:latest" in 2.451595858s
  Normal   Pulled     3m32s                  kubelet            Successfully pulled image "yuta42173/pytorch-app:latest" in 2.318305813s
  Normal   Pulled     3m14s                  kubelet            Successfully pulled image "yuta42173/pytorch-app:latest" in 2.378239452s
  Normal   Created    2m45s (x4 over 3m35s)  kubelet            Created container pytorch
  Normal   Started    2m45s (x4 over 3m35s)  kubelet            Started container pytorch
  Normal   Pulled     2m45s                  kubelet            Successfully pulled image "yuta42173/pytorch-app:latest" in 2.358181484s
  Warning  BackOff    2m16s (x7 over 3m31s)  kubelet            Back-off restarting failed container
  Normal   Pulling    2m3s (x5 over 3m37s)   kubelet            Pulling image "yuta42173/pytorch-app:latest"
  Normal   Pulled     2m                     kubelet            Successfully pulled image "yuta42173/pytorch-app:latest" in 2.40772084s
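
When debugging a CrashLoopBackOff like this, the container's exit code and previous logs usually identify the cause. For example (a sketch using the same Pod name as above):

```shell
# Show the exit code of the last terminated container instance.
# Exit code 0 means the process simply finished, i.e. the
# container had nothing long-running to do.
kubectl get pod app-7c65b9bf9b-h6wcb \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

# Logs from the previous (crashed) container instance, if any.
kubectl logs app-7c65b9bf9b-h6wcb --previous
```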

Reference

https://www.scsk.jp/sp/sysdig/blog/container_monitoring/crashloopbackoffkubernetes_crashloopbackoff.html

nayuta-ai commented 2 years ago

I found that the Pod started up but exited immediately because the container had no long-running process. I solved it by adding a "command" parameter, as below, to give the container something to run. The example keeps the container alive by following /dev/null with tail -f; since /dev/null never produces new output, the process blocks forever and the Pod stays running.

command: ["tail", "-f", "/dev/null"]

I added the above line under "spec.template.spec.containers[0].command":

spec:
    template:
        spec:
            containers:
                - command: ["tail", "-f", "/dev/null"]
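
The same keep-alive could also be baked into the image with a Dockerfile CMD instead of the manifest; either approach works. `sleep infinity` is a common alternative to `tail -f /dev/null` (a sketch, assuming GNU coreutils is present in the image, which it is for Ubuntu-based images like this one):

```yaml
spec:
  template:
    spec:
      containers:
        - name: pytorch
          image: yuta42173/pytorch-app:latest
          # Any command that blocks forever keeps the Pod alive:
          command: ["sleep", "infinity"]
```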