malixian opened this issue 5 years ago
Hi @malixian,
Please provide the following information:
go version
sycri version
singularity version
kubectl version
BTW, do you have docker running? I see you require docker.service
in the kubelet service, which is not needed if you use another runtime.
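For example, when singularity-cri is the only runtime on the node, the unit dependencies can look roughly like this (a sketch; the sycri.service name is assumed from the journalctl commands used later in this thread):

[Unit]
Description=Kubernetes Kubelet
Documentation=https://github.com/GoogleCloudPlatform/kubernetes
# no After=docker.service / Requires=docker.service needed for a non-docker runtime
After=sycri.service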
Thanks @sashayakovtseva, here is my environment:
apiVersion: v1
kind: Pod
metadata:
  name: test-arm64
  namespace: default
spec:
  containers:
  - name: test-arm64
    image: cloud.sylabs.io/malixian/default/test-arm64:latest
  nodeSelector:
    beta.kubernetes.io/arch: arm64
Running journalctl -u kubelet -e, I find these errors:
1. provider.go:116] Refreshing cache for provider: *credentialprovider.defaultDockerConfigProvider
2. SyncLoop (PLEG): "test-arm64_default(d938b836-e003-11e9-9f5c-ac1f6bac1d10)", event: &pleg.PodLifecycleEvent{ID:"d938b836-e003-11e9-9f5c-ac1f6bac1d10", Type:"ContainerDied", Data:"29950f6fcfc74d1e6268caba276c898280949621a5e31fc9e9ff91e50b4a360e"}
Running journalctl -u sycri -e, I find these errors:
Sep 26 14:20:40 localhost sycri[28374]: E0926 14:20:40.218152 28374 main.go:276] /runtime.v1alpha2.ImageService/PullImage
Sep 26 14:20:40 localhost sycri[28374]: Request: {"image":{"image":"cloud.sylabs.io/malixian/default/test-arm64:latest"}}
Sep 26 14:20:40 localhost sycri[28374]: Response: null
Sep 26 14:20:40 localhost sycri[28374]: Error: rpc error: code = Internal desc = could not get cloud.sylabs.io/malixian/default/test-arm64:latest image metadata: could not get library image info: error making request to server:
Sep 26 14:20:40 localhost sycri[28374]: Get https://library.sylabs.io/v1/images/malixian/default/test-arm64:latest: net/http: TLS handshake timeout
Sep 26 14:34:31 localhost sycri[28374]: Calico CNI releasing IP address
Sep 26 14:34:31 localhost sycri[28374]: Calico CNI deleting device in netns /var/run/singularity/pods/19d9118ee6ac1e19ce475a4650ae0021f752c635009b0888b3f5d03b0a803f2a/namesp
Sep 26 14:34:31 localhost sycri[28374]: Calico CNI deleted device in netns /var/run/singularity/pods/19d9118ee6ac1e19ce475a4650ae0021f752c635009b0888b3f5d03b0a803f2a/namespa
Sep 26 14:34:36 localhost sycri[28374]: Calico CNI IPAM request count IPv4=1 IPv6=0
Sep 26 14:34:36 localhost sycri[28374]: Calico CNI IPAM handle=k8s-pod-network.1a08e33abd1973f28f4b6d0ec580ca335376eaba7e052f29dc94c93a10bb6696
Sep 26 14:34:36 localhost sycri[28374]: Calico CNI IPAM assigned addresses IPv4=[172.22.20.159] IPv6=[]
Sep 26 14:34:36 localhost sycri[28374]: Calico CNI using IPs: [172.22.20.159/32]
Sep 26 14:35:00 localhost sycri[28374]: E0926 14:35:00.032013 28374 container.go:177] Could not fetch container 393b15294a318ec6dd6fcb96a02a663345ec346f937e904982fa6acfd02
Sep 26 14:35:30 localhost sycri[28374]: E0926 14:35:30.725947 28374 pod.go:164] Could not update pod state: could not get pod state: could not query state: FATAL: no con
Sep 26 14:45:46 localhost sycri[28374]: E0926 14:45:46.791990 28374 container.go:177] Could not fetch container 1e60e5d99769517eaa4add0bc7c70daecd053171434ff4ecaba5dd517f9
Sep 26 14:56:14 localhost sycri[28374]: E0926 14:56:14.904313 28374 container.go:177] Could not fetch container 498ddec28abee523f56c20ab1f173e690e382b4c5d3ec10c1f89993b701
It looks like the image does not exist, but I have pushed the image with
> singularity push image.sif library://malixian/default/test-arm64
@malixian What version of singularity-cri are you using?
Can you confirm the same network error appears if you do singularity pull library://malixian/default/test-arm64:latest on that host?
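If the pull shows the same TLS handshake timeout, a quick connectivity check from that host can help rule out a proxy or firewall issue (a rough sketch; assumes curl and openssl are installed):

$ curl -vI https://library.sylabs.io
$ echo | openssl s_client -connect library.sylabs.io:443 -servername library.sylabs.io 2>/dev/null | head -n 5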
> Can you confirm the same network error appears if you do singularity pull library://malixian/default/test-arm64:latest on that host?

It is OK on that host. And I also tried to deploy sycri on x86, but some errors appear there as well. The yaml file is the official example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: image-service-deployment
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: image-service
  template:
    metadata:
      labels:
        app: image-service
      name: image-service
      namespace: default
    spec:
      containers:
      - name: image-server
        image: cloud.sylabs.io/sashayakovtseva/test/image-server
        ports:
        - containerPort: 8080
        securityContext:
          runAsUser: 1000
      nodeSelector:
        kubernetes.io/hostname: 10.18.127.1
---
apiVersion: v1
kind: Service
metadata:
  name: image-service
spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 8080
  selector:
    app: image-service
journalctl -u kubelet -e
pod_workers.go:190] Error syncing pod e0f2d839-e038-11e9-9f5c-ac1f6bac1d10 ("image-service-deployment-655d89d94d-rfl5f_default(e0f2d839-e038-11e9-9f5c-ac1f6bac1d10)"), skipping: failed to "StartContainer" for "image-server" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=image-server pod=image-service-deployment-655d89d94d-rfl5f_default(e0f2d11e9-9f5c-ac1f6bac1d10)"
Sep 26 17:06:18 comput1 kubelet[2792]: E0926 17:06:18.623068 2792 cri_stats_provider.go:320] Failed to get the info of the filesystem with mountpoint "/run": failed to get device for dir "/run": could not find device with major: 0, minor: 20 in cached partitions map.
journalctl -u sycri -e
Sep 26 16:43:55 comput1 sycri[1787]: Error: rpc error: code = Internal desc = could not start container: unexpected container state: 4
Sep 26 16:44:41 comput1 sycri[1787]: E0926 16:44:41.549846 1787 main.go:276] /runtime.v1alpha2.RuntimeService/StartContainer
Sep 26 16:44:41 comput1 sycri[1787]: Request: {"container_id":"adcec04f0d24bf166b94b202dc7b1ebdd2dae6a0437c6982cc5f8111f8c6ead5"}
Sep 26 16:44:41 comput1 sycri[1787]: Response: null
Sep 26 16:44:41 comput1 sycri[1787]: Error: rpc error: code = Internal desc = could not start container: unexpected container state: 4
Sep 26 16:46:05 comput1 sycri[1787]: E0926 16:46:05.197723 1787 main.go:276] /runtime.v1alpha2.RuntimeService/StartContainer
Sep 26 16:46:05 comput1 sycri[1787]: Request: {"container_id":"d33181a4531a36e3f365c1d9b3b6107137a28c64ad752d762ce603fe1b95a7cd"}
Sep 26 16:46:05 comput1 sycri[1787]: Response: null
Sep 26 16:46:05 comput1 sycri[1787]: Error: rpc error: code = Internal desc = could not start container: unexpected container state: 4
Sep 26 16:46:06 comput1 sycri[1787]: E0926 16:46:06.262393 1787 container.go:177] Could not fetch container adcec04f0d24bf166b94b202dc7b1ebdd2dae6a0437c6982cc5f8111f8c6ea
Sep 26 16:48:50 comput1 sycri[1787]: E0926 16:48:50.075495 1787 main.go:276] /runtime.v1alpha2.RuntimeService/StartContainer
Sep 26 16:48:50 comput1 sycri[1787]: Request: {"container_id":"82800d31b039ae364530530a4138e1ac245e222c6f80b83c74c491ea0f84a94e"}
Sep 26 16:48:50 comput1 sycri[1787]: Response: null
@malixian Is this not an issue anymore? It looks like you have accidentally closed it.
And I need the full output of kubectl describe no <your node> and of sycri version.
@sashayakovtseva yes, you're right. The sycri version is 1.0.0-beta.5. The node information is:
Name: 10.18.127.3
Roles: node
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/hostname=10.18.127.3
kubernetes.io/role=node
Annotations: node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Wed, 12 Jun 2019 16:20:55 +0800
Taints: <none>
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Thu, 26 Sep 2019 17:56:07 +0800 Wed, 12 Jun 2019 16:20:55 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Thu, 26 Sep 2019 17:56:07 +0800 Wed, 12 Jun 2019 16:20:55 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Thu, 26 Sep 2019 17:56:07 +0800 Wed, 12 Jun 2019 16:20:55 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Thu, 26 Sep 2019 17:56:07 +0800 Tue, 24 Sep 2019 17:55:59 +0800 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 10.18.127.3
Hostname: 10.18.127.3
Capacity:
cpu: 32
ephemeral-storage: 511750Mi
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 65393136Ki
pods: 110
Allocatable:
cpu: 32
ephemeral-storage: 482947890401
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 65290736Ki
pods: 110
System Info:
Machine ID: 2e4e94b510a14392ab58491d3e377c96
System UUID: 00000000-0000-0000-0000-AC1F6BAC404E
Boot ID: d5e7100c-4a2f-4a96-a776-179cf47676ef
Kernel Version: 3.10.0-957.21.2.el7.x86_64
OS Image: CentOS Linux 7 (Core)
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://18.9.2
Kubelet Version: v1.13.5
Kube-Proxy Version: v1.13.5
PodCIDR: 172.22.1.0/24
Non-terminated Pods: (40 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
kube-system calico-kube-controllers-84db645bdf-ggptb 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kube-system calico-node-r85wt 250m (0%) 0 (0%) 0 (0%) 0 (0%) 106d
kube-system coredns-7c5785cbcc-2f4r6 100m (0%) 0 (0%) 70Mi (0%) 170Mi (0%) 2d
kube-system coredns-7c5785cbcc-pqvfs 100m (0%) 0 (0%) 70Mi (0%) 170Mi (0%) 2d
kube-system heapster-5b9b6b6597-d5n67 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kube-system kubernetes-dashboard-76479d66bb-j9bpg 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kube-system metrics-server-79558444c6-9gx75 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kubeflow ambassador-6f7cc986df-5qg87 200m (0%) 1 (3%) 100Mi (0%) 400Mi (0%) 2d
kubeflow ambassador-6f7cc986df-lp6bq 200m (0%) 1 (3%) 100Mi (0%) 400Mi (0%) 2d
kubeflow ambassador-6f7cc986df-m66w7 200m (0%) 1 (3%) 100Mi (0%) 400Mi (0%) 2d
kubeflow argo-ui-db7cf456c-dlhlf 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kubeflow centraldashboard-79f6448bb7-x6q55 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kubeflow config-controller-6d84df4f66-6g85b 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kubeflow jupyter-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kubeflow jupyter-web-app-78844bd57-lhsc2 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kubeflow katib-ui-bf44885cd-6bk4l 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kubeflow metacontroller-0 500m (1%) 4 (12%) 1Gi (1%) 4Gi (6%) 2d
kubeflow minio-6d879f8d6c-k9sg9 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kubeflow ml-pipeline-5dfc9cc665-4cvrz 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kubeflow ml-pipeline-persistenceagent-5c5d669f5d-xg7lt 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kubeflow ml-pipeline-scheduledworkflow-84ddd9886d-hg7kr 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kubeflow ml-pipeline-ui-58c78c9ffb-dqlhd 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kubeflow ml-pipeline-viewer-controller-deployment-547bb45844-96xp9 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kubeflow mysql-58cfd7c97b-9vcjs 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kubeflow notebooks-controller-86c8944799-5x952 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kubeflow profiles-7896d9bd97-phd44 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kubeflow pytorch-operator-54484d9b6c-jpgp5 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kubeflow spartakus-volunteer-6798cc9878-kjpd8 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kubeflow studyjob-controller-58bccc4747-n4vcn 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kubeflow tf-job-dashboard-56564f6f99-9c6gb 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kubeflow tf-job-operator-6bfd5c7db8-qlctd 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kubeflow vizier-core-cfd9566b-mzsv8 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kubeflow vizier-core-rest-6c69cd9656-nkdwl 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kubeflow vizier-db-6885dbd6cb-twm5p 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kubeflow vizier-suggestion-bayesianoptimization-7ddbbd49b6-4tspd 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kubeflow vizier-suggestion-grid-ccc744bfb-bwvhp 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kubeflow vizier-suggestion-hyperband-5bfbd98c78-snc7m 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kubeflow vizier-suggestion-random-f69bf84f4-4dx7p 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kubeflow workflow-controller-6866879d86-jhlsl 0 (0%) 0 (0%) 0 (0%) 0 (0%) 2d
kubeflow wzy-0 4 (12%) 0 (0%) 8Gi (12%) 0 (0%) 2d
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 5550m (17%) 7 (21%)
memory 9656Mi (15%) 5636Mi (8%)
ephemeral-storage 0 (0%) 0 (0%)
Events: <none>
Can you please update sycri to the latest version?
Also, the node information says you are running docker. I think you pasted the wrong node info here. According to your pod yaml you schedule it to 10.18.127.1, but the info is for 10.18.127.3.
Sorry, it's my fault. And your suggestion is to update sycri to 1.0.0-beta.6? The current version is 1.0.0-beta.5.
Name: 10.18.127.1
Roles: node
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/hostname=10.18.127.1
kubernetes.io/role=node
Annotations: node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Wed, 12 Jun 2019 16:20:03 +0800
Taints: <none>
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Thu, 26 Sep 2019 18:30:27 +0800 Fri, 21 Jun 2019 14:36:04 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Thu, 26 Sep 2019 18:30:27 +0800 Fri, 21 Jun 2019 14:36:04 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Thu, 26 Sep 2019 18:30:27 +0800 Fri, 21 Jun 2019 14:36:04 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Thu, 26 Sep 2019 18:30:27 +0800 Thu, 26 Sep 2019 16:29:11 +0800 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 10.18.127.1
Hostname: 10.18.127.1
Capacity:
cpu: 32
ephemeral-storage: 511750Mi
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 65393136Ki
pods: 110
Allocatable:
cpu: 32
ephemeral-storage: 482947890401
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 65290736Ki
pods: 110
System Info:
Machine ID: 8b1ca9216f384e5c90f309b4af7066b1
System UUID: 00000000-0000-0000-0000-AC1F6BAC1D10
Boot ID: b3e7c019-5477-4a70-8e16-0b0ceac4da50
Kernel Version: 3.10.0-957.21.2.el7.x86_64
OS Image: CentOS Linux 7 (Core)
Operating System: linux
Architecture: amd64
Container Runtime Version: singularity://3.4.0-1
Kubelet Version: v1.13.5
Kube-Proxy Version: v1.13.5
PodCIDR: 172.22.0.0/24
Non-terminated Pods: (2 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
default image-service-deployment-655d89d94d-rfl5f 0 (0%) 0 (0%) 0 (0%) 0 (0%) 113m
kube-system calico-node-ps7bb 250m (0%) 0 (0%) 0 (0%) 0 (0%) 106d
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 250m (0%) 0 (0%)
memory 0 (0%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
Events: <none>
> your suggestion is to update sycri to 1.0.0-beta.6?
Yes, there've been some fixes
Also, enabling debug logs will help us understand what is wrong.
In the sycri service, add the -v 6 option to the sycri command and then restart the services (assuming systemd is used):
$ sudo systemctl daemon-reload
$ sudo systemctl stop kubelet sycri
$ sudo systemctl restart sycri kubelet
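For reference, after editing, the ExecStart line in the sycri unit should end up looking something like this (the binary path here is illustrative; only the added -v 6 option matters):

[Service]
ExecStart=/usr/local/bin/sycri -v 6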
@sashayakovtseva Unfortunately, it doesn't work. But I can see detailed sycri execution information:
Sep 27 11:16:46 comput1 sycri[189766]: DEBUG [U=1000,P=1] startup() oci runtime engine selected
Sep 27 11:16:46 comput1 sycri[189766]: VERBOSE [U=1000,P=1] startup() Execute stage 2
Sep 27 11:16:46 comput1 sycri[189766]: DEBUG [U=1000,P=1] StageTwo() Entering stage 2
Sep 27 11:16:46 comput1 sycri[189766]: I0927 11:16:46.874076 189766 sync.go:76] Received state 2 at /var/run/singularity/containers/2d3675bcb18c6c9aa0d34fe2018da97449838a8a7720a318f23f8da7de721b1a/sync.sock
Sep 27 11:16:46 comput1 sycri[189766]: I0927 11:16:46.877823 189766 client_oci.go:125] Stream copying returned: context canceled
Sep 27 11:16:46 comput1 sycri[189766]: I0927 11:16:46.989017 189766 container.go:288] Starting container 2d3675bcb18c6c9aa0d34fe2018da97449838a8a7720a318f23f8da7de721b1a
Sep 27 11:16:46 comput1 sycri[189766]: I0927 11:16:46.989140 189766 client.go:87] Executing [singularity -d oci start 2d3675bcb18c6c9aa0d34fe2018da97449838a8a7720a318f23f8da7de721b1a]
Sep 27 11:16:47 comput1 sycri[189766]: DEBUG [U=0,P=175124] createConfDir() /root/.singularity already exists. Not creating.
Sep 27 11:16:47 comput1 sycri[189766]: I0927 11:16:47.071929 189766 sync.go:76] Received state 4 at /var/run/singularity/containers/2d3675bcb18c6c9aa0d34fe2018da97449838a8a7720a318f23f8da7de721b1a/sync.sock
Sep 27 11:16:47 comput1 sycri[189766]: E0927 11:16:47.079662 189766 main.go:276] /runtime.v1alpha2.RuntimeService/StartContainer
Sep 27 11:16:47 comput1 sycri[189766]: Request: {"container_id":"2d3675bcb18c6c9aa0d34fe2018da97449838a8a7720a318f23f8da7de721b1a"}
Sep 27 11:16:47 comput1 sycri[189766]: Response: null
Sep 27 11:16:47 comput1 sycri[189766]: Error: rpc error: code = Internal desc = could not start container: unexpected container state: 4
Sep 27 11:16:48 comput1 sycri[189766]: I0927 11:16:48.612509 189766 container_files.go:104] Removing bundle at /var/run/singularity/containers/1d632b4b2ce0d403c730b400469f5d663e8cc900992db8fd0ce0a1bf499e68a8/bundle
Sep 27 11:16:48 comput1 sycri[189766]: I0927 11:16:48.657972 189766 container_files.go:118] Removing container base directory /var/run/singularity/containers/1d632b4b2ce0d403c730b400469f5d663e8cc900992db8fd0ce0a1bf499e68a8
Sep 27 11:16:53 comput1 sycri[189766]: I0927 11:16:53.742925 189766 client_oci.go:166] Executing [singularity -d oci exec 620068bdf0dadcb5dc6409ab9ac27854a54745bebec5914d05f03debf2033c73 /bin/calico-node -bird-ready -felix-ready]
Sep 27 11:17:03 comput1 sycri[189766]: I0927 11:17:03.742912 189766 client_oci.go:166] Executing [singularity -d oci exec 620068bdf0dadcb5dc6409ab9ac27854a54745bebec5914d05f03debf2033c73 /bin/calico-node -bird-ready -felix-ready]
Sep 27 11:17:13 comput1 sycri[189766]: I0927 11:17:13.742929 189766 client_oci.go:166] Executing [singularity -d oci exec 620068bdf0dadcb5dc6409ab9ac27854a54745bebec5914d05f03debf2033c73 /bin/calico-node -bird-ready -felix-ready]
Sep 27 11:17:23 comput1 sycri[189766]: I0927 11:17:23.742745 189766 client_oci.go:166] Executing [singularity -d oci exec 620068bdf0dadcb5dc6409ab9ac27854a54745bebec5914d05f03debf2033c73 /bin/calico-node -bird-ready -felix-ready]
Sep 27 11:17:33 comput1 sycri[189766]: I0927 11:17:33.743061 189766 client_oci.go:166] Executing [singularity -d oci exec 620068bdf0dadcb5dc6409ab9ac27854a54745bebec5914d05f03debf2033c73 /bin/calico-node -bird-ready -felix-ready]
You can see the execution process never reaches stage 3 and ends with unexpected container state: 4. Can you explain what's wrong with it?
Is that for sashayakovtseva/test/image-server? That image was built for amd64, so if you are scheduling it to arm it fails to start.
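A rough way to check which architecture an image was built for is to run uname inside it; on a mismatched host this fails with an exec format error (a sketch, assuming the singularity CLI is installed on the node):

$ singularity pull image.sif library://sashayakovtseva/test/image-server:latest
$ singularity exec image.sif uname -m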
I changed the arch to amd64, and that error does not look like an image compatibility issue. The core error is what I have shown: "StartContainer" for "image-server" with RunContainerError: "could not start container: unexpected container state: 4"
I see this error, but that only means container fails to start. The reason is somewhere in the logs hopefully. However, I think you may not see some of them because of #360.
I would appreciate it if you could run the image with Singularity directly on that host and paste the full output here:
$ singularity pull image.sif library://sashayakovtseva/test/image-server:latest
$ sudo singularity oci mount image.sif image
$ sudo singularity -d oci run -b image server
The output seems normal.
DEBUG [U=0,P=243837] createConfDir() /root/.singularity already exists. Not creating.
VERBOSE [U=0,P=243848] print() Set messagelevel to: 5
VERBOSE [U=0,P=243848] init() Starter initialization
DEBUG [U=0,P=243848] get_pipe_exec_fd() PIPE_EXEC_FD value: 8
VERBOSE [U=0,P=243848] is_suid() Check if we are running as setuid
DEBUG [U=0,P=243848] init() Read engine configuration
DEBUG [U=0,P=243848] init() Wait completion of stage1
DEBUG [U=0,P=243849] set_parent_death_signal() Set parent death signal to 9
VERBOSE [U=0,P=243849] init() Spawn stage 1
DEBUG [U=0,P=243849] startup() oci runtime engine selected
VERBOSE [U=0,P=243849] startup() Execute stage 1
DEBUG [U=0,P=243849] StageOne() Entering stage 1
VERBOSE [U=0,P=243848] wait_child() stage 1 exited with status 0
DEBUG [U=0,P=243848] cleanup_fd() Close file descriptor 4
DEBUG [U=0,P=243848] init() Set child signal mask
VERBOSE [U=0,P=243848] init() Run as instance
DEBUG [U=0,P=243856] init() Create socketpair for master communication channel
DEBUG [U=0,P=243856] init() Create RPC socketpair for communication between stage 2 and RPC server
VERBOSE [U=0,P=243856] priv_escalate() Get root privileges
VERBOSE [U=0,P=243856] priv_escalate() Change filesystem uid to 0
VERBOSE [U=0,P=243856] pid_namespace_init() Create pid namespace
VERBOSE [U=0,P=243856] init() Spawn master process
DEBUG [U=0,P=1] set_parent_death_signal() Set parent death signal to 9
VERBOSE [U=0,P=1] create_namespace() Create network namespace
VERBOSE [U=0,P=1] create_namespace() Create uts namespace
VERBOSE [U=0,P=1] create_namespace() Create ipc namespace
VERBOSE [U=0,P=1] create_namespace() Create mount namespace
DEBUG [U=0,P=2] set_parent_death_signal() Set parent death signal to 9
VERBOSE [U=0,P=2] init() Spawn RPC server
DEBUG [U=0,P=243856] startup() oci runtime engine selected
VERBOSE [U=0,P=243856] startup() Execute master process
DEBUG [U=0,P=243856] func1() Using singularity directory "/root/.singularity"
DEBUG [U=0,P=2] startup() oci runtime engine selected
VERBOSE [U=0,P=2] startup() Serve RPC requests
DEBUG [U=0,P=243856] addRootfsMount() Parent rootfs: /run/singularity/containers/image/rootfs
DEBUG [U=0,P=243856] CreateContainer() Mount all
DEBUG [U=0,P=243856] mount() Checking if /proc/243857/root/run/singularity/containers/image/rootfs exists
DEBUG [U=0,P=243856] mount() Mount /run/singularity/containers/image/rootfs to /run/singularity/containers/image/rootfs : []
DEBUG [U=0,P=243856] mount() Checking if /proc/243857/root/run/singularity/containers/image/rootfs/proc exists
DEBUG [U=0,P=243856] mount() Mount proc to /run/singularity/containers/image/rootfs/proc : proc []
DEBUG [U=0,P=243856] mount() Checking if /proc/243857/root/run/singularity/containers/image/rootfs/dev exists
DEBUG [U=0,P=243856] mount() Mount tmpfs to /run/singularity/containers/image/rootfs/dev : tmpfs [mode=755,size=65536k]
DEBUG [U=0,P=243856] mount() Checking if /proc/243857/root/run/singularity/containers/image/rootfs/dev/pts exists
DEBUG [U=0,P=243856] mount() Creating /proc/243857/root/run/singularity/containers/image/rootfs/dev/pts
DEBUG [U=0,P=243856] mount() Mount devpts to /run/singularity/containers/image/rootfs/dev/pts : devpts [newinstance,ptmxmode=0666,mode=0620,gid=5]
DEBUG [U=0,P=243856] mount() Checking if /proc/243857/root/run/singularity/containers/image/rootfs/dev/shm exists
DEBUG [U=0,P=243856] mount() Creating /proc/243857/root/run/singularity/containers/image/rootfs/dev/shm
DEBUG [U=0,P=243856] mount() Mount shm to /run/singularity/containers/image/rootfs/dev/shm : tmpfs [mode=1777,size=65536k]
DEBUG [U=0,P=243856] mount() Checking if /proc/243857/root/run/singularity/containers/image/rootfs/dev/mqueue exists
DEBUG [U=0,P=243856] mount() Creating /proc/243857/root/run/singularity/containers/image/rootfs/dev/mqueue
DEBUG [U=0,P=243856] mount() Mount mqueue to /run/singularity/containers/image/rootfs/dev/mqueue : mqueue []
DEBUG [U=0,P=243856] mount() Checking if /proc/243857/root/run/singularity/containers/image/rootfs/sys exists
DEBUG [U=0,P=243856] mount() Mount sysfs to /run/singularity/containers/image/rootfs/sys : sysfs []
DEBUG [U=0,P=2] Chroot() Change current directory to /run/singularity/containers/image/rootfs
DEBUG [U=0,P=2] Chroot() Hold reference to host / directory
DEBUG [U=0,P=2] Chroot() Called pivot_root on /run/singularity/containers/image/rootfs
DEBUG [U=0,P=2] Chroot() Change current directory to host / directory
DEBUG [U=0,P=2] Chroot() Apply slave mount propagation for host / directory
DEBUG [U=0,P=2] Chroot() Called unmount(/, syscall.MNT_DETACH)
DEBUG [U=0,P=2] Chroot() Changing directory to / to avoid getpwd issues
VERBOSE [U=0,P=1] wait_child() rpc server exited with status 0
DEBUG [U=0,P=1] apply_container_privileges() Set main group ID to 0
DEBUG [U=0,P=1] apply_container_privileges() Set 1 additional group IDs
DEBUG [U=0,P=1] apply_container_privileges() Set user ID to 0
DEBUG [U=0,P=1] set_parent_death_signal() Set parent death signal to 9
DEBUG [U=0,P=1] startup() oci runtime engine selected
VERBOSE [U=0,P=1] startup() Execute stage 2
DEBUG [U=0,P=1] StageTwo() Entering stage 2
2019/09/27 10:05:29 Listening on 8080
my kubelet.service config is
[Unit]
Description=Kubernetes Kubelet
Documentation=https://github.com/GoogleCloudPlatform/kubernetes
[Service]
WorkingDirectory=/var/lib/kubelet
ExecStart=/opt/kube/bin/kubelet \
--address=10.18.127.1 \
--allow-privileged=true \
--anonymous-auth=false \
--authentication-token-webhook \
--authorization-mode=Webhook \
--client-ca-file=/etc/kubernetes/ssl/ca.pem \
--cluster-dns=10.70.0.2 \
--cluster-domain=cluster.local. \
--cni-bin-dir=/opt/kube/bin \
--cni-conf-dir=/etc/cni/net.d \
--fail-swap-on=false \
--hairpin-mode hairpin-veth \
--hostname-override=10.18.127.1 \
--kubeconfig=/etc/kubernetes/kubelet.kubeconfig \
--max-pods=110 \
--network-plugin=cni \
--pod-infra-container-image=mirrorgooglecontainers/pause-amd64:3.1 \
--register-node=true \
--root-dir=/var/lib/kubelet \
--tls-cert-file=/etc/kubernetes/ssl/kubelet.pem \
--tls-private-key-file=/etc/kubernetes/ssl/kubelet-key.pem \
--v=2 \
--container-runtime=remote \
--container-runtime-endpoint=unix:///var/run/singularity.sock \
--image-service-endpoint=unix:///var/run/singularity.sock
ExecStartPost=/sbin/iptables -A INPUT -s 10.0.0.0/8 -p tcp --dport 4194 -j ACCEPT
ExecStartPost=/sbin/iptables -A INPUT -s 172.16.0.0/12 -p tcp --dport 4194 -j ACCEPT
ExecStartPost=/sbin/iptables -A INPUT -s 192.168.0.0/16 -p tcp --dport 4194 -j ACCEPT
ExecStartPost=/sbin/iptables -A INPUT -p tcp --dport 4194 -j DROP
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
Weird... While I am working on a fix for #360, could you try to launch the image-service with an allocated tty, i.e.:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: image-service-deployment
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: image-service
  template:
    metadata:
      labels:
        app: image-service
      name: image-service
      namespace: default
    spec:
      containers:
      - name: image-server
        image: cloud.sylabs.io/sashayakovtseva/test/image-server
        ports:
        - containerPort: 8080
        tty: true
        securityContext:
          runAsUser: 1000
      nodeSelector:
        kubernetes.io/hostname: 10.18.127.1
This should prevent the logs from being truncated. Then please post the output here.
sycri still has only the same output:
Sep 29 09:05:29 comput1 sycri[189766]: I0929 09:05:29.825818 189766 sync.go:76] Received state 2 at /var/run/singularity/containers/c62c56aac26ffd846a1ce884dce7a320b0aafa057f062637ebb69a357ba76669/sync.sock
Sep 29 09:05:29 comput1 sycri[189766]: I0929 09:05:29.829802 189766 client_oci.go:125] Stream copying returned: context canceled
Sep 29 09:05:29 comput1 sycri[189766]: I0929 09:05:29.928710 189766 container.go:288] Starting container c62c56aac26ffd846a1ce884dce7a320b0aafa057f062637ebb69a357ba76669
Sep 29 09:05:29 comput1 sycri[189766]: I0929 09:05:29.928841 189766 client.go:87] Executing [singularity -d oci start c62c56aac26ffd846a1ce884dce7a320b0aafa057f062637ebb69a357ba76669]
Sep 29 09:05:29 comput1 sycri[189766]: DEBUG [U=0,P=332186] createConfDir() /root/.singularity already exists. Not creating.
Sep 29 09:05:30 comput1 sycri[189766]: I0929 09:05:30.017106 189766 sync.go:76] Received state 4 at /var/run/singularity/containers/c62c56aac26ffd846a1ce884dce7a320b0aafa057f062637ebb69a357ba76669/sync.sock
Sep 29 09:05:30 comput1 sycri[189766]: E0929 09:05:30.025262 189766 main.go:276] /runtime.v1alpha2.RuntimeService/StartContainer
Sep 29 09:05:30 comput1 sycri[189766]: Request: {"container_id":"c62c56aac26ffd846a1ce884dce7a320b0aafa057f062637ebb69a357ba76669"}
Sep 29 09:05:30 comput1 sycri[189766]: Response: null
Sep 29 09:05:30 comput1 sycri[189766]: Error: rpc error: code = Internal desc = could not start container: unexpected container state: 4
Sep 29 09:05:31 comput1 sycri[189766]: I0929 09:05:31.394787 189766 container_files.go:104] Removing bundle at /var/run/singularity/containers/645be6b9aadf616290a5c038d3f435c4d7b2fe078ac1a47941ce1158a06b3370/bundle
Sep 29 09:05:31 comput1 sycri[189766]: I0929 09:05:31.442043 189766 container_files.go:118] Removing container base directory /var/run/singularity/containers/645be6b9aadf616290a5c038d3f435c4d7b2fe078ac1a47941ce1158a06b3370
Sep 29 09:05:33 comput1 sycri[189766]: I0929 09:05:33.743126 189766 client_oci.go:166] Executing [singularity -d oci exec 620068bdf0dadcb5dc6409ab9ac27854a54745bebec5914d05f03debf2033c73 /bin/calico-node -bird-ready -felix-ready]
Sep 29 09:05:43 comput1 sycri[189766]: I0929 09:05:43.742937 189766 client_oci.go:166] Executing [singularity -d oci exec 620068bdf0dadcb5dc6409ab9ac27854a54745bebec5914d05f03debf2033c73 /bin/calico-node -bird-ready -felix-ready]
What do container logs show?
The problem has been found: maybe the shell execution time in the container is too short, so the logs show unexpected container state: 4. If I add sleep 30, for example, the pod status is Running as expected.
Is that a bug? Because the container has an exposed port, the pod status should be Running.
Execution time will not result in an unexpected container state.
Try to fetch pod logs (they remain even if container is recreated) and also please provide the shell script you are trying to run.
Btw there is a beta7 version, so feel free to update singularity-cri :)
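For example, logs of the previous (crashed) container instance can be fetched like this (the pod name is a placeholder for whatever kubectl get pods shows):

$ kubectl logs <pod-name> -n default --previous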
Hi @sashayakovtseva, I tried to execute a yaml file like this:
apiVersion: v1
kind: Pod
metadata:
  name: mpi-worker-03
spec:
  containers:
  - command:
    - /bin/sh
    - -c
    - |
      mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
    image: volcanosh/example-mpi:0.0.1
    name: mpi-worker-03
    ports:
    - containerPort: 22
      name: mpijob-port
    workingDir: /home
  nodeSelector:
    kubernetes.io/hostname: k8s03
  restartPolicy: OnFailure
And I find the pod status is CrashLoopBackOff, but when I append sleep 600 to the shell command in the container, like mkdir -p /var/run/sshd; /usr/sbin/sshd -D; sleep 600, the pod status is Running as expected, and after 600s the pod status is Completed. Whether I append sleep 600 or not, when I run the same yaml file with the docker runtime the pod status is always Running, because we set containerPort in the yaml. If you are free you can try it, and I will be very grateful to you for answering my doubts.
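If it helps, the same direct-run check suggested above can be applied to this image to see why the entrypoint exits so quickly (a sketch; the docker:// pull and the bundle name are illustrative):

$ singularity pull mpi.sif docker://volcanosh/example-mpi:0.0.1
$ sudo singularity oci mount mpi.sif mpi
$ sudo singularity -d oci run -b mpi mpi-worker-03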
@malixian @sashayakovtseva I set up one node for testing and I also have the same issue. Would you share an update?
This is my environment:
# kubectl describe no amax
Name: amax
Roles: master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=amax
kubernetes.io/os=linux
node-role.kubernetes.io/master=
Annotations: kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/singularity.sock
node.alpha.kubernetes.io/ttl: 0
projectcalico.org/IPv4Address: 192.168.118.45/24
projectcalico.org/IPv4IPIPTunnelAddr: 192.168.195.192
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Thu, 26 Mar 2020 18:26:52 +0800
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: amax
AcquireTime: <unset>
RenewTime: Fri, 27 Mar 2020 11:20:07 +0800
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Fri, 27 Mar 2020 09:41:40 +0800 Fri, 27 Mar 2020 09:41:40 +0800 CalicoIsUp Calico is running on this node
MemoryPressure False Fri, 27 Mar 2020 11:15:31 +0800 Thu, 26 Mar 2020 18:26:49 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Fri, 27 Mar 2020 11:15:31 +0800 Thu, 26 Mar 2020 18:26:49 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Fri, 27 Mar 2020 11:15:31 +0800 Thu, 26 Mar 2020 18:26:49 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Fri, 27 Mar 2020 11:15:31 +0800 Thu, 26 Mar 2020 18:31:16 +0800 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 192.168.118.45
Hostname: amax
Capacity:
cpu: 8
ephemeral-storage: 95800732Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 7137484Ki
pods: 110
Allocatable:
cpu: 8
ephemeral-storage: 88289954466
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 7035084Ki
pods: 110
System Info:
Machine ID: 0d2830c8aec14057baf9ef5796780648
System UUID: 7F437ED2-FA52-4367-AD96-06C00FF55E38
Boot ID: 9300e908-a60c-4943-9fdf-d4d84c4f4506
Kernel Version: 4.15.0-30deepin-generic
OS Image: Deepin 15
Operating System: linux
Architecture: amd64
Container Runtime Version: singularity://3.5.3
Kubelet Version: v1.17.4
Kube-Proxy Version: v1.17.4
PodCIDR: 192.168.0.0/24
PodCIDRs: 192.168.0.0/24
Non-terminated Pods: (18 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
default centos7-758596459c-ncmjh 0 (0%) 0 (0%) 0 (0%) 0 (0%) 8m54s
default hello 0 (0%) 0 (0%) 0 (0%) 0 (0%) 16h
default hello-kubernetes-8764bc78f-ff84z 0 (0%) 0 (0%) 0 (0%) 0 (0%) 16h
default hello-kubernetes-8764bc78f-ql4d7 0 (0%) 0 (0%) 0 (0%) 0 (0%) 16h
default hello-kubernetes-8764bc78f-z8cpc 0 (0%) 0 (0%) 0 (0%) 0 (0%) 16h
default image-service-deployment-766979ff9d-2rwbw 0 (0%) 0 (0%) 0 (0%) 0 (0%) 19m
default sif-scheduler-extender 0 (0%) 0 (0%) 0 (0%) 0 (0%) 26m
kube-system calico-kube-controllers-bc44d789c-hszpp 0 (0%) 0 (0%) 0 (0%) 0 (0%) 16h
kube-system calico-node-ckhmq 250m (3%) 0 (0%) 0 (0%) 0 (0%) 16h
kube-system coredns-9d85f5447-g689n 100m (1%) 0 (0%) 70Mi (1%) 170Mi (2%) 16h
kube-system coredns-9d85f5447-lj9td 100m (1%) 0 (0%) 70Mi (1%) 170Mi (2%) 16h
kube-system etcd-amax 0 (0%) 0 (0%) 0 (0%) 0 (0%) 16h
kube-system kube-apiserver-amax 250m (3%) 0 (0%) 0 (0%) 0 (0%) 16h
kube-system kube-controller-manager-amax 200m (2%) 0 (0%) 0 (0%) 0 (0%) 16h
kube-system kube-proxy-jjdvj 0 (0%) 0 (0%) 0 (0%) 0 (0%) 16h
kube-system kube-scheduler-amax 100m (1%) 0 (0%) 0 (0%) 0 (0%) 20m
kubernetes-dashboard dashboard-metrics-scraper-7b8b58dc8b-6ddnb 0 (0%) 0 (0%) 0 (0%) 0 (0%) 16h
kubernetes-dashboard kubernetes-dashboard-755dcb9575-99ktd 0 (0%) 0 (0%) 0 (0%) 0 (0%) 16h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1 (12%) 0 (0%)
memory 140Mi (2%) 340Mi (4%)
ephemeral-storage 0 (0%) 0 (0%)
Events: <none>
I tested example/k8s/image-service.yaml and got this error log:
# kubectl describe pod image-service-deployment-766979ff9d-2rwbw
Name: image-service-deployment-766979ff9d-2rwbw
Namespace: default
Priority: 0
Node: amax/192.168.118.45
Start Time: Fri, 27 Mar 2020 11:01:04 +0800
Labels: app=image-service
pod-template-hash=766979ff9d
Annotations: cni.projectcalico.org/podIP: 192.168.195.217/32
Status: Running
IP: 192.168.195.217
IPs:
IP: 192.168.195.217
Controlled By: ReplicaSet/image-service-deployment-766979ff9d
Containers:
image-server:
Container ID: singularity://d5216a3a308a4ae5a30289ff01b9c60c9fc0603ffc0a99072575d63918edac6e
Image: cloud.sylabs.io/sashayakovtseva/test/image-server
Image ID: cf5d9eea227371037e614fc7dec7c1f437a6398f9b08250b89ef5c92aab7e737
Port: 8080/TCP
Host Port: 0/TCP
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Message: exited with code 255
Exit Code: 255
Started: Thu, 01 Jan 1970 08:00:00 +0800
Finished: Fri, 27 Mar 2020 11:22:39 +0800
Ready: False
Restart Count: 9
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-bkrkc (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-bkrkc:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-bkrkc
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned default/image-service-deployment-766979ff9d-2rwbw to amax
Normal Created 24m (x4 over 25m) kubelet, amax Created container image-server
Warning Failed 24m (x4 over 25m) kubelet, amax Error: could not start container: unexpected container state: exited
Normal Pulling 23m (x5 over 25m) kubelet, amax Pulling image "cloud.sylabs.io/sashayakovtseva/test/image-server"
Normal Pulled 23m (x5 over 25m) kubelet, amax Successfully pulled image "cloud.sylabs.io/sashayakovtseva/test/image-server"
Warning BackOff 12s (x112 over 25m) kubelet, amax Back-off restarting failed container
my kubelet.service config is:
[Unit]
Description=Kubernetes Kubelet
Documentation=https://github.com/GoogleCloudPlatform/kubernetes
After=docker.service
Requires=docker.service
[Service]
WorkingDirectory=/var/lib/kubelet
ExecStart=/opt/kube/bin/kubelet \
--address=10.2.152.182 \
--allow-privileged=true \
--anonymous-auth=false \
--authentication-token-webhook \
--authorization-mode=Webhook \
--client-ca-file=/etc/kubernetes/ssl/ca.pem \
--cluster-dns=10.70.0.2 \
--cluster-domain=cluster.local. \
--cni-bin-dir=/opt/kube/bin \
--cni-conf-dir=/etc/cni/net.d \
--fail-swap-on=false \
--hairpin-mode hairpin-veth \
--hostname-override=10.2.152.182 \
--kubeconfig=/etc/kubernetes/kubelet.kubeconfig \
--max-pods=110 \
--network-plugin=cni \
--pod-infra-container-image=mirrorgooglecontainers/pause-amd64:3.1 \
--register-node=true \
--root-dir=/var/lib/kubelet \
--tls-cert-file=/etc/kubernetes/ssl/kubelet.pem \
--tls-private-key-file=/etc/kubernetes/ssl/kubelet-key.pem \
--v=2 \
--container-runtime=remote \
--container-runtime-endpoint=unix:///var/run/singularity.sock \
--image-service-endpoint=unix:///var/run/singularity.sock
ExecStartPost=/sbin/iptables -A INPUT -s 10.0.0.0/8 -p tcp --dport 4194 -j ACCEPT
ExecStartPost=/sbin/iptables -A INPUT -s 172.16.0.0/12 -p tcp --dport 4194 -j ACCEPT
ExecStartPost=/sbin/iptables -A INPUT -s 192.168.0.0/16 -p tcp --dport 4194 -j ACCEPT
ExecStartPost=/sbin/iptables -A INPUT -p tcp --dport 4194 -j DROP
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
I think this config is OK, but the pod events show this error:
Error: could not create container: could not spawn container: could not create oci bundle: could not create SIF bundle: failed to load SIF image /var/lib/singularity/cf5d9eea227371037e614fc7dec7c1f437a6398f9b08250b89ef5c92aab7e737: image format not recognized
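A quick way to see what the cached file actually is (the path is taken from the error message above; singularity inspect should fail similarly if the file is not a valid SIF image):

$ ls -l /var/lib/singularity/cf5d9eea227371037e614fc7dec7c1f437a6398f9b08250b89ef5c92aab7e737
$ singularity inspect /var/lib/singularity/cf5d9eea227371037e614fc7dec7c1f437a6398f9b08250b89ef5c92aab7e737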