It is likely the container status exists, but no container ID is set in the status, which would mean the split call returns []string{""}
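As an illustration of that failure mode (a minimal sketch, not the actual openshift-sdn code): splitting an empty containerID on "://" yields a one-element slice, so unconditionally taking index 1 panics.

package main

import (
	"fmt"
	"strings"
)

func main() {
	// A populated status has a containerID like "docker://abc123",
	// which splits into two parts.
	parts := strings.Split("docker://abc123", "://")
	fmt.Println(parts[1]) // abc123

	// A pod stuck in ContainerCreating has containerID == "",
	// and strings.Split("", "://") returns []string{""}.
	empty := strings.Split("", "://")
	fmt.Println(len(empty)) // 1
	fmt.Println(empty[1])   // panic: runtime error: index out of range
}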
Observed on 1.1.0, after a node reboot. The node had been marked unschedulable first, and the pods were evacuated before the reboot.
Following advice from Jordan @liggitt:
oc get pods --all-namespaces -o json | jq '.items[] | select(.status.containerStatuses | (length > 0 and .[0].containerID == null)) | {namespace:.metadata.namespace,name:.metadata.name}'
returns
{
  "namespace": "default",
  "name": "docker-registry-1-kg2kt"
}
And here's the container status:
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2015-12-05T18:08:08Z
    message: 'containers with unready status: [registry]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  containerStatuses:
  - image: openshift/origin-docker-registry:v1.1
    imageID: ""
    lastState: {}
    name: registry
    ready: false
    restartCount: 0
    state:
      waiting:
        message: 'Image: openshift/origin-docker-registry:v1.1 is ready, container is creating'
        reason: ContainerCreating
  hostIP: 172.29.100.140
  phase: Pending
  startTime: 2015-12-05T18:08:08Z
The container was failing because the glusterfs endpoints had disappeared (https://github.com/openshift/origin/issues/6070).
Once the endpoints were recreated, the registry started automatically and the panic went away.
From an SDN perspective, this means any node whose SDN plugin encounters a pod in a state like this will panic (see the sketch below). Needs to be fixed ASAP.
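A minimal sketch of the kind of guard such a fix needs (hypothetical helper and stand-in types, not the actual #214 patch): check the result of the split instead of indexing it unconditionally, and skip pods whose container ID is not set yet.

package main

import (
	"fmt"
	"strings"
)

// Minimal stand-ins for the relevant pod status fields, just enough to
// illustrate the check; not the real Kubernetes API types.
type ContainerStatus struct {
	ContainerID string // e.g. "docker://abc123", or "" while the container is creating
}

type PodStatus struct {
	ContainerStatuses []ContainerStatus
}

// containerIDOf is a hypothetical helper: it returns the bare container ID,
// or an error when the ID is not populated, instead of panicking on index 1.
func containerIDOf(status PodStatus) (string, error) {
	if len(status.ContainerStatuses) == 0 {
		return "", fmt.Errorf("no container statuses")
	}
	parts := strings.Split(status.ContainerStatuses[0].ContainerID, "://")
	if len(parts) != 2 || parts[1] == "" {
		return "", fmt.Errorf("container ID not set yet")
	}
	return parts[1], nil
}

func main() {
	// Pod stuck in ContainerCreating, as in the status above: the ID is empty.
	creating := PodStatus{ContainerStatuses: []ContainerStatus{{ContainerID: ""}}}
	if _, err := containerIDOf(creating); err != nil {
		fmt.Println("skipping pod:", err) // handled instead of panicking
	}

	// Running pod: the ID parses normally.
	running := PodStatus{ContainerStatuses: []ContainerStatus{{ContainerID: "docker://abc123"}}}
	id, _ := containerIDOf(running)
	fmt.Println(id) // abc123
}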
Can't agree more :)
@rajatchopra @pravisankar @dcbw this is must-fix for 1.1.1/3.1.1
@liggitt Do you think your issue would be fixed by this patch? https://github.com/openshift/openshift-sdn/pull/214/
I can reproduce the panic without the fix in #214, and the issue is gone after the fix is merged.
This is fixed by #214, but we still need to pull it into Origin.
Actually it's also fixed in Origin after https://github.com/openshift/origin/pull/6060
yes, that's the same issue
From the user report, the panic is in newSDNPod on this line: