openshift / origin

Conformance test suite for OpenShift
http://www.openshift.org
Apache License 2.0

Cluster Loader (test/extended/cluster/cl.go) does not report the correct number of pods in Pending state #18664

Closed: wabouhamad closed this issue 6 years ago

wabouhamad commented 6 years ago

Cluster Loader (test/extended/cluster/cl.go) does not report the correct number of pods it created when they are in Pending state. It reports success immediately after creating the pods and exits.

Version

openshift v3.9.0-0.45.0 kubernetes v1.9.1+a0ce1bc657 etcd 3.2.8

Steps To Reproduce
  1. Create a config file with a pod template containing pod affinity rules such that the pods will fail scheduling and remain in Pending state.

See the config files in the Additional Information section.

  2. Run Cluster Loader with the config file: /usr/libexec/atomic-openshift/extended.test --ginkgo.focus="Load cluster" --viper-config=config/golang/pod-affinity-security-in-s1

Current Result

Pods get created and stay stuck in Pending state. Cluster Loader exits with a success result:

Ran 1 of 444 Specs in 6.740 seconds SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 443 Skipped PASS

Expected Result

Cluster Loader should show the correct number of Pending pods, or wait for the pods to reach Running state before exiting.

Additional Information

~/svt/openshift_scalability/config/golang # cat pod-affinity-security-in-s1.yaml

provider: local
ClusterLoader:
  cleanup: false
  projects:
    - num: 1
      basename: pod-affinity-
      tuning: default
      ifexists: delete
      pods:
        - num: 10
          image: docker.io/ocpqe/hello-pod
          basename: pod-affinity-security-in-s1-
          file: pod-pod-affinity-in-s1.json

  tuningsets:
    - name: default
      pods:
        stepping:
          stepsize: 40
          pause: 180
        ratelimit:
          delay: 0

cat ../../content/pod-pod-affinity-in-s1.json

{
  "kind": "Pod",
  "apiVersion": "v1",
  "metadata": {
    "name": "pod-affinity-ins1",
    "creationTimestamp": null,
    "labels": {
      "security": "s1"
    }
  },
  "spec": {
    "containers": [
      {
        "name": "hello-pod",
        "image": "docker.io/ocpqe/hello-pod",
        "ports": [
          {
            "containerPort": 8080,
            "protocol": "TCP"
          }
        ],
        "resources": {
                "requests": {
                    "cpu" : "15m",
                    "memory": "50Mi"
                },
                "limits": {
                    "cpu" : "15m",
                    "memory": "50Mi"
                }
        },
        "terminationMessagePath": "/dev/termination-log",
        "imagePullPolicy": "IfNotPresent",
        "capabilities": {},
        "securityContext": {
          "capabilities": {},
          "privileged": false
        }
      }
    ],
    "restartPolicy": "Always",
    "dnsPolicy": "ClusterFirst",
    "serviceAccount": "",
    "affinity": {
        "podAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {
                        "matchExpressions": [
                            {
                                "key": "security",
                                "operator": "In",
                                "values": [
                                    "s1"
                                ]
                            }
                        ]
                    },
                    "topologyKey": "kubernetes.io/hostname"
                }
            ]
        }
    }
  },
  "status": {}
}

--------------------------------- output:


/usr/libexec/atomic-openshift/extended.test --ginkgo.focus="Load cluster" --viper-config=config/golang/pod-affinity-security-in-s1
I0219 16:52:27.564670  115462 test.go:94] Extended test version v3.9.0-0.45.0
Running Suite: Extended
=======================
Random Seed: 1519059147 - Will randomize all specs
Will run 1 of 444 specs

I0219 16:52:27.638893  115462 e2e.go:56] The --provider flag is not set.  Treating as a conformance test.  Some tests may not be run.
Feb 19 16:52:27.638: INFO: >>> kubeConfig: /root/.kube/config
Feb 19 16:52:27.641: INFO: Waiting up to 4h0m0s for all (but 0) nodes to be schedulable
Feb 19 16:52:27.679: INFO: Waiting up to 10m0s for all pods (need at least 0) in namespace 'kube-system' to be running and ready
Feb 19 16:52:27.684: INFO: 0 / 0 pods in namespace 'kube-system' are running and ready (0 seconds elapsed)
Feb 19 16:52:27.684: INFO: expected 0 pod replicas in namespace 'kube-system', 0 are Running and Ready.
Feb 19 16:52:27.686: INFO: Waiting for pods to enter Success, but no pods in "kube-system" match label map[name:e2e-image-puller]
Feb 19 16:52:27.686: INFO: Dumping network health container logs from all nodes...
Feb 19 16:52:27.687: INFO: e2e test version: v1.9.1+a0ce1bc657
Feb 19 16:52:27.688: INFO: kube-apiserver version: v1.9.1+a0ce1bc657
I0219 16:52:27.688222  115462 e2e.go:56] The --provider flag is not set.  Treating as a conformance test.  Some tests may not be run.
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
------------------------------
[Feature:Performance][Serial][Slow] Load cluster 
  should load the cluster [Suite:openshift]
  /builddir/build/BUILD/atomic-openshift-git-0.b4a0c68/_output/local/go/src/github.com/openshift/origin/test/extended/cluster/cl.go:58
[BeforeEach] [Top Level]
  /builddir/build/BUILD/atomic-openshift-git-0.b4a0c68/_output/local/go/src/github.com/openshift/origin/test/extended/util/test.go:53
[BeforeEach] [Feature:Performance][Serial][Slow] Load cluster
  /builddir/build/BUILD/atomic-openshift-git-0.b4a0c68/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:134
STEP: Creating a kubernetes client
Feb 19 16:52:27.691: INFO: >>> kubeConfig: /root/.kube/config
STEP: Building a namespace api object
Feb 19 16:52:27.721: INFO: configPath is now "/tmp/extended-test-cl-jtq78-74btw-user.kubeconfig"
Feb 19 16:52:27.721: INFO: The user is now "extended-test-cl-jtq78-74btw-user"
Feb 19 16:52:27.721: INFO: Creating project "extended-test-cl-jtq78-74btw"
Feb 19 16:52:27.817: INFO: Waiting on permissions in project "extended-test-cl-jtq78-74btw" ...
Feb 19 16:52:27.831: INFO: No PodSecurityPolicies found; assuming PodSecurityPolicy is disabled.
STEP: Waiting for a default service account to be provisioned in namespace
[BeforeEach] [Feature:Performance][Serial][Slow] Load cluster
  /builddir/build/BUILD/atomic-openshift-git-0.b4a0c68/_output/local/go/src/github.com/openshift/origin/test/extended/cluster/cl.go:40
Feb 19 16:52:27.922: INFO: Using config config/golang/pod-affinity-security-in-s1

[It] should load the cluster [Suite:openshift]
  /builddir/build/BUILD/atomic-openshift-git-0.b4a0c68/_output/local/go/src/github.com/openshift/origin/test/extended/cluster/cl.go:58
Feb 19 16:52:27.923: INFO: Our tuning set is: &{default {{40 180ns 0s} {0s}} {{0 0s 0s} {0s}}}
Feb 19 16:52:27.923: INFO: Running 'oc new-project --config=/tmp/extended-test-cl-jtq78-74btw-user.kubeconfig --namespace=extended-test-cl-jtq78-74btw pod-affinity-0'
Now using project "pod-affinity-0" on server "https://ip-172-31-24-19.us-west-2.compute.internal:8443".

You can add applications to this project with the 'new-app' command. For example, try:

    oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-ex.git

to build a new example application in Ruby.
Feb 19 16:52:28.165: INFO: 1/1 : Created new namespace: pod-affinity-0
Feb 19 16:52:28.165: INFO: The loaded config file is: [{Name:hello-pod Image:docker.io/ocpqe/hello-pod Command:[] Args:[] WorkingDir: Ports:[{Name: HostPort:0 ContainerPort:8080 Protocol:TCP HostIP:}] EnvFrom:[] Env:[] Resources:{Limits:map[cpu:{i:{value:15 scale:-3} d:{Dec:<nil>} s:15m Format:DecimalSI} memory:{i:{value:52428800 scale:0} d:{Dec:<nil>} s:50Mi Format:BinarySI}] Requests:map[memory:{i:{value:52428800 scale:0} d:{Dec:<nil>} s:50Mi Format:BinarySI} cpu:{i:{value:15 scale:-3} d:{Dec:<nil>} s:15m Format:DecimalSI}]} VolumeMounts:[] VolumeDevices:[] LivenessProbe:nil ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy: ImagePullPolicy:IfNotPresent SecurityContext:&SecurityContext{Capabilities:&Capabilities{Add:[],Drop:[],},Privileged:*false,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,} Stdin:false StdinOnce:false TTY:false}]
Feb 19 16:52:28.166: INFO: Pod environment variables will not be modified.
Feb 19 16:52:28.166: INFO: 1/10 : Creating pod
Feb 19 16:52:28.179: INFO: 2/10 : Creating pod
Feb 19 16:52:28.187: INFO: 3/10 : Creating pod
Feb 19 16:52:28.194: INFO: 4/10 : Creating pod
Feb 19 16:52:28.202: INFO: 5/10 : Creating pod
Feb 19 16:52:28.210: INFO: 6/10 : Creating pod
Feb 19 16:52:28.220: INFO: 7/10 : Creating pod
Feb 19 16:52:28.226: INFO: 8/10 : Creating pod
Feb 19 16:52:28.234: INFO: 9/10 : Creating pod
Feb 19 16:52:28.241: INFO: 10/10 : Creating pod
Feb 19 16:52:28.254: INFO: Cluster loading duration: 331.375285ms
[AfterEach] [Feature:Performance][Serial][Slow] Load cluster
  /builddir/build/BUILD/atomic-openshift-git-0.b4a0c68/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:135
Feb 19 16:52:28.255: INFO: Waiting up to 3m0s for all (but 0) nodes to be ready
STEP: Destroying namespace "extended-test-cl-jtq78-74btw" for this suite.
Feb 19 16:52:34.268: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
Feb 19 16:52:34.318: INFO: namespace: extended-test-cl-jtq78-74btw, resource: bindings, ignored listing per whitelist
Feb 19 16:52:34.378: INFO: namespace extended-test-cl-jtq78-74btw deletion completed in 6.120433511s

• [SLOW TEST:6.687 seconds]
[Feature:Performance][Serial][Slow] Load cluster
/builddir/build/BUILD/atomic-openshift-git-0.b4a0c68/_output/local/go/src/github.com/openshift/origin/test/extended/cluster/cl.go:27
  should load the cluster [Suite:openshift]
  /builddir/build/BUILD/atomic-openshift-git-0.b4a0c68/_output/local/go/src/github.com/openshift/origin/test/extended/cluster/cl.go:58
------------------------------
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSFeb 19 16:52:34.378: INFO: Running AfterSuite actions on all node
Feb 19 16:52:34.378: INFO: Running AfterSuite actions on node 1

Ran 1 of 444 Specs in 6.740 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 443 Skipped PASS

============

oc get pods -n pod-affinity-0 
NAME                                 READY     STATUS    RESTARTS   AGE
pod-affinity-security-in-s1--pod-0   0/1       Pending   0          1m
pod-affinity-security-in-s1--pod-1   0/1       Pending   0          1m
pod-affinity-security-in-s1--pod-2   0/1       Pending   0          1m
pod-affinity-security-in-s1--pod-3   0/1       Pending   0          1m
pod-affinity-security-in-s1--pod-4   0/1       Pending   0          1m
pod-affinity-security-in-s1--pod-5   0/1       Pending   0          1m
pod-affinity-security-in-s1--pod-6   0/1       Pending   0          1m
pod-affinity-security-in-s1--pod-7   0/1       Pending   0          1m
pod-affinity-security-in-s1--pod-8   0/1       Pending   0          1m
pod-affinity-security-in-s1--pod-9   0/1       Pending   0          1m
wabouhamad commented 6 years ago

@sjug PTAL

sjug commented 6 years ago

Hey @wabouhamad, right now we only wait for pods to be in Running state once your step size has been met. In this instance, if you changed your TuningSet step size to 10 rather than 40, it would wait.

If you want to verify that it works, give that a try; this is the current implementation and not a bug. If we want to change the behaviour, we should open an RFE instead.
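
For reference, the tuningsets section from the config above, with the step size dropped to 10 to match the 10 pods per project, would look something like this (only the stepsize value changes):

  tuningsets:
    - name: default
      pods:
        stepping:
          stepsize: 10   # wait for this many created pods to reach Running before continuing
          pause: 180
        ratelimit:
          delay: 0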

wabouhamad commented 6 years ago

Hey @sjug, with step size 10 it actually waited indefinitely, since the pods were stuck in Pending state. When I hit Ctrl-C to kill it, it exited with a failure message, but it still showed 1 Passed, 0 Failed, 0 Pending:

Ran 1 of 99 Specs in 647.201 seconds FAIL! -- 1 Passed | 0 Failed | 0 Pending | 98 Skipped

Is this expected behavior?

Feb 19 20:28:13.265: INFO: 1/1 : Created new namespace: pod-affinity-0
Feb 19 20:28:13.266: INFO: The loaded config file is: [{Name:hello-pod Image:docker.io/ocpqe/hello-pod Command:[] Args:[] WorkingDir: Ports:[{Name: HostPort:0 ContainerPort:8080 Protocol:TCP HostIP:}] EnvFrom:[] Env:[] Resources:{Limits:map[cpu:{i:{value:15 scale:-3} d:{Dec:<nil>} s:15m Format:DecimalSI} memory:{i:{value:52428800 scale:0} d:{Dec:<nil>} s:50Mi Format:BinarySI}] Requests:map[cpu:{i:{value:15 scale:-3} d:{Dec:<nil>} s:15m Format:DecimalSI} memory:{i:{value:52428800 scale:0} d:{Dec:<nil>} s:50Mi Format:BinarySI}]} VolumeMounts:[] VolumeDevices:[] LivenessProbe:nil ReadinessProbe:nil Lifecycle:nil TerminationMessagePath:/dev/termination-log TerminationMessagePolicy: ImagePullPolicy:IfNotPresent SecurityContext:&SecurityContext{Capabilities:&Capabilities{Add:[],Drop:[],},Privileged:*false,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,} Stdin:false StdinOnce:false TTY:false}]
Feb 19 20:28:13.266: INFO: Pod environment variables will not be modified.
Feb 19 20:28:13.266: INFO: 1/10 : Creating pod
Feb 19 20:28:14.283: INFO: 2/10 : Creating pod
Feb 19 20:28:14.293: INFO: 3/10 : Creating pod
Feb 19 20:28:14.300: INFO: 4/10 : Creating pod
Feb 19 20:28:14.310: INFO: 5/10 : Creating pod
Feb 19 20:28:14.321: INFO: 6/10 : Creating pod
Feb 19 20:28:14.330: INFO: 7/10 : Creating pod
Feb 19 20:28:14.339: INFO: 8/10 : Creating pod
Feb 19 20:28:14.346: INFO: 9/10 : Creating pod
Feb 19 20:28:14.358: INFO: 10/10 : Creating pod
Feb 19 20:28:14.369: INFO: Waiting for pods created this step to be running
^C
---------------------------------------------------------
Received interrupt.  Running AfterSuite...
^C again to terminate immediately
Feb 19 20:38:53.947: INFO: Running AfterSuite actions on all node
Feb 19 20:38:53.947: INFO: Waiting up to 3m0s for all (but 0) nodes to be ready
STEP: Destroying namespace "extended-test-cl-bnkh8-4nwnz" for this suite.
Feb 19 20:38:59.958: INFO: Waiting up to 30s for server preferred namespaced resources to be successfully discovered
Feb 19 20:39:00.005: INFO: namespace: extended-test-cl-bnkh8-4nwnz, resource: bindings, ignored listing per whitelist
Feb 19 20:39:00.069: INFO: namespace extended-test-cl-bnkh8-4nwnz deletion completed in 6.11985045s
Feb 19 20:39:00.069: INFO: Running AfterSuite actions on node 1

Ran 1 of 99 Specs in 647.201 seconds
FAIL! -- 1 Passed | 0 Failed | 0 Pending | 98 Skipped root@ip-172-31-24-19: ~/svt/openshift_scalability # 
sjug commented 6 years ago

Are you asking whether the indefinite waiting is expected, or about the response to SIGINT?

The timeout for the tuningset is user-configurable, so just set tuning.Pods.Stepping.Timeout in the config and it will not wait indefinitely. As for the SIGINT handling: at the point where it is received during the wait, there is nothing afterwards that would fail. So it is expected behaviour for CL.
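
Assuming the YAML key mirrors the Go field name the same way the other stepping fields do, a stepping block with an explicit timeout might look like the sketch below (the key name and unit handling are assumptions, not verified against the config parser):

  tuningsets:
    - name: default
      pods:
        stepping:
          stepsize: 10
          pause: 180
          timeout: 300   # assumed key for tuning.Pods.Stepping.Timeout; stop waiting for Running pods after this
        ratelimit:
          delay: 0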

wabouhamad commented 6 years ago

@sjug, I was referring to the counts in the final FAIL message. Shouldn't the Failed counter be 1 and Passed 0, since none of the pods made it to Running state? Instead we see:

Ran 1 of 99 Specs in 647.201 seconds FAIL! -- 1 Passed | 0 Failed | 0 Pending | 98 Skipped

sjug commented 6 years ago

As for the SIGINT handling: at the point where it is received during the wait, there is nothing afterwards that would fail. So it is expected behaviour for CL.

How would it know whether the pods made it into Running state when you sent SIGINT during execution? As I said in my last message (quoted above), there is nothing else that would fail after that check, so it passed all checks.

wabouhamad commented 6 years ago

That's right, since you are not trapping the SIGINT. Closing this issue since it is expected behavior. Thanks.