openshift / svt

Apache License 2.0

add test case for OCP-9226 conc_registry_pull #748

Closed qiliRedHat closed 1 year ago

qiliRedHat commented 1 year ago

To automate regression test case OCP-9226 - Concurrent pull from the registry: https://polarion.engineering.redhat.com/polarion/redirect/project/OSE/workitem?id=OCP-9226
Jira task: https://issues.redhat.com/browse/OCPQE-13437

Steps

  1. Install registry machinesets with 2 replicas
  2. Move registry pods to the registry machinesets
  3. Use kube-burner to create x (first parameter) namespaces with the cakephp-mysql-persistent application
  4. Make sure all application pods are running. If not, try to rebuild or redeploy according to the failures.
  5. Scale each namespace's cakephp-mysql-persistent application from 1 replica to m, n, ... (second and subsequent parameters) replicas; count the time from scaling up until all replicas are running.
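The scaling-and-timing flow in the steps above can be sketched roughly as a shell script. This is a hypothetical sketch, not the actual PR code: the namespace prefix, app name, and polling interval are assumptions taken from the logs in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of steps 3-5: scale the app in every test namespace
# and time how long until all replicas are Running. The namespace prefix
# (conc-registry-pull) and app name are assumed from the logs below.
APP=cakephp-mysql-persistent

scale_and_time() {
  local replicas=$1 start end ns
  # Scale the DeploymentConfig in every test namespace.
  for ns in $(oc get ns -o name | grep conc-registry-pull | cut -d/ -f2); do
    oc scale dc/"$APP" -n "$ns" --replicas="$replicas"
  done
  start=$(date +%s)
  # Poll until no pod carrying the app label is in a non-Running state.
  while oc get po -A -l deploymentconfig="$APP" --no-headers | grep -qv Running; do
    sleep 10
  done
  end=$(date +%s)
  echo "Time taken for scaling up to $replicas replicas for applications : $((end - start)) seconds"
}

# e.g. PARAMETERS "1000 2 5 10" -> scale to 2, then 5, then 10 replicas.
for r in "$@"; do scale_and_time "$r"; done
```

Each positional parameter after the first drives one scale-up pass, which matches the "Time taken for scaling up to N replicas" lines in the Jenkins output.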
qiliRedHat commented 1 year ago

Test 1 PARAMETERS 1000 2 5 10

https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/e2e-benchmarking-multibranch-pipeline/job/regression-test/229/console

01-13 11:06:45.182 ====Results====
01-13 11:06:45.182 Time taken for scaling up to 2 replicas for applications : 422 seconds
01-13 11:06:45.182 Time taken for scaling up to 5 replicas for applications : 433 seconds
01-13 11:06:45.182 Time taken for scaling up to 10 replicas for applications : 525 seconds
01-13 11:06:45.182 ====Test Passed====

Test 2 PARAMETERS 1000 20 ENV_VARS SCALE_ONLY=true

https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/e2e-benchmarking-multibranch-pipeline/job/regression-test/230/console

01-13 11:29:18.240 ====Results====
01-13 11:29:18.240 Time taken for scaling up to 20 replicas for applications : 1010 seconds
01-13 11:29:18.240 ====Test Passed====

qiliRedHat commented 1 year ago

Some known issues with the test

  1. After running kube-burner, sometimes some builds or deployments are not successful. I added a fix function to find them and rebuild or redeploy, trying to get all pods running.

I read the documentation about DeploymentConfig: https://docs.openshift.com/container-platform/4.11/applications/deployments/what-deployments-are.html#deployments-design_what-deployments-are

For DeploymentConfig objects, if a node running a deployer pod goes down, it will not get replaced. The process waits until the node comes back online or is manually deleted. Manually deleting the node also deletes the corresponding pod. This means that you can not delete the pod to unstick the rollout, as the kubelet is responsible for deleting the associated pod.

In the test I used enable_spot_instance_workers: "no" to avoid worker node recreation, which may cause DeploymentConfig rollout failures.
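A minimal sketch of such a fix pass follows. This is hypothetical, not the PR's actual fix function: the column positions in the `oc get` output and the `<dc>-<n>-deploy` deployer-pod naming convention are assumptions.

```shell
# Hypothetical fix pass: restart failed builds and retry failed rollouts so
# that all application pods eventually reach Running.
fix_failures() {
  local ns build
  # Builds that ended in Failed/Error/Cancelled: clone and re-run them.
  # (With -A, NAMESPACE is column 1, NAME column 2, STATUS column 5 — assumed.)
  oc get builds -A --no-headers | awk '$5 ~ /Failed|Error|Cancelled/ {print $1, $2}' |
  while read -r ns build; do
    oc start-build -n "$ns" --from-build="$build"
  done
  # A deployer pod (named <dc>-<n>-deploy) in Error means the rollout failed:
  # retry the DeploymentConfig in that namespace.
  oc get po -A --no-headers | awk '$2 ~ /-deploy$/ && $4 == "Error" {print $1}' |
  while read -r ns; do
    oc rollout retry -n "$ns" dc/cakephp-mysql-persistent
  done
}
```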

  2. When the total number of Running cakephp-mysql-persistent pods reaches 22774, no more pods can be started because of '200 Insufficient memory'. That means more than 20 replicas per namespace are not supported in a 200-worker cluster with m5.4xlarge instances.
% oc get po -A -l deploymentconfig=cakephp-mysql-persistent | grep -c Running
22774

 % oc get po -A -l deploymentconfig=cakephp-mysql-persistent | grep -v Running | head -n 2 
NAMESPACE                 NAME                               READY   STATUS    RESTARTS      AGE
conc-registry-pull-1000   cakephp-mysql-persistent-1-542sg   0/1     Pending   0             147m

 % oc describe po -n conc-registry-pull-1000   cakephp-mysql-persistent-1-542sg

....
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  149m                 default-scheduler  0/209 nodes are available: 200 Insufficient memory, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 9 node(s) didn't match Pod's node affinity/selector. preemption: 0/209 nodes are available: 200 No preemption victims found for incoming pod, 9 Preemption is not helpful for scheduling.
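A back-of-the-envelope check of those numbers (22774 Running pods across 200 workers and 1000 namespaces, taken from the run above; illustrative arithmetic only):

```shell
# Illustrative arithmetic using the figures from the run above.
pods=22774; workers=200; namespaces=1000
echo "Running pods per worker at saturation: $((pods / workers))"
echo "Replicas per namespace reached: $((pods / namespaces))"
```

About 113 pods per worker and 22 replicas per namespace at the point memory ran out, which is consistent with the ">20 replicas not supported" conclusion.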
  3. When there are 1000 namespaces with 40 cakephp-mysql-persistent replicas each, oc cannot get back the cluster-wide pod list, though it can still list pods from a single test namespace. I think that is because the pod count is too large for the query; 30 replicas still works.

    1000 namespaces, 30 replicas:
    % oc get po -A -l deploymentconfig=cakephp-mysql-persistent --no-headers | wc -l
    30000

    1000 namespaces, 40 replicas:

    % oc get po -A -l deploymentconfig=akephp-mysql-persistent -v 9
    I0113 13:20:33.150360   56389 loader.go:374] Config loaded from file:  /Users/qili/Downloads/kubeconfig
    I0113 13:20:33.161010   56389 round_trippers.go:466] curl -v -XGET  -H "Accept: application/json;as=Table;v=v1;g=meta.k8s.io,application/json;as=Table;v=v1beta1;g=meta.k8s.io,application/json" -H "User-Agent: oc/4.12.0 (darwin/amd64) kubernetes/854f807" 'https://api.qili-awsovn4xns.qe.devcluster.openshift.com:6443/api/v1/pods?labelSelector=deploymentconfig%3Dakephp-mysql-persistent&limit=500'
    I0113 13:20:33.168606   56389 round_trippers.go:495] HTTP Trace: DNS Lookup for api.qili-awsovn4xns.qe.devcluster.openshift.com resolved to [{52.15.154.40 } {3.17.213.176 } {3.139.154.187 }]
    I0113 13:20:33.379476   56389 round_trippers.go:510] HTTP Trace: Dial to tcp:52.15.154.40:6443 succeed
    I0113 13:20:36.168950   56389 round_trippers.go:553] GET https://api.qili-awsovn4xns.qe.devcluster.openshift.com:6443/api/v1/pods?labelSelector=deploymentconfig%3Dakephp-mysql-persistent&limit=500 200 OK in 3007 milliseconds
    I0113 13:20:36.169085   56389 round_trippers.go:570] HTTP Statistics: DNSLookup 7 ms Dial 210 ms TLSHandshake 318 ms ServerProcessing 2469 ms Duration 3007 ms
    I0113 13:20:36.169111   56389 round_trippers.go:577] Response Headers:
    I0113 13:20:36.169137   56389 round_trippers.go:580]     Date: Fri, 13 Jan 2023 05:20:36 GMT
    I0113 13:20:36.169159   56389 round_trippers.go:580]     Audit-Id: f36129c5-9c8a-4f50-8a13-12c30b7973b4
    I0113 13:20:36.169181   56389 round_trippers.go:580]     Cache-Control: no-cache, private
    I0113 13:20:36.169202   56389 round_trippers.go:580]     Content-Type: application/json
    I0113 13:20:36.169223   56389 round_trippers.go:580]     Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
    I0113 13:20:36.169246   56389 round_trippers.go:580]     X-Kubernetes-Pf-Flowschema-Uid: d0c9ddee-64a9-490a-97d7-1faa7e57d1c5
    I0113 13:20:36.169267   56389 round_trippers.go:580]     X-Kubernetes-Pf-Prioritylevel-Uid: b8b794ec-4994-46bd-9c83-0ef879e16eeb
    I0113 13:20:36.169289   56389 round_trippers.go:580]     Content-Length: 2936
    I0113 13:20:36.169550   56389 request.go:1154] Response Body: {"kind":"Table","apiVersion":"meta.k8s.io/v1","metadata":{"resourceVersion":"1340642"},"columnDefinitions":[{"name":"Name","type":"string","format":"name","description":"Name must be unique within a namespace. Is required when creating resources, although some resources may allow a client to request the generation of an appropriate name automatically. Name is primarily intended for creation idempotence and configuration definition. Cannot be updated. More info: http://kubernetes.io/docs/user-guide/identifiers#names","priority":0},{"name":"Ready","type":"string","format":"","description":"The aggregate readiness state of this pod for accepting traffic.","priority":0},{"name":"Status","type":"string","format":"","description":"The aggregate status of the containers in this pod.","priority":0},{"name":"Restarts","type":"string","format":"","description":"The number of times the containers in this pod have been restarted and when the last container in this pod has restarted.","priority":0},{"name":"Age","type":"string","format":"","description":"CreationTimestamp is a timestamp representing the server time when this object was created. It is not guaranteed to be set in happens-before order across separate operations. Clients may not set this value. It is represented in RFC3339 form and is in UTC.\n\nPopulated by the system. Read-only. Null for lists. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#metadata","priority":0},{"name":"IP","type":"string","format":"","description":"IP address allocated to the pod. Routable at least within the cluster. Empty if not yet allocated.","priority":1},{"name":"Node","type":"string","format":"","description":"NodeName is a request to schedule this pod onto a specific node. 
If it is non-empty, the scheduler simply schedules this pod onto that node, assuming that it fits resource requirements.","priority":1},{"name":"Nominated Node","type":"string","format":"","description":"nominatedNodeName is set only when this pod preempts other pods on the node, but it cannot be scheduled right away as preemption victims receive their graceful termination periods. This field does not guarantee that the pod will be scheduled on this node. Scheduler may decide to place the pod elsewhere if other nodes become available sooner. Scheduler may also decide to give the resources on this node to a higher priority pod that is created after preemption. As a result, this field may be different than PodSpec.nodeName when the pod is scheduled.","priority":1},{"name":"Readiness Gates","type":"string","format":"","description":"If specified, all readiness gates will be evaluated for pod readiness. A pod is ready when all its containers are ready AND all conditions specified in the readiness gates have status equal to \"True\" More info: https://git.k8s.io/enhancements/keps/sig-network/580-pod-readiness-gates","priority":1}],"rows":[]}
    No resources found
    % oc get po -n conc-registry-pull-1                                         
    NAME                                  READY   STATUS      RESTARTS   AGE
    cakephp-mysql-persistent-1-26mxp      1/1     Running     0          25m
    cakephp-mysql-persistent-1-2gjzm      1/1     Running     0          61m
    ....
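A possible workaround, assuming the cluster-wide query fails because of the list size: since per-namespace queries still succeed, count Running pods namespace by namespace and sum, so each API call stays small. A hypothetical sketch (the namespace prefix is an assumption from the logs above):

```shell
# Hypothetical workaround: instead of one cluster-wide list, query each test
# namespace separately and sum the Running pods.
count_running() {
  local total=0 n ns
  for ns in $(oc get ns -o name | grep conc-registry-pull | cut -d/ -f2); do
    n=$(oc get po -n "$ns" -l deploymentconfig=cakephp-mysql-persistent \
          --no-headers 2>/dev/null | grep -c Running)
    total=$((total + n))
  done
  echo "$total"
}
```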
qiliRedHat commented 1 year ago

@mffiedler and @paigerube14 PTAL

qiliRedHat commented 1 year ago

@paigerube14 I found this PR escaped my attention and was not merged after I created the test cases for 4.12.

I tested with this branch on 4.13 and the test passed. Please help review and merge this PR.

New test case: OCP-9226 - Concurrent pull from the registry https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/e2e-benchmarking-multibranch-pipeline/job/regression-test/260/

Updated test case: OCP-26279 - [BZ 1752636] Networkpolicy should be applied for large namespaces https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/scale-ci/job/e2e-benchmarking-multibranch-pipeline/job/regression-test/259

qiliRedHat commented 1 year ago

@paigerube14 I did a rebase. Please help review this when you have time.

paigerube14 commented 1 year ago

/lgtm