openshift-qe / ocp-qe-perfscale-ci

OpenShift QE PerfScale CI
Apache License 2.0

Add in logging and must gathers for failed jobs #37

Open paigerube14 opened 2 years ago

paigerube14 commented 2 years ago

Similar to the Upgrade CI, we need to be able to run lots of jobs and log issues without the cluster still being around.

It would be helpful to print logs and, in certain cases, collect a must-gather so that bugs can be opened properly.

Some thoughts:

paigerube14 commented 2 years ago

The issue Simon opened, and the open PR attached to it, cover the first bullet: https://github.com/openshift-qe/ocp-qe-perfscale-ci/issues/89

paigerube14 commented 2 years ago

@skordas @rpattath @mffiedler @qiliRedHat

What are your thoughts on making all of the job calls from the loaded-upgrade job itself? That is, have loaded-upgrade call flexy (already set up), scale up (currently done in each scale-ci job), the scale job itself (done), a cluster check (which I want to add based on this issue), upgrade (done), and destroy if necessary.

I think this would mean far fewer PRs and make things easier to manage in each of the branches. For this issue specifically, I want to add a call to cerberus that runs one iteration of checks on the cluster after the load completes, to make sure we should continue. Currently I would have to open a PR against each of the scale-ci jobs, and if any parameters ever needed to be added to the cerberus job, I would have to open PRs in each scale-ci job for that update. I am suggesting adding the call to the base loaded-upgrade job so that the loaded-upgrade job is all-encompassing.

For example, the env_vars variable wasn't being passed to the scale worker job from some of the scale-ci jobs, so I had to create a PR for each place in the scale-ci jobs that called the scale-worker job to add that variable.
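To make the proposal concrete, here is a minimal sketch of one parent job driving every stage and handing the same parameters to each of them. It is written in Python purely for illustration and is not the repository's actual pipeline code. The stage names, the `trigger` callback, and the `run_loaded_upgrade` function are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical job names, in the order proposed in the comment above.
STAGES = ["flexy-install", "scale-up", "scale-ci", "cerberus-check", "upgrade"]

# A trigger takes a job name and its parameters and reports success or failure.
Trigger = Callable[[str, Dict[str, str]], bool]


def run_loaded_upgrade(
    trigger: Trigger, params: Dict[str, str], destroy: bool = False
) -> List[Tuple[str, bool]]:
    """Run each stage in order and stop at the first failure."""
    results: List[Tuple[str, bool]] = []
    for stage in STAGES:
        # Every stage receives a copy of the same parameters (env_vars included),
        # so a new parameter only has to be added in this one place.
        ok = trigger(stage, dict(params))
        results.append((stage, ok))
        if not ok:
            break  # e.g. a failed cluster check keeps the upgrade from starting
    if destroy:
        # Cleanup runs whether or not an earlier stage failed.
        results.append(("destroy", trigger("destroy", dict(params))))
    return results
```

With this shape, a failing cluster check stops the run before the upgrade stage, and the list of results shows which stage to click into first.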

Cons I can think of: lots more clicking around to each job that ran off of the main loaded-upgrade run

Overall, what are your thoughts?

skordas commented 2 years ago

I think we can have a long pipeline as long as everything is well documented, mostly for the benefit of new team members. We are all growing with the changes :)

my few cents:

qiliRedHat commented 2 years ago

@paigerube14 I agree with your thinking about reducing PRs. I also think we could even use the loaded-upgrade job in our scale-ci test runs. That would merge two steps, 1) calling the Flexy-install job and 2) triggering the scale-ci job, into one and save some effort. But we would have to be able to pass more configuration from the loaded-upgrade job to the Flexy-install job, like VARIABLES_LOCATION and LAUNCHER_VARS. At least vm_type should be configurable, because in some large-scale clusters we need bigger master and worker types, as for the cluster density test: https://deploy-preview-41943--osdocs.netlify.app/openshift-enterprise/latest/scalability_andperformance/recommended-host-practices.html#mast[…]ing
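A rough sketch of the forwarding idea, in Python for illustration only: the parent job passes through only the install-related parameters that were actually set. The parameter names come from the comment above. That the Flexy-install job accepts them under exactly these names is an assumption, and `flexy_params` is a hypothetical helper.

```python
from typing import Dict

# Names taken from the comment above; assumed to match the downstream job's parameters.
FORWARDED = ("VARIABLES_LOCATION", "LAUNCHER_VARS")


def flexy_params(upstream: Dict[str, str]) -> Dict[str, str]:
    """Return only the forwarded parameters that have a non-empty value upstream."""
    return {name: upstream[name] for name in FORWARDED if upstream.get(name)}
```

Leaving unset parameters out, rather than sending empty strings, lets the downstream job keep its own defaults.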

qiliRedHat commented 2 years ago

@paigerube14 I created an Epic for our scale-ci automation enhancement work in 4.11 https://issues.redhat.com/browse/OCPQE-9141 to better organize our 4.11 work. I added your Jira task for this issue under it.