kubernetes / test-infra

Test infrastructure for the Kubernetes project.
Apache License 2.0

ProwJobs queue #21996

Closed: mborsz closed this issue 2 years ago

mborsz commented 3 years ago

What would you like to be added: I would like to be able to queue ProwJobs to run one after another, e.g. I create 3 ProwJobs: A, B, and C.

I would like B to start when A finishes, and then C when B finishes.

Why is this needed:

As the scalability team, we manage a lot of long-running jobs that share the same gcp-projects (or boskos pools). We want to be able to run them effectively and in an easy way.

While for most CI jobs we use periodic ProwJobs which is fine, for manual runs we use spreadsheets to book gcp projects, which is cumbersome.

I would like to be able to create a queue for each gcp-project (boskos pool) and say that only up to N jobs can run concurrently.

Currently I'm aware of two ways of achieving this:

  1. Running the job using a boskos pool, increasing the job timeout to [very long period = max timeout of job * max queue length], and increasing the boskos pool timeout to [very long period]. This way, all jobs will technically start once they are created, but only N (where N = boskos pool size) will be effectively running. I don't think this is the right solution, as it requires setting very long timeouts, which effectively disables the timeout for the actual part of the test (what if my test blocks for hours while the timeout is set to [very long period]?) and makes reporting hard (how am I going to know how much time my test takes?).
  2. Renaming all the jobs to the same pj.Spec.Job and setting MaxConcurrency = N (sketched just below). This basically works as we want, but it requires all our jobs to use the same name, which isn't feasible, as the names are then used to generate testgrid dashboards etc.
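
To make workaround 2 concrete, submitting a manual run under it would look roughly like the sketch below. All concrete names are made up, and only the fields relevant to queueing are shown; the rest of the spec would come from the prepared configuration.

// Sketch of workaround 2: every manual run is submitted as a ProwJob that
// reuses one shared spec.job name, so the existing MaxConcurrency check
// queues the runs, at the cost of all of them reporting under one name.
package manualruns

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    prowv1 "k8s.io/test-infra/prow/apis/prowjobs/v1"
)

// newQueuedRun wraps an arbitrary manual test in the shared job name.
func newQueuedRun(runID string) prowv1.ProwJob {
    return prowv1.ProwJob{
        ObjectMeta: metav1.ObjectMeta{
            GenerateName: "manual-scale-test-" + runID + "-",
            Namespace:    "default",
        },
        Spec: prowv1.ProwJobSpec{
            Job:            "scale-test-shared-queue", // same name for every queued run
            MaxConcurrency: 1,                         // at most one such run at a time
            // agent, pod spec, refs, etc. omitted; taken from the prepared configuration
        },
    }
}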

Suggested solution: basically, extend maxConcurrency to use a custom queue key. Concretely, I propose adding a queueName field to the ProwJob spec and applying maxConcurrency per queue.

This way, we can easily set the queueName field for the jobs we want to queue, set maxConcurrency on them, and they will wait for each other as expected. Additionally, this approach works both for jobs using a single gcp-project (by setting maxConcurrency=1) and for jobs using boskos (by setting maxConcurrency=pool size).

Technically this change seems pretty straightforward: it requires changing canExecuteConcurrently and a few places around it, but it is still a very simple change.
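
For illustration, here is a self-contained sketch of the intended behavior. It uses simplified stand-in types rather than Prow's real ProwJob API, and queueName/queueKey are only placeholders for whatever names we settle on.

// Self-contained sketch of the proposed scheduling rule, using simplified
// stand-in types instead of Prow's real ProwJob API. QueueName is the
// proposed new field; Job and MaxConcurrency mirror existing spec fields.
package main

import "fmt"

type jobSpec struct {
    Job            string
    QueueName      string // proposed field
    MaxConcurrency int
}

// queueKey is the essence of the proposal: jobs that set QueueName compete
// for the same concurrency slots even though their Job names differ.
func queueKey(s jobSpec) string {
    if s.QueueName != "" {
        return "queue/" + s.QueueName
    }
    return "job/" + s.Job
}

// canExecuteConcurrently mimics the existing per-job-name check, but counts
// running jobs by queue key instead of by job name.
func canExecuteConcurrently(candidate jobSpec, running []jobSpec) bool {
    if candidate.MaxConcurrency == 0 {
        return true // keep today's meaning: 0 is "no limit"
    }
    n := 0
    for _, r := range running {
        if queueKey(r) == queueKey(candidate) {
            n++
        }
    }
    return n < candidate.MaxConcurrency
}

func main() {
    a := jobSpec{Job: "scale-test-a", QueueName: "gce-scale-project-1", MaxConcurrency: 1}
    b := jobSpec{Job: "scale-test-b", QueueName: "gce-scale-project-1", MaxConcurrency: 1}
    fmt.Println(canExecuteConcurrently(b, []jobSpec{a})) // false: the queue is full
}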

mborsz commented 3 years ago

/area prow

mborsz commented 3 years ago

/cc @chaodaiG @alvaroaleman

Chao, Alvaro, what do you think about this feature request? Is it feasible?

chaodaiG commented 3 years ago

Would like to understand the requirements better:

While for most CI jobs we use periodic ProwJobs which is fine, for manual runs we use spreadsheets to book gcp projects, which is cumbersome.

When you mention periodic prowjobs, do you mean periodic prowjobs that execute all of A, B, and C in a single prowjob?

I'm not sure about manual runs; how do manual runs relate to prow?

mborsz commented 3 years ago

Sorry, maybe I wasn't clear on our use case.

Our team (sig scalability / GKE scalability) maintains tens of periodic prowjobs (for sig scalability: https://github.com/kubernetes/test-infra/tree/master/config/jobs/kubernetes/sig-scalability).

The workflow we use for testing new kubernetes changes is:

This way we can easily do A/B testing (comparing test results from our change against the baseline from the CI job).

This is what I meant by a manual run. It is a common way for us to test new kubernetes changes, infrastructure changes, or test changes. The problem we are trying to solve is project management: we have only a few projects with enough capacity to run 10k-core jobs, and with multiple people running such manual jobs, manual project management is getting hard.

I don't want to focus on periodic jobs, as project management for them is currently solved by preparing an appropriate cron schedule.

alvaroaleman commented 3 years ago

Hypothetically we could do something with labels (max concurrency for jobs with label foo). Would that solve your issues?

Disclaimer: I won't be implementing that (as I don't have the problem) and it would need acking from the other maintainers in terms of whether we are ok with the added complexity.

mborsz commented 3 years ago

Hypothetically we could do something with labels (max concurrency for jobs with label foo). Would that solve your issues?

Do you mean adding labels like:

labels:
  prow.k8s.io/queue: foo
  prow.k8s.io/queue-max-concurrency: "5"

?

If so, I think it solves our problem, but it seems slightly more complex than the solution I proposed in the first comment (where I simply reuse the existing max concurrency mechanism). Is there any particular reason why using labels is better than adding a new field to the ProwJob's spec?

Disclaimer: I won't be implementing that (as I don't have the problem) and it would need acking from the other maintainers in terms of whether we are ok with the added complexity.

We can make necessary code changes, once we get approval from maintainers for the approach we take.

chaodaiG commented 3 years ago

Thank you @mborsz, now I understand it better. Please correct me if I'm still getting it wrong:

alvaroaleman commented 3 years ago

If so, I think it solves our problem, but it seems slightly more complex than the solution I proposed in the first comment (where I simply reuse the existing max concurrency mechanism).

Two reasons:

  1. You might want to have both (e.g. only one job of name X and only N jobs of type scaletesting)
  2. The labels as an implementation detail would make the code a bit easier, as we get indexing by labels for free. However, on second thought, that isn't actually very relevant, because we can just add an index to our cache (see the sketch below).
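
For illustration, adding such an index with controller-runtime could look roughly like the sketch below, whether it keys on a queue label or, later, on a new spec field. The label key, index name, and package name are hypothetical.

// Hypothetical sketch of the cache index mentioned above: it lets us list
// all ProwJobs that belong to a given queue cheaply.
package queueindex

import (
    "context"

    prowv1 "k8s.io/test-infra/prow/apis/prowjobs/v1"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

const queueIndex = "prowjob-queue" // made-up index name

// addQueueIndex registers an index over a (hypothetical) queue label.
func addQueueIndex(ctx context.Context, mgr ctrl.Manager) error {
    return mgr.GetFieldIndexer().IndexField(ctx, &prowv1.ProwJob{}, queueIndex,
        func(o client.Object) []string {
            if q := o.GetLabels()["prow.k8s.io/queue"]; q != "" {
                return []string{q}
            }
            return nil
        })
}

// With the index registered, all members of queue "foo" can be listed via:
//   var pjs prowv1.ProwJobList
//   err := c.List(ctx, &pjs, client.MatchingFields{queueIndex: "foo"})
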
mborsz commented 3 years ago

@chaodaiG you are right: prepared-configuration.yaml is a prowjob object (like e.g. https://prow.k8s.io/rerun?prowjob=40e18bcd-a77a-11eb-a9fd-4a69d306464d), and testers do have admin power on prow cluster.

alvaroaleman commented 3 years ago

If you can create prowjobs anyway, why not just write a custom controller that knows everything it needs to know and creates prowjobs? We do that in a couple of places in OpenShift for things that cannot reasonably be re-used upstream because they are very downstream-specific. That would give you more flexibility and would not require getting an ack from us for every change.
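
As a very rough sketch of what such a downstream controller could do (all names, the label key, and the queueing strategy below are hypothetical, and this is not code that exists anywhere):

// Hypothetical downstream controller logic: hold the team's pending runs in
// a queue and create ProwJobs only while fewer than maxConcurrency jobs
// carrying the same queue label are still running.
package queuecontroller

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    prowv1 "k8s.io/test-infra/prow/apis/prowjobs/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

const queueLabel = "scalability.example.com/queue" // made-up label key

// syncQueue creates pending ProwJobs for one queue, respecting its limit.
// It is meant to be called from a reconcile loop or on a timer.
func syncQueue(ctx context.Context, c client.Client, queue string, maxConcurrency int, pending []prowv1.ProwJobSpec) error {
    var pjs prowv1.ProwJobList
    if err := c.List(ctx, &pjs, client.MatchingLabels{queueLabel: queue}); err != nil {
        return err
    }
    active := 0
    for _, pj := range pjs.Items {
        if pj.Status.CompletionTime == nil { // still triggered or pending
            active++
        }
    }
    // Release queued runs only while there is spare capacity in this queue.
    for i := 0; active < maxConcurrency && i < len(pending); i++ {
        pj := prowv1.ProwJob{
            ObjectMeta: metav1.ObjectMeta{
                GenerateName: "queued-run-",
                Namespace:    "default",
                Labels:       map[string]string{queueLabel: queue},
            },
            Spec: pending[i],
        }
        if err := c.Create(ctx, &pj); err != nil {
            return err
        }
        active++
    }
    return nil
}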

stevekuznetsov commented 3 years ago

The integration with tekton pipeline was supposed to solve this problem. Would a ProwJob that leverages that system be acceptable here?

chaodaiG commented 3 years ago

testers do have admin power on prow cluster.

This worries me a little bit; I didn't realize before that anyone other than the oncall has admin control on the prow service cluster, which doesn't seem very secure. +@cjwagner here for awareness or opinions

jkaniuk commented 3 years ago

testers do have admin power on prow cluster.

This worries me a little bit

The mentioned Prow cluster is not part of the k8s.io infrastructure.

chaodaiG commented 3 years ago

On a side note: I believe the proposed solution solves your problem, and I think it would be a good change. We can discuss further whether to use a new field or a label.

k8s-triage-robot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

mborsz commented 3 years ago

/remove-lifecycle stale

spiffxp commented 3 years ago

/sig testing

spiffxp commented 3 years ago

/sig scalability

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

mborsz commented 2 years ago

/remove-lifecycle stale

I would still like to do this eventually.

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

jupblb commented 2 years ago

/remove-lifecycle rotten

I'd like to add this feature if we reach a consensus on how it should be implemented.