kubernetes-sigs / kubespray

Deploy a Production Ready Kubernetes Cluster
Apache License 2.0

Use pre-commit in CI as well to reduce duplication #11211

Closed VannTen closed 3 months ago

VannTen commented 3 months ago

What would you like to be added

Switch some of our CI (all the tests mirroring our pre-commit hooks) to pre-commit.ci

Why is this needed

This would avoid duplication between .pre-commit-config.yaml and our .gitlab-ci/ files

It would also free up some of our CI resources. We should check that pre-commit.ci works correctly with the GitHub status API (Prow relies on that to merge PRs), and we probably don't want the autofix PR stuff.

I don't know who can add the app to the repository, though; are approvers also repository admins? @yankay @mzaian @floryut

yankay commented 3 months ago


Hi @VannTen

Great idea.

@floryut may have the permission :-)

ant31 commented 3 months ago

I've added it to the project

See the linked PR (#11223).

Looks like pre-commit can be run from the CI to avoid the duplication mentioned above; something like this could be added to .gitlab-ci.yml:

build:
  image: python:3.8
  script:
    - pip install pre-commit        # install the hook runner
    - pip install .                 # project dependencies, in case some hooks need them
    - pre-commit run --all-files    # run every configured hook against the whole tree

That way it can fail at an early stage and avoid triggering all the tests/builds.

VannTen commented 3 months ago

Looks like we should disable the auto-fixing PRs, since they mess with the CLA stuff and all (and I'd rather have PR authors fix their issues anyway). I think it's configurable through the pre-commit configuration file in the repository.
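If I read the pre-commit.ci docs right, that would be a small addition to .pre-commit-config.yaml; a minimal sketch, worth double-checking against their documentation:

ci:
  autofix_prs: false   # don't push automatic fix PRs to contributors' branches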

Looks like pre-commit can be run from the CI to avoid the duplication mentioned above; something like this could be added to .gitlab-ci.yml:

That way it can fail at an early stage and avoid triggering all the tests/builds.

Well, if we also run pre-commit in GitLab CI, that's not avoiding duplication, is it? I'm not sure what you mean.

Besides, if we were to run it in GitLab CI (that's possible as well; I was simply eyeing pre-commit.ci because it seems like less work for us), we should use a dedicated image and cache the hook environments, otherwise it's going to take a lot of unnecessary time (and we should also parallelize the runs somehow).
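As a rough sketch of the caching part, assuming we point pre-commit's cache at the workspace via PRE_COMMIT_HOME (job name, image tag and cache key below are placeholders):

pre-commit:
  stage: unit-tests
  image: python:3.11
  variables:
    PRE_COMMIT_HOME: "$CI_PROJECT_DIR/.cache/pre-commit"   # keep hook envs inside the project dir so GitLab can cache them
  cache:
    key: pre-commit-$CI_COMMIT_REF_SLUG
    paths:
      - .cache/pre-commit
  script:
    - pip install pre-commit
    - pre-commit run --all-files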

ant31 commented 3 months ago

if we also run pre-commit in GitLab CI, that's not avoiding duplication, is it?

Which duplication are you talking about, then?

VannTen commented 3 months ago

I mean if we run it both in GitLab CI and in pre-commit.ci.

(Though yeah, it would still deduplicate between .pre-commit-config.yaml and .gitlab-ci/, but we should probably not run the hooks twice, I think?)

ant31 commented 3 months ago

My understanding is that some of the tests are duplicated between .gitlab-ci/ and .pre-commit-config.yaml.

For example, in .pre-commit-config.yaml:

  - repo: https://github.com/adrienverge/yamllint.git
    rev: v1.27.1
    hooks:
      - id: yamllint
        args: [--strict]

and in .gitlab-ci/lint.yaml:

---
yamllint:
  extends: .job
  stage: unit-tests
  tags: [light]
  variables:
    LANG: C.UTF-8
  script:
    - yamllint --strict .
  except: ['triggers', 'master']

Instead of having them duplicated, as you suggested, we can have the whole .pre-commit-config.yaml run as one or a few jobs in the current pipeline by using the command pre-commit run --all-files.

VannTen commented 3 months ago

Instead of having them duplicated, as you suggested, we can have the whole .pre-commit-config.yaml run as one or a few jobs in the current pipeline by using the command pre-commit run --all-files.

Yes, there are two different things we're talking about, I think:

  1. don't duplicate .pre-commit-config.yaml into .gitlab-ci/
  2. choose whether we run the pre-commit hooks from our own CI (GitLab CI), either as a single stage or using, for instance, a dynamic child pipeline, OR in pre-commit.ci (via the GitHub app) and not in GitLab CI at all.

I think 1. is needed anyway, but for 2. I'm not sure what the preferred option is. pre-commit.ci is more rigid, but we don't have to maintain it. We should also check that it blocks Tide merges correctly.
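For illustration, one way option 2's "run from GitLab CI" variant could be split into parallel jobs is GitLab's parallel:matrix combined with pre-commit's ability to run a single hook by id (the hook ids other than yamllint are assumptions here, not necessarily what our config contains):

pre-commit-hooks:
  stage: unit-tests
  image: python:3.11
  parallel:
    matrix:
      - HOOK_ID: [yamllint, ansible-lint, shellcheck]   # hypothetical hook ids from .pre-commit-config.yaml
  script:
    - pip install pre-commit
    - pre-commit run $HOOK_ID --all-files               # run only the selected hook in this job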

ant31 commented 3 months ago

Yes, 100% with 1. For 2., I tend to go for fewer systems rather than more. It will still be something to look at, configure, upgrade... Another issue is that it would no longer be part of the pipeline, and tests would move to the next stage regardless of the pre-commit.ci result. Currently, the pipeline stops at the 'unit-tests' stage if any job is failing; we should continue to do that, as spinning up VMs and the other tests are much more expensive in resources.


VannTen commented 3 months ago

For 2., I tend to go for fewer systems rather than more. It will still be something to look at, configure, upgrade...

Upgrade, presumably not, but yeah, I see your point, that's valid.

Another issue is that it would no longer be part of the pipeline, and tests would move to the next stage regardless of the pre-commit.ci result. ...

I actually think we should have fewer dependencies between jobs, but with a better cancellation story (i.e., stopping pipelines immediately, or almost immediately, when there is a new push).

The question in that case is of course garbage collection.

(For failure cases, it's less clear-cut, because there is usually some value in getting the other jobs' output even when one fails.)

If we could achieve that, it would mean faster pipelines and maximize resource utilization. That's for later though, something I'd like to tackle in the (not too distant, hopefully) future.

In any case, I think my linked PR should achieve the deduplication while keeping stuff in gitlab-ci.

I haven't converted everything, but that should not be too hard (as for Vagrant, given HashiCorp's recent licensing shenanigans, we should probably drop it at some point, but for now I guess it can stay as it is).

ant31 commented 3 months ago

I actually think we should have fewer dependencies between jobs

We don't want to spend "money" on tests for a PR that doesn't pass unit tests and/or fails to deploy a simple setup. It got to this configuration because we had too many parallel long-running jobs on every push, and that used too many resources.

How many pushes per PR on average? How many commits failed on early tests? The pipeline is set up to use increasingly more resources at each stage.

If we could achieve that, it would mean faster pipelines and maximize resource utilization.

That's assuming resources are not utilized; they are. Faster pipeline: yes in theory, but in practice no, once the waiting time for an available runner or VM is added.

Each job in each stage is configured to run in parallel when possible. Sometimes they don't, because the limit has already been reached.

We have a fixed set of resources / $ to use. That said, the pipeline can still be reorganized and improved, but an overall guideline is that it should not use much "more".

VannTen commented 3 months ago

If we could achieve that, it would mean faster pipelines and maximize resource utilization.

That's assuming resources are not utilized; they are.

They are, but not always, and that's where it could be improved. The most annoying problem we regularly face is more related to using gitlab.com than to the cluster capacity itself:

If scheduling a pipeline would go over 500 jobs (in total), the pipeline is not scheduled and gets an error. That's more problematic than just waiting for CI, because the pipeline is not re-scheduled at all, and I think that, plus the confusion with Prow, confuses a lot of contributors (see the regular /retest comments, even though they don't do anything :/ ).

We have a fixed set of resources / $ to use. That said, the pipeline can still be reorganized and improved, but an overall guideline is that it should not use much "more".

I don't want to use more resources, I want to use the same amount, but faster (in other words, maximize resource utilization). The cluster does not autoscale, right ?

IMO, we should aim for:

  - auto-cancellation on re-push / fail (that'd need a rework of the provisioning so that GC is independent from GitLab CI jobs, I think. Maybe doable with ownerRefs in the cluster 🤔)

=> that would allow us to rely less on manual ordering / gating and use as many resources as are available, with minimal waste.

Even better would be coscheduling (so that we only schedule a complete testcase inventory in a max-usage scenario). Since we now have it in Kubespray, that might be easier than I think.

ant31 commented 3 months ago

Such changes were explored several years back (everything in a single stage) and rolled back due to too-high usage, even though we had more resources then than we do now: #2325

I don't want to use more resources, I want to use the same amount, but faster (in other words, maximize resource utilization). The cluster does not autoscale, right ?

It does for some jobs, up to 12 machines. Otherwise, it's only one machine running it all, and it's already heavily used in terms of memory; of course there are low-usage periods in the day. There are still various improvements possible, for example with those machines that are created on each push: they are 1. slow to get online and 2. expensive (we could have one more server for that cost). Things would already get faster without them (the docker-machine jobs).

auto-cancellation on re-push / fail (that'd need a rework of the provisioning so that GC is independent from GitLab CI jobs, I think. Maybe doable with ownerRefs in the cluster 🤔)

Last I checked, there's already a GC on schedule, so that's fine. And a new push on the same PR should cancel the previous pipeline (if it doesn't, it's misconfigured).

Even better would be coscheduling (so that we only schedule a complete testcase inventory in a max-usage scenario). Since we now have it in Kubespray, that might be easier than I think.

^ not sure I understand this

ant31 commented 3 months ago

@VannTen Once we remove the autoscaling part (all jobs with the tag "c3.small.x86"), things can be tested and remapped. Worst case, it gets reverted. Happy to try a few stages and see.

VannTen commented 3 months ago

I don't want to use more resources, I want to use the same amount, but faster (in other words, maximize resource utilization). The cluster does not autoscale, right ?

It does for some jobs, up to 12 machines. Otherwise, it's only one machine running it all, and it's already heavily used in terms of memory; of course there are low-usage periods in the day.

Those are the docker-machine jobs? I'm pretty sure we should get rid of those completely (in fact, I'd go as far as using exclusively the Kubernetes executor). docker-machine is abandoned (well, maintained by GitLab, but it's a dead end).

auto-cancellation on re-push / fail (that'd need a rework of the provisioning so that GC is independent from GitLab CI jobs, I think. Maybe doable with ownerRefs in the cluster 🤔)

Last I checked, there's already a GC on schedule, so that's fine.

Yeah, but the schedule is hourly, right? (that's the openstack-cleanup in https://gitlab.com/kargo-ci/kubernetes-sigs-kubespray/-/pipeline_schedules ?) What I was thinking about was having ownerRefs to the job Pod in the created KubeVirt objects, so that once the job is over and the pod is discarded, the KubeVirt stuff is always GC'ed by k8s immediately.
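Roughly, the idea would look like this, assuming the provisioning step can inject the runner pod's name and uid at creation time and that the KubeVirt objects live in the same namespace as the job pod (names below are placeholders):

apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: kubevirt-testcase-1                        # placeholder VMI name
  ownerReferences:
    - apiVersion: v1
      kind: Pod
      name: runner-job-pod                         # the GitLab CI job pod (placeholder)
      uid: 00000000-0000-0000-0000-000000000000    # must be the pod's actual uid, looked up at creation time
spec:
  domain:
    devices: {}
    resources:
      requests:
        memory: 1Gi

When the job pod is deleted, the Kubernetes garbage collector then deletes the VMI (and anything owned by it) without needing a separate cleanup job.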

And a new push on the same PR should cancel the previous pipeline (if it doesn't, it's misconfigured)

It does not work completely, because some jobs are marked interruptible: false, so once they are started, the pipeline does not cancel. I'm not sure if that's because it would leave stuff behind otherwise?
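If we wanted to try flipping that, it would be a one-line change per job; a minimal sketch with a placeholder job name (whether it is safe depends on the cleanup story above):

some-long-running-job:          # placeholder for one of the jobs currently marked interruptible: false
  interruptible: true           # lets "auto-cancel redundant pipelines" stop it when a new push arrives
  script:
    - ./run-testcase.sh         # placeholder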

Even better would be coscheduling (so that we only schedule a complete testcase inventory in a max-usage scenario). Since we now have it in Kubespray, that might be easier than I think.

^ not sure I understand this

Basically, if we have two jobs which each provision several KubeVirt VMs: jobA creates kubevirt-A-1, kubevirt-A-2, kubevirt-A-3 and jobB creates kubevirt-B-1, kubevirt-B-2, kubevirt-B-3.

When the cluster is full, it's possible to get into a state where kubevirt-A-1, kubevirt-A-2 and kubevirt-B-1 are scheduled and Running, while kubevirt-A-3, kubevirt-B-2 and kubevirt-B-3 are Pending because of node capacity.

Thus neither jobA nor jobB can progress, but they still block resources :/

Coscheduling would ensure that a set of pods is either scheduled all together or not at all (well, until more resources are freed). So it minimizes the impact of "pending" jobs.
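As a rough sketch of what that could look like with the scheduler-plugins coscheduling plugin (the exact API group, label key and scheduler name depend on the plugin version we'd deploy, so treat these as assumptions):

apiVersion: scheduling.x-k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: job-a                                      # one group per CI job (placeholder name)
spec:
  minMember: 3                                     # only schedule if all 3 pods of the job fit
---
apiVersion: v1
kind: Pod
metadata:
  name: kubevirt-A-1
  labels:
    scheduling.x-k8s.io/pod-group: job-a           # ties the pod to its PodGroup (label key varies by plugin version)
spec:
  schedulerName: scheduler-plugins-scheduler       # placeholder: whatever the coscheduling-enabled scheduler is called
  containers:
    - name: workload
      image: registry.k8s.io/pause:3.9             # stand-in for the actual virt-launcher / test pod

With that, jobA's three pods either all start together or all stay Pending, instead of half-starting and holding capacity.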

ant31 commented 3 months ago

Those are the docker-machine jobs?

Yes

Coscheduling would ensure that a set of pods is either scheduled all together or not at all (well, until more resources are freed). So it minimizes the impact of "pending" jobs.

Sounds good.

It does not work completely, because some jobs are marked interruptible: false, so once they are started, the pipeline does not cancel. I'm not sure if that's because it would leave stuff behind otherwise?

I don't know. I suggest removing interruptible: false and evaluating the situation.

Yeah, but the schedule is hourly, right?

Yes; also, the VMs / docker-machines had an automatic teardown after 4h (I reduced it to 1h last week).

What I was thinking about was having ownerRefs to the job Pod in the created KubeVirt objects, so that once the job is over and the pod is discarded, the KubeVirt stuff is always GC'ed by k8s immediately.

Yes, the idea was in the air at some point. Not sure why it wasn't implemented; it could be lack of time from maintainers or a blocker that I'm not aware of. Worth investigating again.