saltstack-formulas / template-formula

SaltStack formula template filled with dummy content
http://docs.saltstack.com/en/latest/topics/development/conventions/formulas.html

Resolve Travis concurrent jobs or consider Azure pipelines #118

Open myii opened 5 years ago

myii commented 5 years ago

https://freenode.logbot.info/saltstack-formulas/20190516#c2180979:

- - -
09:59 myii @​gtmanfred A problem is beginning to manifest: as we convert more formulas to semantic-release (and generally get inspec testing working on formulas), our Travis queuing is slowing things down. It only appears to handle 3 jobs at a time; the most I've ever seen is 5, but that may have been on my own forks. So it's not uncommon to see 3/50 active jobs and when things get bad, it can take up to an hour to get results. I'm usually manually cancelling and rerunning (my own) jobs to minimise loads but contributors won't be able to do that. This load will only increase in time, as more formulas get converted. We could start limiting the number of jobs per formula but is there anything else that can be done?
10:06 slack1872 except paying for more concurrent builds, I'm not sure...
10:06 myii I reckon we're going to have to start limiting testing per formula.
10:07 myii Unless a formula has specific tests that need to be run on specific combinations of platform/salt-version.
10:08 slack1872 Same problem always pops: you had tests (which is great) and then your tests take too much time... We have the same problem @ work, with 8 kitchen VM created and highstate taking near 15 min. But on Jenkins at least we can run VM in parallel
10:09 myii What do you say to this idea:
10:09 myii 5 instances max, one from each os.
10:10 myii 2019.2 x2, 2018.3 x2 and 2017.7 x1.
10:12 myii We've got a nice set to choose from but now this is about how to select which ones to use per formula.
10:12 myii Is 5 sufficient for testing purposes?
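For illustration, a reduced selection along the lines proposed above (5 instances, one per OS, spread as 2019.2 x2, 2018.3 x2 and 2017.7 x1) might look something like the following .travis.yml fragment. The instance names and the kitchen invocation are illustrative assumptions, not the formulas' actual values:

    # .travis.yml fragment (sketch) -- a reduced, 5-instance selection;
    # instance names and the verify step are assumptions, not a definitive config
    env:
      matrix:
        - INSTANCE=debian-9-2019-2-py3
        - INSTANCE=ubuntu-1804-2019-2-py3
        - INSTANCE=centos-7-2018-3-py2
        - INSTANCE=fedora-29-2018-3-py2
        - INSTANCE=opensuse-423-2017-7-py2
    script:
      - bundle exec kitchen verify "${INSTANCE}"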

https://freenode.logbot.info/saltstack-formulas/20190516#c2181226:

- - -
12:39 gtmanfred no, limit concurrent jobs
12:39 gtmanfred in travis
12:39 gtmanfred you can have as many runs as you want, but it will only run one per repo at a time

https://freenode.logbot.info/saltstack-formulas/20190516#c2181237:

- - -
12:41 gtmanfred if that doesn’t work, we can look into moving to azure pipelines
12:41 gtmanfred i have been playing with that recently and it is pretty good
12:41 myii Excellent.
12:41 gtmanfred not as easy to build a matrix automatically for, but still pretty easy
12:42 myii Look forward to it.
12:43 gtmanfred Here is a version that builds the matrix https://github.com/saltstack/pepper/blob/develop/azure-pipelines.yml
12:43 myii Looks decent.
12:43 gtmanfred and here is a version where you just define it https://github.com/gtmanfred/pepper/blob/f15ce586a02b377ef1f8bdb48e378c5505179074/azure-pipelines.yml
12:44 myii The loop is nicer.
12:44 gtmanfred agreed
12:44 gtmanfred one thing that it does not have is allowed_failures
12:44 gtmanfred and the yaml can be gross
12:45 myii I don't think we're really using that.
12:45 myii So it shouldn't be a problem.
12:45 gtmanfred There is also this limitation with the yaml https://twitter.com/resticbackup/status/1117327446456655873
12:45 gtmanfred it does not parse whole elements in dot net, it parses one line at a time
12:46 myii OK, good to be aware of that.
12:47 myii If this comes to fruition, we'll put these notes in a little doc.
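For reference, the hand-defined style of Azure Pipelines matrix linked above is roughly of this shape -- a minimal sketch with assumed image, variable and step names, not a drop-in config:

    # azure-pipelines.yml (sketch) -- an explicitly defined matrix; names and values are assumptions
    jobs:
      - job: Test
        pool:
          vmImage: 'ubuntu-16.04'
        strategy:
          matrix:
            Salt2019_Py3:
              SALT_VERSION: '2019.2'
              PYTHON_VERSION: '3.6'
            Salt2018_Py2:
              SALT_VERSION: '2018.3'
              PYTHON_VERSION: '2.7'
        steps:
          - script: echo "test with Salt $(SALT_VERSION) on Python $(PYTHON_VERSION)"
            displayName: Run tests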
aboe76 commented 5 years ago

please also test cirrus-ci: https://cirrus-ci.org/#comparison-with-popular-ciaas

myii commented 5 years ago

@aboe76 Would you mind bringing that up in the Slack room? More eyes will see it there. In the meantime, what do you think about limiting the number of runs per formula? Should it be one for each os; is that sufficient?

aboe76 commented 5 years ago

most formulas only need these different OSes:

  • debian (this covers ubuntu/debian derivatives)
  • centos (this covers redhat/fedora derivatives)
  • opensuse (because sles/susemanager stuff)
  • bsd

When switching to Cirrus CI, the concurrency limit isn't a factor any more: https://cirrus-ci.org/#key-highlights

javierbertoli commented 5 years ago

most formulas only need these different OSes:

  • debian (this covers ubuntu/debian derivatives)
  • centos (this covers redhat/fedora derivatives)
  • opensuse (because sles/susemanager stuff)
  • bsd

I agree with @aboe76: having 20+ images available does not mean that we need to test with all of them. I think that to test that a formula works, the rationale should be something like:

  1. Latest Salt + py3 on
    • debian
    • centos
    • opensuse
    • bsd
  2. If something is known to fail or behaves differently with
    • a particular distro (ie, fedora),
    • distro release (ie, centos-6)
    • salt version (ie, 2018)
    • python version (ie, py )

tests for the required combination of the above should be added to the test matrix.
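To make that concrete, the platform selection might end up looking something like this kitchen.yml fragment -- a sketch only, where the platform names are illustrative assumptions rather than the formulas' actual images:

    # kitchen.yml fragment (sketch) -- baseline of latest Salt + py3 on one image per os_family,
    # plus an extra platform only where a known difference needs its own coverage
    platforms:
      - name: debian-9-2019-2-py3          # covers Debian/Ubuntu derivatives
      - name: centos-7-2019-2-py3          # covers RedHat/Fedora derivatives
      - name: opensuse-leap-15-2019-2-py3
      # BSD pending a suitable driver (discussed further down this thread)
      - name: centos-6-2018-3-py2          # only because a known centos-6/py2 difference exists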

I think that part of reviewing PRs should be checking that the testing matrix is sufficient but not excessive. Testing is easy; knowing what and when to test is not :yum:

Also, cirrus-ci sounds interesting :+1:

My 2c

myii commented 5 years ago

So after some discussion on Slack prompted by @aboe76, we've got Cirrus CI enabled for the whole org. Some specific notes about BSD:

- - -
21:07 gtmanfred you would need to use vagrant for bsd
21:08 gtmanfred realistically though, you could use kitchen passthrough, to run the commands on the bsd instances, instead of creating a new docker container to run the tests on
21:10 gtmanfred when you get to it, see how we using the proxy for windows on appveyor https://github.com/saltstack/salt-jenkins/blob/develop/.kitchen.appveyor.yml
21:10 gtmanfred it should translate to bsd probably
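Hypothetically, that proxy approach might translate into kitchen configuration along these lines -- a sketch only, assuming a pre-provisioned BSD box reachable over SSH (the host, user and file name are made up, not taken from salt-jenkins):

    # .kitchen.bsd.yml (sketch) -- point kitchen at an existing BSD instance via the
    # built-in proxy driver instead of creating a Docker container
    driver:
      name: proxy
      host: freebsd-test.example.com       # hypothetical pre-provisioned BSD host
      reset_command: "exit 0"              # nothing to reset on a persistent box
    transport:
      name: ssh
      username: vagrant                    # assumed user on the BSD box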

I've started work on testing Cirrus CI with cron-formula but I've not yet had success getting it to play nicely with kitchen. There'll be some special invocation but there aren't any examples out there in the wild that I could use as references.

n-rodriguez commented 5 years ago

I've started work on testing Cirrus CI with cron-formula but I've not yet had success getting it to play nicely with kitchen. There'll be some special invocation but there aren't any examples out there in the wild that I could use as references.

https://cirrus-ci.org/guide/writing-tasks/#additional-containers

myii commented 5 years ago

So a first example of the difference between 15 and 5, thanks to @daks:

Instances | Running time | Total time
--------- | ------------- | -------------
15 | 16 min 24 sec | 37 min 20 sec
5 | 7 min 17 sec | 10 min 47 sec

@aboe76 @javierbertoli Are these the 5 images you're suggesting, or did you mean to stick with the os_family images and only use the os variants when necessary?

By the way, also note that we're leaving all of the platforms available in kitchen.yml, so that any/all can be tested locally.

aboe76 commented 5 years ago

@myii I would prefer to stick with the os_family variants, and be a little bit conservative, so:

  • 'RedHat' => use centos images
  • 'Debian' => use debian images
  • 'Suse' => use opensuse leap images if possible
  • 'Bsd' => ??

Having to wait 16 minutes to have a test run for a PR is a bit much...

myii commented 5 years ago

@aboe76 Thanks for that. What do you think of the comment I've just made in response to your suggestion: https://github.com/saltstack-formulas/ufw-formula/pull/7#issuecomment-493707109?

aboe76 commented 5 years ago

Yes @myii, I think that is a good comment: just as we test that the formula works on each OS, the same approach can be applied to Salt versions.

The testing should be able to spot regressions and issues earlier, but not in a way that hinders or slows everything down. I know there are edge cases that we might miss, but I think most of the community will be better served if we stick to the broad road instead of the narrow paths.

myii commented 5 years ago

So #121 is a suggested workaround to use for the time being. Please have a look so that this selection can be propagated throughout all of the formulas using 10+ instances (i.e. all semantic-release formulas to date).

myii commented 5 years ago

I've just noticed that "Limit concurrent jobs" has been set to 1 in Travis. I'm not sure that's such a good idea, since it reduces the benefit of having these runs in parallel to get releases done quickly. Very busy periods aren't that common, so at quiet times it really slows things down. I've unset it again for the time being, but feel free to discuss this if it is an issue.

myii commented 5 years ago

So a little update. With some help from the Cirrus folk and working together with @javierbertoli, we've started making progress with Cirrus CI. Some parts of the developments are mentioned from this point forward, for anyone interested in more of the details. Or to get straight to the output, you can see it here: https://cirrus-ci.com/build/5647302937542656.

I'd like to pull in @alxwr here: Cirrus CI allows testing on *BSD as well. Is this something you'd be interested in helping out with? Or for whoever fancies taking this on, this comment above is pertinent to that.

We're also seriously considering keeping both Travis and Cirrus: Travis for the release process (i.e. commitlint and semantic-release), with Cirrus for the actual testing.

To gauge whether this is really going to make a significant difference, we need some "lambs to the slaughter" -- suggestions of formulas that can be converted to this hybrid system. Preferably formulas that involve *BSD, MacOS and Windows, so that we can test those platforms as well. It's not really much of a risk, since we can always re-enable the matrix in Travis if it doesn't work out.

n-rodriguez commented 5 years ago

We're also seriously considering keeping both Travis and Cirrus: Travis for the release process (i.e. commitlint and semantic-release), with Cirrus for the actual testing.

Why not... :) I didn't reply until now because I was thinking about the CI config file: it's cool to have a performant CI, but it's cool to have nice config files too ;) IMHO, Azure's config file is not very sexy, so it's not very easy to understand or to maintain/evolve. On the other hand, Cirrus' config file is as easy to understand as Travis' one, so that's a good point for Cirrus.

myii commented 5 years ago

@n-rodriguez You spoke too soon... wait until we start adding *BSD, MacOS and Windows!

myii commented 5 years ago

Just noting another benefit of keeping the release process in Travis: all of the GitHub tokens that have already been set across the repos. Having to do that again for Cirrus isn't appealing.

myii commented 5 years ago

https://cirrus-ci.org/faq/#are-there-any-limits:

Are there any limits?

There are no limits on how many VMs or Containers you can run in parallel if you bring your own compute services or use Compute Credits for either private or public repositories.

Cirrus CI has the following limitations on how many VMs or Containers a single user can run for free for public repositories:

  • 8 Linux Containers or VMs
  • 2 Windows Containers or VMs
  • 2 FreeBSD VMs
  • 1 macOS VM

Which means that a single user can run at most 13 simultaneous tasks for free.

No per repository limits

Cirrus CI doesn't enforce any limits on repository or organization levels. All the limits are on a per-user basis. For example, if you have 10 active contributors to a repository, then you can end up with 130 tasks running in parallel for the repository.

myii commented 5 years ago

In terms of ensuring all checks are passing before allowing merges, it's probably best to leave that to the GitHub UI rather than cross-checking from Travis and Cirrus.

If there really was a need to check, there's probably a way via the GitHub API.

myii commented 5 years ago

Here's the first PR merged using Cirrus CI alongside Travis CI:

Note, we enabled all of the instances in the matrix for initial testing. We'll reduce that eventually, if all goes well.

myii commented 5 years ago

Second PR:

myii commented 5 years ago

First negative points to collect:

  1. There's no (easy?) way to cancel and re-run jobs.
  2. Concurrency per user can end up quite painful -- I'm still waiting for the merge of the first PR above to complete before the checks on the second PR can begin... it might delay me by 30 minutes (this is an extreme case).
  3. Each instance runs slower than the Travis equivalent, maybe even half the speed (1.5 minutes vs. 3 minutes).
  4. Instances can take a long time to start up; we're talking about sometimes even 10 minutes before any start up.
  5. The instances sometimes end up restarting without any clear reason why. This was the answer I got in response to that:

Cirrus CI uses preemptible instances for cost optimization for free OSS builds, so instances sometimes get preempted, but Cirrus then automatically reschedules them on regular instances. This is a rare event.

Unfortunately, I've already seen it happen a few times.

myii commented 5 years ago

Some more bad news, unfortunately.

I reduced the test matrix back to 6 (like our reduced Travis runs), but the first stage of the PR (before merge) hasn't completed yet: https://cirrus-ci.com/build/5778854061277184 -- over an hour.

Update: I went away from this comment for a few moments and it's just finished. Now to try the PR merge to see how long that takes...

fkorotkov commented 5 years ago

@myii answered inline:

  1. There's no (easy?) way to cancel and re-run jobs.

You can do that from the Cirrus CI Web UI.

  2. Concurrency per user can end up quite painful -- I'm still waiting for the merge of the first PR above to complete before the checks on the second PR can begin... it might delay me by 30 minutes (this is an extreme case).

Unfortunately this is how the limits work for free OSS tasks. Travis, for example, has a limit of 5 for an organization. You can consider using compute credits to avoid the limits. A few OSS projects use them to avoid the concurrency issue for organization members.

  3. Each instance runs slower than the Travis equivalent, maybe even half the speed (1.5 minutes vs. 3 minutes).

I see that bundle install takes quite a while in some tasks. You can consider adding caching as shown in the Ruby example.

  4. Instances can take a long time to start up; we're talking about sometimes even 10 minutes before any start up.

There is currently an issue where we are hitting quotas. A few other organizations are unusually active at the same time, which led to the issue. Bad timing for you. 😅

  5. The instances sometimes end up restarting without any clear reason why. This was the answer I got in response to that.

There should always be a link to documentation about the restart issue. Instances sometimes get rescheduled for cost optimization on our end, to provide the free service.

myii commented 5 years ago

@fkorotkov Thanks for your responses, that will give us some ideas about things we can try out.

myii commented 5 years ago

@fkorotkov By the way, in terms of starting, stopping and restarting instances, do you have to open each one manually to do that? We're a bit spoilt by how Travis allows you to do everything from one page.

myii commented 5 years ago

Update: I went away from this comment for a few moments and it's just finished. Now to try the PR merge to see how long that takes...

Some positive news: that didn't work out so badly -- around 15 minutes: https://cirrus-ci.com/build/5976877521436672.

myii commented 5 years ago

@fkorotkov above:

I see that bundle install takes quite a while in some tasks. You can consider adding caching as shown in the Ruby example.

I've tried various invocations but I can't work out how to get this to work within docker_builder. I didn't find any further documentation about how to take advantage of bundle_cache, except for this example. Can you suggest how to use it in this situation?

fkorotkov commented 5 years ago

@myii you can simply put

bundle_cache:
    folder: /usr/local/bundle
    fingerprint_script: echo $RUBY_VERSION && cat Gemfile && cat Gemfile.lock
    populate_script: bundle install

before verify_script.

Also please feel free to create an issue for adding more buttons like you want.
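For context, the suggestion would sit in .cirrus.yml roughly as follows -- a sketch with an assumed task shape and instance name, based only on the snippet above and the 'Test $INSTANCE' naming that appears later in this thread:

    # .cirrus.yml (sketch) -- placement of the cache instruction relative to verify_script;
    # the task layout and instance name are assumptions
    docker_builder:
      name: Test $INSTANCE
      env:
        INSTANCE: debian-9-2019-2-py3      # hypothetical instance name
      bundle_cache:
        folder: /usr/local/bundle
        fingerprint_script: echo $RUBY_VERSION && cat Gemfile && cat Gemfile.lock
        populate_script: bundle install
      verify_script: bundle exec kitchen verify "$INSTANCE"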

myii commented 5 years ago

@fkorotkov That was the first thing I tried. I'll do it again now and then link you to the result. Can you confirm the amount of indentation to use?

myii commented 5 years ago

@fkorotkov https://cirrus-ci.com/build/5757379426123776:

FAILED_PRECONDITION: Unknown field bundle_cache on 'Test $INSTANCE' YAML node!

fkorotkov commented 5 years ago

@myii indeed, Docker Builder doesn't support any instructions besides scripts. I don't see a reason why it shouldn't support caches, artifacts, etc. Let me fix it. 🙌

myii commented 5 years ago

@fkorotkov Thanks.

fkorotkov commented 5 years ago

It should work now.

myii commented 5 years ago

@fkorotkov Thanks for the fast turnaround. While the test run completes, it doesn't appear to have made any difference to the time taken: https://cirrus-ci.com/build/5643032095883264. Is there anything else that needs to be done with the configuration?

fkorotkov commented 5 years ago

@myii it seems bundler caches to a different folder, /var/lib/gems. I've tried to cache it, but then after the cache is restored, bundle can't find kitchen. Not sure what's going on there. 😪

myii commented 5 years ago

@fkorotkov Thanks for digging into that; perhaps someone else on this thread will be able to figure it out. Appreciate your help.
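One avenue that might be worth trying, given the /var/lib/gems observation above, is pointing bundler at a project-local path and caching that folder instead -- again only a sketch, with assumed paths and task shape:

    # .cirrus.yml (sketch) -- cache a project-local bundle path instead of the system gem dir;
    # BUNDLE_PATH is a standard bundler setting, the rest of the task layout is assumed
    docker_builder:
      env:
        BUNDLE_PATH: vendor/bundle         # make bundler install into (and load from) this folder
      bundle_cache:
        folder: vendor/bundle
        fingerprint_script: cat Gemfile.lock
        populate_script: bundle install
      verify_script: bundle exec kitchen verify    # as before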