sous-chefs / sql_server

Development repository for the sql_server cookbook
https://supermarket.chef.io/cookbooks/sql_server
Apache License 2.0

Limit CI integration parallel jobs #170

Closed JHBoricua closed 3 years ago

JHBoricua commented 3 years ago

Description

Limits the number of parallel CI integration jobs.

Issues Resolved

This attempts to avoid the rate-limit issue when downloading Vagrant images that is causing the CI jobs to fail.

Check List

JHBoricua commented 3 years ago

I'm limiting the number of parallel integration jobs to 5, based on the number of jobs that were able to launch successfully on PR #165. If this proves too high, it can be lowered to 4, which I already confirmed on a previous commit doesn't run into Vagrant's rate limits.
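For context, assuming the integration workflow uses a matrix strategy (the suite names, runner, and setup steps below are illustrative placeholders, not the cookbook's actual workflow), the cap would look roughly like this:

```yaml
# Hypothetical sketch only: cap how many matrix jobs run at once so we
# don't hit Vagrant Cloud with too many simultaneous box downloads.
name: ci

on: pull_request

jobs:
  integration:
    runs-on: macos-latest   # runner choice is illustrative
    strategy:
      fail-fast: false
      max-parallel: 4       # the knob this PR tunes (tried 5, then 4)
      matrix:
        suite: [client, install, sql2012, sql2016, sql2017, sql2019]
    steps:
      - uses: actions/checkout@v2
      # steps installing Chef Workstation, Vagrant, and VirtualBox omitted
      - name: Run Test Kitchen
        run: kitchen test ${{ matrix.suite }}
```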

JHBoricua commented 3 years ago

Looks like 5 parallel jobs did in fact trigger Vagrant's rate-limiting enforcement. Going to lower it to 4 and kick off the CI jobs again.

JHBoricua commented 3 years ago

Results are:

  1. First attempt at 4 parallel integration jobs: no rate limiting from Vagrant.
  2. Second attempt at 5 parallel integration jobs: no rate limiting.
  3. Third attempt at 5, about an hour after the second attempt: rate limited by the Vagrant site.
  4. Fourth attempt, lowered to 4 parallel jobs: still rate limited.
  5. Fifth attempt, lowered further to 2 parallel jobs a few hours later: not rate limited, but the run took over 2 hours.

I'm increasing back to 4, as that seems to be the sweet spot assuming there are no back-to-back attempts. It's hard to tell what request limit Vagrant is enforcing, or over what time window, since the logs don't show the response headers.

JHBoricua commented 3 years ago

And even though the last CI jobs ran 8 hours ago, we are already seeing a bunch of rate-limit errors on the latest run this morning after updating the changelog, which makes no sense. It's pretty frustrating not knowing what rate limits HashiCorp is imposing. Hosting the Vagrant images somewhere else might be a way to overcome this. Otherwise, even with limiting the number of parallel jobs, we may have to accept that a bunch of the Kitchen tests will simply fail anyway.

JHBoricua commented 3 years ago

@damacus Not sure how you want to proceed in light of the latest test results. If you want me to simply close this while other options are explored, let me know.

ramereth commented 3 years ago

> @damacus Not sure how you want to proceed in light of the latest test results. If you want me to simply close this while other options are explored, let me know.

Yeah, it seems as though this is IP-based according to this doc, which we don't have any control over.

I think the next best option is to look at using GitHub Actions caching and start caching the boxes. If that doesn't work, then we may want to reach out to HashiCorp and see if they have any suggestions.
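A rough sketch of what that caching could look like, assuming the boxes end up in Vagrant's default ~/.vagrant.d/boxes directory (the cache key is a placeholder):

```yaml
# Hypothetical step using actions/cache to restore/save downloaded
# Vagrant boxes so repeat runs don't have to pull them from Vagrant Cloud.
- name: Cache Vagrant boxes
  uses: actions/cache@v2
  with:
    path: ~/.vagrant.d/boxes
    key: vagrant-boxes-${{ matrix.suite }}-v1
```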

JHBoricua commented 3 years ago

@ramereth Ok, I'll close this PR.

JHBoricua commented 3 years ago

@ramereth I started looking at using Actions caching, as it seemed promising, but there are two problems with it. First, it has a 5 GB size limit, and the images are about that size, so each server OS generation would still be downloaded at least once every time the CI workflow runs, which could in theory still trigger the rate limits. The second problem, which I just confirmed in a test on my cloned repo as I was typing this, is that the images for Server 2016 and 2019 are slightly larger than 5 GB, so they won't cache at all. Only Server 2012R2 was able to cache.

ramereth commented 3 years ago

@JHBoricua I'm concerned they might hit some kind of limit. I'd say the next step is to reach out to HashiCorp's support and see what we can do, as I imagine this might be hitting other GH Actions users.

I suppose the Sous Chefs could put these images in some DigitalOcean object storage buckets to work around it, but then we'd need to update them manually. @sous-chefs/board thoughts on that?
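If we went that route, kitchen-vagrant can be pointed at a custom box URL; a minimal sketch, with the bucket URL and box name as placeholders:

```yaml
# Hypothetical kitchen.yml fragment: fetch the box from our own object
# storage instead of Vagrant Cloud. The URL and box name are made up.
driver:
  name: vagrant

platforms:
  - name: windows-2019
    driver:
      box: sous-chefs/windows-2019
      box_url: https://example.nyc3.digitaloceanspaces.com/boxes/windows-2019.box
```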

damacus commented 3 years ago

I'll see if I can reach out to someone at HashiCorp. They at least will have seen this before and might be able to suggest something we haven't thought of.

JHBoricua commented 3 years ago

@damacus @ramereth

Something I observed when testing GitHub caching: on my second test, I did get rate-limited pulling the 2012R2 image, but not the 2016/2019 ones (I hadn't pulled those on the first attempt). So they seem to be enforcing this on a per-object basis. I'm wondering whether, if the tests are rearranged so each suite is tested separately, leaving more time between the image downloads for each OS under that suite, it would avoid the rate limits. It would mean breaking up the integration job into multiple ones, one each for client, install, sql2012, sql2016, sql2017, and sql2019. In theory, it should still allow us to run each integration job in parallel; a rough sketch is below.
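Roughly, the split could look like this (job names are placeholders and only two suites are shown; kitchen test runs a suite's instances serially by default, which spaces out the box downloads within that suite):

```yaml
# Hypothetical sketch of one job per suite instead of a single
# integration job; the remaining suites would follow the same pattern.
jobs:
  integration-sql2017:
    runs-on: macos-latest   # runner choice is illustrative
    steps:
      - uses: actions/checkout@v2
      # steps installing Chef Workstation, Vagrant, and VirtualBox omitted
      - name: Run Test Kitchen for the sql2017 suite
        run: kitchen test sql2017
  integration-sql2019:
    runs-on: macos-latest
    steps:
      - uses: actions/checkout@v2
      - name: Run Test Kitchen for the sql2019 suite
        run: kitchen test sql2019
```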

I can test that approach on my repo and report back if you folks want me to.

JHBoricua commented 3 years ago

Never mind, I still got rate-limited even though there was a 10-minute gap between the image downloads of each Windows version.