ruby / setup-ruby

An action to download a prebuilt Ruby and add it to the PATH in 5 seconds
https://github.com/marketplace/actions/setup-ruby-jruby-and-truffleruby
MIT License
787 stars 254 forks

Action has become very slow #543

Closed: ledermann closed this issue 10 months ago

ledermann commented 10 months ago

Ensure the following before filing this issue

Are you running on a GitHub-hosted runner or a self-hosted runner?

GitHub-hosted runner

Link to the failed workflow job (must be a public workflow job, so the necessary information is available)

https://github.com/templatus/templatus-hotwire/actions/runs/6713970783/job/18246455669

Any other notes?

The action has become slow lately: setting up Ruby went from around 10 seconds to over 2 minutes.

Furthermore, there is a strange difference between the duration measured by GitHub and the time indicated by the action itself. See this example: the job itself says "9.52 seconds", but GitHub says "2m 21s".

(Screenshot attached.)

The code I use is very simple:

- name: Set up Ruby
  uses: ruby/setup-ruby@v1
  with:
    bundler-cache: true

MSP-Greg commented 10 months ago

I've noticed this also. If you turn on the timestamps or download the log, the following is shown:

2023-11-01T00:23:21.0213624Z Cache key: setup-ruby-bundler-cache-v6-ubuntu-22.04-x64-ruby-3.2.2-wd-/home/runner/work/templatus-hotwire/templatus-hotwire-with--without--only--Gemfile.lock-cf68edb4543c7c3c8470ee8a3b747071dac9a7f804eae170528713c041520cc8
2023-11-01T00:23:21.6767350Z Cache Size: ~54 MB (56493164 B)
2023-11-01T00:23:21.6876926Z [command]/usr/bin/tar -xf /home/runner/work/_temp/439c9c90-f5cc-42d3-9750-233b8e5abcfc/cache.tzst -P -C /home/runner/work/templatus-hotwire/templatus-hotwire --use-compress-program unzstd
2023-11-01T00:23:22.1771109Z Received 56493164 of 56493164 (100.0%), 53.8 MBs/sec
2023-11-01T00:23:22.2658245Z Cache restored successfully
2023-11-01T00:23:22.2744930Z Found cache for key: setup-ruby-bundler-cache-v6-ubuntu-22.04-x64-ruby-3.2.2-wd-/home/runner/work/templatus-hotwire/templatus-hotwire-with--without--only--Gemfile.lock-21ff1081c32103b8892cccf02164ddbb1185541cdcb08c073975b58f69c420dc
2023-11-01T00:23:22.2754773Z [command]/opt/hostedtoolcache/Ruby/3.2.2/x64/bin/bundle install --jobs 4
2023-11-01T00:23:25.6540433Z Fetching gem metadata from https://rubygems.org/.........
2023-11-01T00:23:25.9126946Z Fetching sidekiq 7.2.0
2023-11-01T00:23:26.0051303Z Installing sidekiq 7.2.0
2023-11-01T00:23:26.0760837Z Bundle complete! 35 Gemfile dependencies, 155 gems now installed.
2023-11-01T00:23:26.0763313Z Bundled gems are installed into `./vendor/bundle`
2023-11-01T00:23:26.1237137Z [command]/opt/hostedtoolcache/Ruby/3.2.2/x64/bin/bundle clean
2023-11-01T00:23:26.5386807Z Removing sidekiq (7.1.6)
2023-11-01T00:23:26.5480805Z Saving cache
2023-11-01T00:23:26.5691124Z [command]/usr/bin/tar --posix -cf cache.tzst --exclude cache.tzst -P -C /home/runner/work/templatus-hotwire/templatus-hotwire --files-from manifest.txt --use-compress-program zstdmt
2023-11-01T00:23:29.9885417Z Cache Size: ~54 MB (56581855 B)
2023-11-01T00:23:30.0962980Z Cache saved successfully
2023-11-01T00:23:30.1085596Z Took   9.52 seconds
2023-11-01T00:25:40.4096614Z ##[group]Run actions/setup-node@v4.0.0

So the slowdown (about 2:10) happens between the last log line from setup-ruby and the first log line of the next step in your Actions job, as the last two lines of the above log show. What's taking up those 2:10 is something I haven't yet looked at.

But some of the logs here and elsewhere do not consistently show the delays...

dentarg commented 10 months ago

The job itself says "9.52 seconds", but GitHub says "2m 21s"

the logs here and elsewhere do not consistently show the delays

Sounds to me like the slowness/delays are outside the control of this action?

I can see it myself too: 5 days ago it went from 30s to 2min+ (https://github.com/Starkast/wikimum/actions/workflows/ci.yml), but it is not consistent. There was one quick job after a slow job, and both jobs used 54a18e26dbbb1eabc604f317ade9a5788dddef81, which also indicates that it is not something in this action that causes the slowness.

Let's ping the GitHub people who helped in https://github.com/ruby/setup-ruby/pull/494, @Steve-Glass and @chkimes; maybe they can look into this.

chkimes commented 10 months ago

For those who are experiencing this, does it always appear to happen after restoring from the cache? I reviewed the timestamps, and it's interesting that there is a significant gap between the last output of one step and the first output of the next.

chkimes commented 10 months ago

Through sheer luck, I think you have pinged one of the few people who might understand what is going on. I was poking at the cache action a few months ago to fix the cache behavior when the connection to Azure Blob gets interrupted. One of the things I discovered while testing that is that the client code was leaking promises (I was not able to track this down).

I ended up working around that in the cache action itself here: https://github.com/actions/cache/blob/704facf57e6136b1bc63b828d79edcd491f0ee84/src/restoreImpl.ts#L102

setup-ruby, however, references the @actions/cache package directly, so it's not using the early-exit code: https://github.com/ruby/setup-ruby/blob/master/bundler.js#L5C7-L5C12

The root fix is probably to address this as a bugfix in the cache library. A quick mitigation could be to add a process.exit call here: https://github.com/ruby/setup-ruby/blob/master/index.js#L32
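
For illustration, here is a minimal sketch of what such a mitigation could look like in a JavaScript action's entry point. The run()/main() structure is illustrative only, not setup-ruby's actual index.js, and it assumes @actions/core is installed:

const core = require('@actions/core')

// Stand-in for the action's existing async work (install Ruby, restore/save the cache, ...)
async function run() {
  core.info('action work happens here')
}

async function main() {
  try {
    await run()
  } catch (error) {
    core.setFailed(error.message)
  } finally {
    // Exit as soon as the work is done, so leaked promises or lingering
    // keep-alive sockets cannot delay the end of the step.
    process.exit()
  }
}

main()

process.exit() without an argument uses process.exitCode, which core.setFailed() sets to 1, so a failing run would still mark the step as failed.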

MSP-Greg commented 10 months ago

@chkimes

does it always appear to happen after restoring from the cache?

From a quick look at several jobs, it appears the issue is with saving the cache: all the jobs I looked at that took the normal amount of time also had exact cache key hits on the restore, so they weren't saving the cache. HTH...

chkimes commented 10 months ago

Ah interesting. I don't know of anything specific that would cause a difference in behavior for save vs. restore; the issue could exist in either code path, since a leaked promise will block the runtime from exiting regardless of where it's leaked from. But if all instances are showing a cache save, then that's a strong sign.
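
As background (a generic Node sketch, not code from this thread): any handle that is still referenced, such as a pending timer or an open socket, keeps the event loop, and with it the step's Node process, alive after the last log line is printed:

// Nothing useful is waiting on this timer, but the process cannot exit until it fires.
setTimeout(() => {}, 2 * 60 * 1000)

console.log('work done') // printed immediately; the process still lingers for two minutes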

Are there any examples of a save step that don't experience the delay? That is, do we have a 100% repro during cache saves?

dentarg commented 10 months ago

Are there any examples of a save step that don't experience the delay?

I think this is one https://github.com/Starkast/wikimum/actions/runs/6638880163/job/18036138791#step:4:124

ledermann commented 10 months ago

I think this is one https://github.com/Starkast/wikimum/actions/runs/6638880163/job/18036138791#step:4:124

This example is from Oct 25, so it was run before setup-ruby was updated to Node 20 on Oct 26 in v1.159.0.

I see the delay only if two conditions are met: the cache is actually saved, and setup-ruby v1.159.0 (the Node 20 update) is used.

I made this test:

  1. Clear cache (using gh cache delete --all)
  2. Run last workflow again → Cache is saved → 2-min delay
  3. Run last workflow again → Cache is not saved, no delay
  4. Revert to v1.158.0, new workflow → Cache is not saved, no delay
  5. Clear cache again
  6. Run last workflow again → Cache is saved, no delay

By the way, there is a similar issue in setup-node after upgrading to Node 20, which also occurs only after saving the cache: https://github.com/actions/setup-node/issues/878

ledermann commented 10 months ago

Based on my observations above, I created a repo with a minimal reproducible example: https://github.com/ledermann/setup-ruby-delay-demo/actions

There you can see that saving the cache is slow with v1.159.0, but fast with v1.158.0.

chkimes commented 10 months ago

Thanks! This is great investigation work; let me get the toolkit team involved to look at the cache package and how its behavior might have changed in Node 20.

Bo98 commented 10 months ago

This is likely due to https://github.com/nodejs/node/issues/47228. The reason it regressed in Node 20 compared to Node 16 is that Node 19 changed the default of keepAlive from false to true.

The HTTP client code is wrapped in the @actions/http-client package, which controls its own default for keepAlive. However, it doesn't override Node's default correctly, and there's already an open PR to fix this: https://github.com/actions/toolkit/pull/1572. This will restore the Node 16 behaviour for anything using @actions/http-client.
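
As an aside, an individual caller can also opt out of keep-alive regardless of the Node default by passing an explicit agent. This is a generic Node sketch, not the toolkit's actual code:

const https = require('https')

// Explicitly disable keep-alive, overriding the keep-alive-by-default global agent of Node 19+.
const agent = new https.Agent({ keepAlive: false })

https.get('https://rubygems.org/', { agent }, (res) => {
  res.resume()
  res.on('end', () => console.log('status:', res.statusCode))
})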

I've tested that PR (see https://github.com/Bo98/toolkit-test/commit/2fc37709b6aed5db522e867a55868107a1d8b4c4) and can confirm it fixes the slowness: https://github.com/Bo98/toolkit-test/actions/runs/6758550452/job/18370288942 (compared to without that PR: https://github.com/Bo98/toolkit-test/actions/runs/6758456188/job/18370108410).

One of the things I discovered while testing that is that the client code was leaking promises (I was not able to track this down).

I ended up working around that in the cache action itself here: https://github.com/actions/cache/blob/704facf57e6136b1bc63b828d79edcd491f0ee84/src/restoreImpl.ts#L102

The reason you likely saw this is that the restore side of @actions/cache does force keepAlive on itself: https://github.com/actions/toolkit/blob/fe3e7ce9a7f995d29d1fcfd226a32bca407f9dc8/packages/cache/src/internal/downloadUtils.ts#L220, which explains seeing the issue under Node 16 in that particular scenario.

eregon commented 10 months ago

The root fix is probably to address this as a bugfix in the cache library. A quick mitigation could be to add a process.exit call here: https://github.com/ruby/setup-ruby/blob/master/index.js#L32

Would this solve the hanging keepAlive connections? If so, that sounds like an easy/safe enough fix to merge.

eregon commented 10 months ago

Yes, that helps. Thank you all for the investigation! It's unfortunate that the Node 20 update brought this bug in.

PR in https://github.com/ruby/setup-ruby/pull/546

Slow build in https://github.com/ruby/setup-ruby/actions/runs/6788405695/job/18453344720 (2min 17s)

2023-11-07T17:54:28.4897445Z Took   3.80 seconds
2023-11-07T17:56:38.5288805Z ##[group]Run ruby -v

Fast build in https://github.com/ruby/setup-ruby/actions/runs/6788443622/job/18453477353?pr=546 (5s, as advertised :) )

2023-11-07T17:58:41.1757838Z Took   2.20 seconds
2023-11-07T17:58:41.1894986Z ##[group]Run ruby -v

eregon commented 10 months ago

@MSP-Greg You probably want to explicitly call process.exit() in https://github.com/ruby/setup-ruby-pkgs too, so this performance bug is fixed for setup-ruby-pkgs as well (and the same goes for any other actions reusing this action's source code).

eregon commented 10 months ago

One remaining unknown is that builds on https://github.com/ruby/setup-ruby/pull/540 seem fast. Anyway, since we worked around the issue, it's all fine now.

dentarg commented 10 months ago

One remaining unknown is that builds on #540 seem fast

No `Saving cache` step happening?