stacks-network / stacks-core

The Stacks blockchain implementation
https://docs.stacks.co
GNU General Public License v3.0

Use `nextest` and faster action runners for unit tests task #3808

Open · kantai opened this issue 1 year ago

kantai commented 1 year ago

I wanted to do some simple tests of how much improvement this repo could get from basically two interventions:

  1. Drop in nextest for the unit tests task
  2. Use 4vcpu or 8vcpu buildjet runners

Nextest for Unit Tests

To do this, I basically just did the following:

diff --git a/.github/workflows/stacks-blockchain-tests.yml b/.github/workflows/stacks-blockchain-tests.yml
index 56ffccdab..ef70966d5 100644
--- a/.github/workflows/stacks-blockchain-tests.yml
+++ b/.github/workflows/stacks-blockchain-tests.yml
@@ -37,13 +37,16 @@ jobs:

   # Run unit tests with code coverage
   unit-tests:
-    name: Unit Tests
+    name: Unit Tests (nextest)
     runs-on: ubuntu-latest
     steps:
       - name: Add code coverage tools
         run: |
           rustup component add llvm-tools-preview
           cargo install grcov
+      - name: Add nextest
+        run: |
+          cargo install cargo-nextest
       - name: Checkout the latest code
         id: git_checkout
         uses: actions/checkout@v3
@@ -53,7 +56,13 @@ jobs:
           RUSTFLAGS: -Cinstrument-coverage
           LLVM_PROFILE_FILE: stacks-blockchain-%p-%m.profraw
         run: |
-          cargo test --workspace
+          cargo nextest run --workspace
+      - name: Collate grcov
+        id: unit_tests_grcov
+        env:
+          RUSTFLAGS: -Cinstrument-coverage
+          LLVM_PROFILE_FILE: stacks-blockchain-%p-%m.profraw
+        run: |
           grcov . --binary-path ./target/debug/ -s . -t lcov --branch --ignore-not-existing --ignore "/*" -o lcov.info
       - name: Upload codecov results
         uses: codecov/codecov-action@v3

That's pretty much just a drop-in replacement of `cargo test --workspace` with `cargo nextest run --workspace`. It also separates the grcov collation out into its own step (I still need to confirm that the generated coverage still works, but the grcov invocation isn't erroring).

With just that, the unit tests task takes 1hr37m to complete on this repo instead of the ~5 hours otherwise. A big win!

Buildjet runners

I also tried out, in a private fork, how much performance improves when switching to a buildjet runner.

To do that, I just changed the workflow definition to:

diff --git a/.github/workflows/stacks-blockchain-tests.yml b/.github/workflows/stacks-blockchain-tests.yml
index 56ffccdab..dbe9cd589 100644
--- a/.github/workflows/stacks-blockchain-tests.yml
+++ b/.github/workflows/stacks-blockchain-tests.yml
@@ -37,13 +37,20 @@ jobs:

   # Run unit tests with code coverage
   unit-tests:
-    name: Unit Tests
-    runs-on: ubuntu-latest
+    name: Unit Tests (nextest) - ${{matrix.runs-on}}
+    strategy:
+      fail-fast: false # In case one of the jobs fails, the other jobs won't be cancelled
+      matrix:
+        runs-on: [ buildjet-4vcpu-ubuntu-2204, buildjet-8vcpu-ubuntu-2204 ]
+    runs-on: ${{matrix.runs-on}}
     steps:
       - name: Add code coverage tools
         run: |
           rustup component add llvm-tools-preview
           cargo install grcov
+      - name: Add nextest
+        run: |
+          cargo install cargo-nextest
       - name: Checkout the latest code
         id: git_checkout
         uses: actions/checkout@v3
@@ -53,7 +60,13 @@ jobs:
           RUSTFLAGS: -Cinstrument-coverage
           LLVM_PROFILE_FILE: stacks-blockchain-%p-%m.profraw
         run: |
-          cargo test --workspace
+          cargo nextest run --workspace
+      - name: Collate grcov
+        id: unit_tests_grcov
+        env:
+          RUSTFLAGS: -Cinstrument-coverage
+          LLVM_PROFILE_FILE: stacks-blockchain-%p-%m.profraw
+        run: |
           grcov . --binary-path ./target/debug/ -s . -t lcov --branch --ignore-not-existing --ignore "/*" -o lcov.info
       - name: Upload codecov results
         uses: codecov/codecov-action@v3

Basically, the only change (in addition to the previously discussed nextest changes) was switching `runs-on` to a matrix of the 4- and 8-vcpu buildjet runners. I used a matrix because I wanted to see whether there was a big speedup between those two runners.

The speedup here is enormous:

[Screenshot, 2023-07-26: unit-tests job duration on the buildjet-8vcpu runner]

That's 11 minutes!

The 4vcpu runner is also much quicker than the stock GitHub runner: it completed the unit tests in 17m52s. The 8vcpu isn't quite twice as fast as the 4vcpu, though, so it may not make sense to use 8 instead of 4.

But in either case, that's roughly 18 minutes versus the 97 minutes it takes with nextest on the stock runner, which is a huge improvement. And looking at it against the original setup, that's 18 minutes versus ~300 minutes.

wileyj commented 1 year ago

I noticed similar improvements when I was testing buildjet, but I ran into a few roadblocks (mainly cost) when evaluating that service.

If we ran every test that is currently defined on buildjet with nextest, the full workflow could cost upwards of $5 USD per workflow execution. Before I would commit to spending the money on buildjet, I think we need to run some tests more conditionally in the workflows (e.g. is it a markdown-only change? no tests. A PR comment? no tests), etc.
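A minimal sketch of that kind of filtering, assuming GitHub's built-in `paths-ignore` trigger filters (the patterns below are illustrative, not taken from this repo's workflows):

```yaml
# Sketch only: skip the whole test workflow for docs-only changes.
on:
  push:
    paths-ignore:
      - "**.md"
      - "docs/**"
  pull_request:
    paths-ignore:
      - "**.md"
      - "docs/**"
```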

wileyj commented 1 year ago

@kantai But what you tested here is essentially what I've also been testing on my fork.

I wonder if it's possible to conditionally run a workflow on buildjet when certain conditions are met, e.g. a tagged release would run on buildjet whereas a PR approval would run on the generic GitHub runners.
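One possible sketch of that (untested here) is picking the runner label with an expression on the triggering event; the `&& ... ||` pattern is the usual GitHub Actions idiom for a conditional value:

```yaml
# Sketch only: tagged releases get a buildjet runner, everything else the stock runner.
jobs:
  unit-tests:
    runs-on: ${{ github.event_name == 'release' && 'buildjet-4vcpu-ubuntu-2204' || 'ubuntu-latest' }}
```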

kantai commented 1 year ago

Right -- but this is also just running buildjet on the unit tests task (which is the current bottleneck). For that, it would be either ~14 minutes @ $0.016 per minute ($0.22) on the 8vcpu runner, or ~23 minutes @ $0.008 per minute ($0.18) on the 4vcpu.

Speeding up the bitcoin-integration-tests (which eat way more compute time) can probably be done with the current vanilla GitHub runners, and I'd suggest that could be done separately from trying to speed up the unit tests.
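For example, one untested option would be to shard the integration tests across several stock runners using cargo-nextest's counted partitioning (a sketch only; the job name, partition count, and invocation below are illustrative, not the repo's actual configuration):

```yaml
# Sketch only: a job fragment that splits the integration test run across
# 4 stock GitHub runners with cargo-nextest's counted partitioning.
jobs:
  bitcoin-integration-tests:
    strategy:
      fail-fast: false
      matrix:
        partition: [1, 2, 3, 4]
    runs-on: ubuntu-latest
    steps:
      - name: Checkout the latest code
        uses: actions/checkout@v3
      - name: Add nextest
        run: cargo install cargo-nextest
      - name: Run one test partition
        run: cargo nextest run --workspace --partition count:${{ matrix.partition }}/4
```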

wileyj commented 1 year ago

Hmm, I may want to run some extra tests as well. I was working on the idea of using a cache for the binaries used to run the tests (which works quite well, in fact), but I'm not sure if buildjet has access to GitHub's cache (I know buildjet has its own caching service, which I used when running all the tests on buildjet).

For comparison, I'm running the unit-tests step in a similar manner with a cache and nextest, and the tests on a GitHub runner took 33m.
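For reference, the kind of cache step being described would look roughly like this with the standard actions/cache action (a sketch only; the paths and key are illustrative, and buildjet's own caching service mentioned above has a similar drop-in action):

```yaml
# Sketch only: cache the cargo registry and build artifacts between runs so
# the test job doesn't rebuild everything from scratch.
- name: Cache cargo artifacts
  uses: actions/cache@v3
  with:
    path: |
      ~/.cargo/registry
      ~/.cargo/git
      target
    key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}
```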

kantai commented 1 year ago

That's awesome! I would suggest opening a PR against develop with the unit-tests updates (or I might with just the nextest run replacement). At the moment, basically all progress on PRs is stalled because of this -- essentially no PRs can merge while the unit tests take as long as they do.

wileyj commented 1 year ago

> That's awesome! I would suggest opening a PR against develop with the unit-tests updates (or I might with just the nextest run replacement). At the moment, basically all progress on PRs is stalled because of this -- essentially no PRs can merge while the unit tests take as long as they do.

I would support your PR first, since it will work and it's a small change. In my fork, I'm taking a larger look at the other workflows (the nextest/cache work was the initial goal, but I realized that more changes would be needed to support it). The buildjet aspect has me thinking as well, so I want to take another look at that; perhaps we can conditionally use that service to speed up some workflows.