libp2p / go-libp2p

libp2p implementation in Go
MIT License
6.12k stars 1.08k forks source link

speed up CI #1632

Open marten-seemann opened 2 years ago

marten-seemann commented 2 years ago

CI is taking a long time to run our builds. We should

  1. split up our builds. There's no need to run go test and go test -race sequentially. These can be separate jobs.
  2. use self-hosted runners running on beefier hardware, and without the concurrency restrictions that GitHub Actions imposes
laurentsenta commented 2 years ago

I can't assign myself or the ipdx tag yet, but we're looking into this,

I created a PR to show how we could run the test in separate jobs: https://github.com/libp2p/go-libp2p/pull/1639

We need to investigate some more and make sure this would be an improvement:

(cc @galargh)

marten-seemann commented 2 years ago

We should also see if there are any test cases that are particularly slow. This is not a replacement for optimizing our CI setup, but it might help nonetheless.

I found this tutorial in the Google cache: https://webcache.googleusercontent.com/search?q=cache:5NYzDwHIEj8J:https://leighmcculloch.com/posts/go-find-slow-tests/&cd=3&hl=en&ct=clnk&gl=il

Step 1: Run the tests with verbose json output, without caching.

go test -count=1 -v -json | tee testoutput.txt

Step 2: Extract from that output a sorted list of tests and their elapsed times.

cat testoutput.txt \
  | jq -r 'select(.Action == "pass" and .Test != null) | .Test + "," + (.Elapsed | tostring)' \
  | sort -k2 -n -t,
marten-seemann commented 2 years ago

These are all the tests taking longer than 0.5s (unfortunately, without the package names). The number behind the test name is the duration in seconds.


TestConstraintsCleanup,0.5
TestResolutionMisaligned,0.51
TestTcpTransport/github.com/libp2p/go-libp2p/p2p/transport/testsuite.SubtestStreamOpenStress#01,0.56
TestFuzzManyPeers,0.57
TestListenerCloseClosesQueued,0.6
TestTcpTransport/github.com/libp2p/go-libp2p/p2p/transport/testsuite.SubtestStreamOpenStress,0.61
TestFailuresOnResponder,0.64
TestInMemoryPeerstore/Metadata,0.67
TestInMemoryPeerstore/Metadata/putting_and_getting,0.67
TestDsPeerstore/Leveldb/Metadata,0.72
TestDsPeerstore/Leveldb/Metadata/putting_and_getting,0.72
TestBackoffConnector,0.74
TestStressActiveDial,0.89
TestAutoNATPublic,1
TestDeadlines,1
TestDialWorkerLoopConcurrentFailure,1
TestDialWorkerLoopFailure,1
TestHostAddrChangeDetection,1
TestLocalIPChangesWhenListenAddrChanges,1
TestAutoNATPrivate,1.01
TestIdentifyDeltaWhileIdentifyingConn,1.01
TestDefaultTransportWithMemoryLimit/github.com/libp2p/go-libp2p/p2p/muxer/testsuite.SubtestStreamOpenStress,1.06
TestDefaultTransportWithMemoryLimit,1.46
TestStreamsWithLatency,1.5
TestDefaultTransport/github.com/libp2p/go-libp2p/p2p/muxer/testsuite.SubtestStreamOpenStress,1.57
TestObsAddrSet,1.57
TestConnectionGating,1.59
TestTcpTransport,1.82
TestDsPeerstore/Leveldb/BasicPeerstore,1.91
TestDefaultTransport/github.com/libp2p/go-libp2p/p2p/muxer/testsuite.SubtestStreamOpenStress,1.95
TestDialWorkerLoopConcurrentFailureStress,2
TestRelayLimitTime,2.01
TestAutoNATPublictoPrivate,2.02
TestDefaultTransport,2.02
TestObservedAddrFiltering,2.05
TestStatelessReset,2.13
TestInMemoryPeerstore/BasicPeerstore,2.14
TestDefaultTransport,2.41
TestDsKeyBook/Leveldb/PeersWithKeys,2.47
TestInMemoryKeyBook/PeersWithKeys,2.49
TestDsPeerstore/Leveldb,2.71
TestDsPeerstore,2.72
TestLimitedStreams,2.73
TestInMemoryKeyBook,2.81
TestInMemoryPeerstore,2.91
TestStressLimiter,2.99
TestQuickBurstRespectsSilencePeriod,3
TestDsKeyBook,3.06
TestDsKeyBook/Leveldb,3.06
marten-seemann commented 1 year ago

I just discovered https://github.com/pl-strflt/tf-aws-gh-runner. @galargh What would need to be done to use these runners for go-libp2p? It looks like it would clash with Unified CI, since the runs-on is hardcoded.

galargh commented 1 year ago

I just discovered https://github.com/pl-strflt/tf-aws-gh-runner. @galargh What would need to be done to use these runners for go-libp2p? It looks like it would clash with Unified CI, since the runs-on is hardcoded.

Let's make them configurable! https://github.com/protocol/.github/pull/443

GitHub just announced configuration variables which, in my opinion, are a pretty good fit for this.

marten-seemann commented 1 year ago

Any chance we can pick this up at some point? Slow CI probably slowed me down by 1-2h while cutting the v0.27.0 release. The most important thing would be to dramatically increase horizontal scaling. When merging multiple PRs, jobs get queued up for a very long time, since no runners are available.

galargh commented 1 year ago

I just put ubuntu jobs from Go Test workflow on self-hosted GHA runner. Self-hosted runners are sourced from a separate pool which should free up the resources for other jobs. Additionally, it speeds up the execution of the job ~13m -> ~6m.

This change will only affect the runs initiated in this repository (i.e. workflow runs triggered on pull_request from forks will keep running on GH machines).

If we happen to have to revert the change, here are the instructions:

  1. Go to https://github.com/libp2p/go-libp2p/settings/variables/actions
  2. Delete UCI_GO_TEST_RUNNER_UBUNTU Repository variables

Next steps from the IPDX side, prepare windows runners for use in Go Test workflow.

marten-seemann commented 1 year ago

Thank you @galargh, this is great!

Is the UCI_GO_TEST_RUNNER_UBUNTU format documented anywhere? I'm struggling making sense of it:

image
galargh commented 1 year ago

Good catch, we didn't mention the format in https://github.com/protocol/.github#configuration-variables before. Now I added it in https://github.com/protocol/.github/commit/43e27921600f05276a1259a9d9d67cdb9ec4e905.

UCI_*_RUNNER_* variables expect the values to be JSON formatted. For example, if you want the MacOS runner used in Go Test workflow to be macos-12 specifically, you'd set UCI_GO_TEST_RUNNER_MACOS to "macos-12" (notice the " around the string); and if you want your Ubuntu runner to be a self hosted machine with labels this, is, my, self-hosted, runner, you'd set UCI_GO_TEST_RUNNER_UBUNTU to ["this", "is", "my", "self-hosted", "runner"].

marten-seemann commented 1 year ago

What are these labels? Each label (except for the "go-libp2p") makes sense on its own, but I don't understand how they compose.

galargh commented 1 year ago

We use https://github.com/philips-labs/terraform-aws-github-runner to set up our self-hosted runners. It gives the runners os, architecture and self-hosted labels by default.

Additionally, we label our self-hosted runners with a runner group they belong to, see https://github.com/pl-strflt/tf-aws-gh-runner/blob/main/runners.tf#L150. In this case, the runner group we're using is go-libp2p defined at https://github.com/pl-strflt/tf-aws-gh-runner/blob/main/runners.tf#L80

As you can see by looking at the available Runner Types, we've been grouping the runners by repository so far but as the solution gains popularity, I'm thinking it'd make for an easier set up to group the runner types by sizes and only break out of this scheme if some repository really needs a dedicated pool of runners.

galargh commented 1 year ago

I added size-based groups to our self-hosted solution. The labels in this repo now read: ['self-hosted', 'linux', 'x64', '2xlarge'].