marten-seemann opened this issue 2 years ago
I can't assign myself or the `ipdx` tag yet, but we're looking into this.
I created a PR to show how we could run the tests in separate jobs: https://github.com/libp2p/go-libp2p/pull/1639
We need to investigate some more and make sure this would be an improvement (cc @galargh).
We should also see if there are any test cases that are particularly slow. This is not a replacement for optimizing our CI setup, but it might help nonetheless.
I found this tutorial in the Google cache: https://webcache.googleusercontent.com/search?q=cache:5NYzDwHIEj8J:https://leighmcculloch.com/posts/go-find-slow-tests/&cd=3&hl=en&ct=clnk&gl=il
Step 1: Run the tests with verbose JSON output, without caching.

```sh
go test -count=1 -v -json | tee testoutput.txt
```
Step 2: Extract from that output a sorted list of tests and their elapsed times.

```sh
cat testoutput.txt \
  | jq -r 'select(.Action == "pass" and .Test != null) | .Test + "," + (.Elapsed | tostring)' \
  | sort -k2 -n -t,
```
These are all the tests taking longer than 0.5s (unfortunately without the package names; see the variation sketched after the list). The number after each test name is the duration in seconds.
TestConstraintsCleanup,0.5
TestResolutionMisaligned,0.51
TestTcpTransport/github.com/libp2p/go-libp2p/p2p/transport/testsuite.SubtestStreamOpenStress#01,0.56
TestFuzzManyPeers,0.57
TestListenerCloseClosesQueued,0.6
TestTcpTransport/github.com/libp2p/go-libp2p/p2p/transport/testsuite.SubtestStreamOpenStress,0.61
TestFailuresOnResponder,0.64
TestInMemoryPeerstore/Metadata,0.67
TestInMemoryPeerstore/Metadata/putting_and_getting,0.67
TestDsPeerstore/Leveldb/Metadata,0.72
TestDsPeerstore/Leveldb/Metadata/putting_and_getting,0.72
TestBackoffConnector,0.74
TestStressActiveDial,0.89
TestAutoNATPublic,1
TestDeadlines,1
TestDialWorkerLoopConcurrentFailure,1
TestDialWorkerLoopFailure,1
TestHostAddrChangeDetection,1
TestLocalIPChangesWhenListenAddrChanges,1
TestAutoNATPrivate,1.01
TestIdentifyDeltaWhileIdentifyingConn,1.01
TestDefaultTransportWithMemoryLimit/github.com/libp2p/go-libp2p/p2p/muxer/testsuite.SubtestStreamOpenStress,1.06
TestDefaultTransportWithMemoryLimit,1.46
TestStreamsWithLatency,1.5
TestDefaultTransport/github.com/libp2p/go-libp2p/p2p/muxer/testsuite.SubtestStreamOpenStress,1.57
TestObsAddrSet,1.57
TestConnectionGating,1.59
TestTcpTransport,1.82
TestDsPeerstore/Leveldb/BasicPeerstore,1.91
TestDefaultTransport/github.com/libp2p/go-libp2p/p2p/muxer/testsuite.SubtestStreamOpenStress,1.95
TestDialWorkerLoopConcurrentFailureStress,2
TestRelayLimitTime,2.01
TestAutoNATPublictoPrivate,2.02
TestDefaultTransport,2.02
TestObservedAddrFiltering,2.05
TestStatelessReset,2.13
TestInMemoryPeerstore/BasicPeerstore,2.14
TestDefaultTransport,2.41
TestDsKeyBook/Leveldb/PeersWithKeys,2.47
TestInMemoryKeyBook/PeersWithKeys,2.49
TestDsPeerstore/Leveldb,2.71
TestDsPeerstore,2.72
TestLimitedStreams,2.73
TestInMemoryKeyBook,2.81
TestInMemoryPeerstore,2.91
TestStressLimiter,2.99
TestQuickBurstRespectsSilencePeriod,3
TestDsKeyBook,3.06
TestDsKeyBook/Leveldb,3.06
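Since the jq filter above drops the package names, a small variation that also prints the `Package` field emitted by `go test -json` can recover them. A sketch, assuming the same `testoutput.txt` as above:

```sh
# Same extraction as before, but prefix each test with its package and sort
# numerically by the elapsed time in the third comma-separated column.
cat testoutput.txt \
  | jq -r 'select(.Action == "pass" and .Test != null) | .Package + "," + .Test + "," + (.Elapsed | tostring)' \
  | sort -t, -k3 -n
```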
I just discovered https://github.com/pl-strflt/tf-aws-gh-runner.
@galargh What would need to be done to use these runners for go-libp2p? It looks like it would clash with Unified CI, since the `runs-on` is hardcoded.
> I just discovered https://github.com/pl-strflt/tf-aws-gh-runner. @galargh What would need to be done to use these runners for go-libp2p? It looks like it would clash with Unified CI, since the `runs-on` is hardcoded.
Let's make them configurable! https://github.com/protocol/.github/pull/443
GitHub just announced configuration variables which, in my opinion, are a pretty good fit for this.
Any chance we can pick this up at some point? Slow CI probably slowed me down by 1-2h while cutting the v0.27.0 release. The most important thing would be to dramatically increase horizontal scaling. When merging multiple PRs, jobs get queued up for a very long time, since no runners are available.
I just put the `ubuntu` jobs from the Go Test workflow on self-hosted GHA runners. Self-hosted runners are sourced from a separate pool, which should free up resources for other jobs. Additionally, it speeds up the execution of the job from ~13m to ~6m.
This change will only affect runs initiated in this repository (i.e. workflow runs triggered on `pull_request` from forks will keep running on GitHub-hosted machines).
If we happen to have to revert the change, it can be done via the `UCI_GO_TEST_RUNNER_UBUNTU` entry in the repository variables settings.
Next steps from the IPDX side: prepare Windows runners for use in the Go Test workflow.
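For reference, a command-line sketch of the revert, assuming a gh CLI recent enough to ship the `gh variable` commands and that reverting simply means removing the repository variable (it can equally be deleted in the repository's Actions variables settings):

```sh
# Hypothetical revert: remove the repository variable so the Go Test workflow
# falls back to the default GitHub-hosted Ubuntu runner.
gh variable delete UCI_GO_TEST_RUNNER_UBUNTU --repo libp2p/go-libp2p
```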
Thank you @galargh, this is great!
Is the `UCI_GO_TEST_RUNNER_UBUNTU` format documented anywhere? I'm struggling to make sense of it.
Good catch, we didn't mention the format in https://github.com/protocol/.github#configuration-variables before. Now I added it in https://github.com/protocol/.github/commit/43e27921600f05276a1259a9d9d67cdb9ec4e905.
`UCI_*_RUNNER_*` variables expect the values to be JSON formatted. For example, if you want the MacOS runner used in the Go Test workflow to be `macos-12` specifically, you'd set `UCI_GO_TEST_RUNNER_MACOS` to `"macos-12"` (notice the `"` around the string); and if you want your Ubuntu runner to be a self-hosted machine with labels `this`, `is`, `my`, `self-hosted`, `runner`, you'd set `UCI_GO_TEST_RUNNER_UBUNTU` to `["this", "is", "my", "self-hosted", "runner"]`.
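To make that concrete, here is a sketch of setting those two example values from the command line, assuming a gh CLI recent enough to ship the `gh variable` commands (the variables can equally be created in the repository's Actions variables settings):

```sh
# The value is a JSON string, so the inner double quotes are part of it.
gh variable set UCI_GO_TEST_RUNNER_MACOS --body '"macos-12"'

# The value is a JSON array of self-hosted runner labels.
gh variable set UCI_GO_TEST_RUNNER_UBUNTU --body '["this", "is", "my", "self-hosted", "runner"]'
```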
What are these labels? Each label (except for `go-libp2p`) makes sense on its own, but I don't understand how they compose.
We use https://github.com/philips-labs/terraform-aws-github-runner to set up our self-hosted runners. It gives the runners `os`, `architecture` and `self-hosted` labels by default.
Additionally, we label our self-hosted runners with the runner group they belong to, see https://github.com/pl-strflt/tf-aws-gh-runner/blob/main/runners.tf#L150. In this case, the runner group we're using is `go-libp2p`, defined at https://github.com/pl-strflt/tf-aws-gh-runner/blob/main/runners.tf#L80.
As you can see by looking at the available runner types, we've been grouping the runners by repository so far. As the solution gains popularity, I think it would make for an easier setup to group the runner types by size, and only break out of this scheme if some repository really needs a dedicated pool of runners.
I added size-based groups to our self-hosted solution. The labels in this repo now read: `['self-hosted', 'linux', 'x64', '2xlarge']`.
CI is taking a long time to run our builds. We currently run `go test` and `go test -race` sequentially; these could be separate jobs.
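A minimal sketch of the split, assuming the whole module is tested with `./...` (the actual job definitions live in the Unified CI workflows):

```sh
# Job 1: plain unit tests.
go test ./...

# Job 2: unit tests with the race detector, run as a separate CI job so the
# two runs no longer execute back-to-back on the same runner.
go test -race ./...
```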