Open Pawka opened 1 year ago
P.S. We might want to make this feature opt-in by providing threshold via flag.
Thank you for the PR! This would be a great thing to improve. It has been a little while since I've looked at this code. I believe the only special case it was handling previously was when no packages had any timing data. In that case it was using round robin to distribute the packages.
Generalizing it so that all packages with 0 timing data are round robin distributed seems like a great improvement. A couple questions for you about this approach. What's the reason for using a threshold of 5ms instead of using 0? Is it common for your packages to start out with very little test?
What do you think about an approach like this:
TotalZeroTime
field to the bucket
struct, which will count the number of packages assigned to that bucket that had 0 elapsed timeTotalZeroTime
. That way new packages should be evenly distributed between the buckets, starting with the bucket with the least amount of time.Thanks for reply!
What's the reason for using a threshold of 5ms instead of using 0?
It is just arbitrary number which I've picked to treat a test as negligible. Even 100 of 5ms tests won't increase duration of a bucket significantly so it might make sense to distribute those among buckets. But I feel we might want to make this value configurable via flag. What do you think?
As a special case we could avoid the largest bucket when playing packages with 0 elapsed time. <...>
100% to this. It is better to avoid adding additional tests to the "longest" bucket. I was thinking is it enough just to exclude one or some arbitrary part of buckets such as 20% but no less than 1?
Few comments on round robin vs consistent assigning since we need to agree which path to choose.
I see two ways how ci-matrix
can be used:
ci-matrix
distribution file is reused for extended period. Benefit of this approach is that same tests always will be assigned to the same shards what might help to achieve higher Golang cache hit rate.While cache hit rate for 0s packages is not important but new packages, which are not in distribution file, are also treated as 0s. But in reality those will run longer. If the 2nd approach is used - it is better to keep consistency and assign specific package to the same bucket so build cache could be reused. To stay consistent round robin approach can not be used because a new package would shift all packages after it by one.
Let me know if you think this is something what you'd like to adopt and I can make changes to this PR.
After some investigation I've found that the best way is just to ignore packages without tests at all. Running go test
for a package with no tests still takes time. In my experiment running go test
for ~70 empty packages took ~30-120s (with no cache). This might become significant overhead for huge repositories.
@dnephin what do you think about excluding packages from matrix if Elapsed
value is 0?
To stay consistent round robin approach can not be used because a new package would shift all packages after it by one.
One important detail, that may be not be obvious, is that the packages are sorted by elapsed time before attempting to place them into buckets. This means that packages with 0 elapsed time will always be last. I think that makes it safe to round-robin them, because only new packages or packages with no tests could come later. It will never impact existing packages with non-zero elapsed time.
ignore packages without tests at all
How do we know if the package has no tests, or if the package is new and has no previous runtime? I'm not sure if this is safe, because it sounds like it could accidentally skip tests in new packages.
In my experiment running go test for ~70 empty packages took ~30-120s
Do you think it is common for projects to have that many packages without tests? I've seen projects with a few small packages with no tests (maybe 10-20 packages), but generally the time to compile those packages is negligible compared to the time required to run all the other tests.
Yeah, I agree with you - standard project probably won't have much packages without tests.
On a flip side empty packages can be excluded without any changes to gotestsum
- by using go list -json ./... | jq 'select(.TestGoFiles | length > 0) | .ImportPath' -r
instead of simple go list ./...
. So probably it is not worth to extend gotestsum
here. Sorry for jumping back and forward :)
I'll update this PR in near time to support round robin approach.
Currently all new packages which are not in timing files are assigned to a single bucket with shortest duration. This happens because duration of "unknown" packages is set to 0 but in reality it might be longer. There are also empty packages with no tests which have 0s duration but there is little overhead to run those tests.
This patch changes how those are distributed between shards to avoid all all of them added to a single bucket. If package run duration is lower than threshold - assign it by using hash function instead of minimal bucket. Hashing function is consistent what means the same package is assigned to the same bucket if package's name and count of buckets is not changed. This helps to reuse go build cache.
Bucket name is retrieved by calculating md5 hash of package's name, then taking part of hash, converting to integer and calculating modulus by buckets count. Only a part of hash is converted to integer to avoid overflow - converting md5 hex to integer might give a result bigger than
int64
. I've tested distribution of such method and it seems to be pretty balanced. The 1st column shows how many packages were assigned to the bucket.This approach can be used when sharding strategy is not updated on each test run but is reused for extended period (e.g. week) because new tests will be distributed among shards.