Closed GMNGeoffrey closed 1 month ago
For the first class, maybe we can try other compression formats, like LZMA2 (.xz)?
I randomly tested the parallel LZMA2 compression (`XZ_DEFAULTS="-T 32" tar --xz -cf a.tar.xz iree-build`) with 32, 96, and 128 threads, while the archiving without compression (`tar -cf a.tar iree-build`) took 50s on my machine.
The uncompressed archive is 20G and the LZMA2 compressed archive is 3G.
Ah that will help with archive creation time at least. I foolishly assumed that compression would be parallel by default. Network transfer is still going to be an issue. I'd like to defer moving to a different storage solution because then we need to figure out how to ensure that a presubmit runner can't leave any artifacts that could pollute future runs.
It looks like xz gives better compression, but is actually much slower to compress than other methods (https://linuxreviews.org/Comparison_of_Compression_Algorithms). I'll do some experimenting. Thanks for the suggestion
Well my IO sucks, so I'm getting way longer times. Tried things out on a ramdisk and based on that I think that zstd is probably best using a middling compression level
I actually don't quite understand how the build dir got to be 17G either. I swear it was 5G before. Also unclear why even on a ram disk, xz is slower for me than it is for you. I guess I really should've tested this on the actual VMs we're using instead of my personal one... Anyway, I think we're probably going to want some combination of deleting unnecessary stuff from the build dir, compressing in parallel, and doing as much work as possible in a ram disk on the VM. Then we can figure out how to use our own storage for artifacts later.
Ok for a first pass at removing unnecessary stuff, we can delete all static libraries and object files. That drops 10G
$ find ~/build/iree/build_all \( -name "*.a" -o -name "*.o" \) -delete
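Before deleting anything, the same `find` pattern can be used to measure what those files cost, non-destructively (a sketch using GNU `du`; `--files0-from=-` reads NUL-delimited paths from stdin):

```shell
# Sum the sizes of all static libraries and object files without
# deleting them; the final line is the combined total.
find ~/build/iree/build_all \( -name "*.a" -o -name "*.o" \) -print0 \
  | du -ch --files0-from=- \
  | tail -1
```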
That makes the biggest directories `tools/` and `compilers/`, which seems more like how it should be. All tests pass. Does anyone foresee any issues with deleting all these files?
shouldn't cause issues AFAIK
Of course, IRL we'd just not archive them
$ tar --exclude '*.a' --exclude '*.o' -c -I 'zstd -10 -T0' -f build_all.tar.zst build_all
Ok, this at least got things to make progress at all, but transfer speeds are dirt slow: https://github.com/iree-org/iree/runs/7508575022. Maybe we will need to roll our own storage already. Another place GitHub Actions loses to Buildkite: the latter lets you specify your own backing store :-(
Lol 17 minutes to upload. Yeah that's going to be a problem. Even if we could somehow compress by another factor of two, that's not going to fly. Testing was fast (22s), but all the Vulkan tests failed because I ran them in the base container without SwiftShader. Ideas for how we can segregate artifact upload ACLs to our own storage by commit or workflow run would be most welcome.
Ah, how about including the artifact digest in the job outputs and the artifact file name. Then a runner could write some other file, but nothing should ever want to download it. We can also give the runners only "create" permissions, so that they can't overwrite any existing artifacts produced. We'd still want to use the digest, since it means that the path where an artifact will be won't be predictable
Eh, actually I think I take back that last bit. If something has already put an artifact there, the job will fail to upload since it doesn't have overwrite permissions. So we can just use "folders" (GCS doesn't really have folders) by commit hash.
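If we go the GCS route, that scheme could look roughly like this (everything here is an assumption for illustration: the bucket name, the path layout, and that runners get only object-create permission). `gsutil cp -n` is no-clobber, so an existing object is never overwritten:

```shell
# Hypothetical upload step; bucket name and layout are illustrative.
BUCKET="gs://iree-ci-artifacts-example"
COMMIT="$(git rev-parse HEAD)"
ARCHIVE="build_all.tar.zst"
# The content digest goes into the job outputs and into the object name,
# so consumers only download the exact artifact the producer declared.
DIGEST="$(sha256sum "${ARCHIVE}" | cut -d' ' -f1)"
# -n: no-clobber. Combined with create-only ACLs, nothing can replace an
# artifact that has already been uploaded for this commit.
gsutil cp -n "${ARCHIVE}" "${BUCKET}/${COMMIT}/${DIGEST}-${ARCHIVE}"
```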
That's more like it: https://github.com/iree-org/iree/runs/7512948238. 93s to compress the archive, 15s to upload it, 5s to download it, and 238s to extract it. So decompression is pretty slow here. These times suggest we probably want to do less compression.
Bringing xz back to compression level 3 (default 6): https://github.com/iree-org/iree/runs/7513234765. 35s to compress, 21s to upload, 4s to download, 255s to extract. :-/ no help to decompression, but a bit faster overall. I think that we've got something workable here though. The whole thing completes in under 15 minutes, which seems OK for starting out. It's faster than our slowest Kokoro build (which usually takes like 20 minutes). We can optimize after we kill Kokoro.
Have you tried stripping all executables? Some kind of find piped to xargs strip.
You mean of debug info? I think we want to keep that for running the tests
Well, then, it is going to be huge and there isn't much to be done about it. Although some savings can be had by tweaking flags to only have line numbers (gmlt).
Switching the compiler to shared library builds would help with the big, duplicated binary problem a lot. It does come with a measurable performance cost though and should not be done for runtime code.
Then also tweaking debug flags down to just gmlt.
Finally, enabling split DWARF separates debug symbols into their own files, and if you get the flags right, will effectively dedup them across executables/archives/etc. There are likely some compression time benefits from not having the highly compressible debug data intermixed with large binaries.
All of those options except gmlt are likely useful for regular dev flows as well.
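Taken together, the three suggestions above might look something like this at configure time (a sketch only: these are the usual Clang flag spellings and generic CMake cache variables, not a verified IREE configuration):

```shell
# -gline-tables-only ("gmlt"): keep just line-number debug info.
# -gsplit-dwarf: emit debug info into .dwo files next to the objects,
#   deduping it out of the linked executables and archives.
# BUILD_SHARED_LIBS=ON: build the compiler as shared libraries instead
#   of statically linking the same code into every tool.
cmake -G Ninja -B ../iree-build \
  -DBUILD_SHARED_LIBS=ON \
  -DCMAKE_C_FLAGS="-gline-tables-only -gsplit-dwarf" \
  -DCMAKE_CXX_FLAGS="-gline-tables-only -gsplit-dwarf" \
  .
```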
> then, it is going to be huge and there isn't much to be done about it
Yeah, I was more looking for ways to isolate only the files we need for testing (e.g. installable tests). I think we get most of that savings by excluding the intermediate compilation artifacts, as I've done. I tried out using GCS and I think it's workable, though it still could use some optimization (decompression is still pretty slow and looks IO bottlenecked). I'll run my idea past security to see if I'm missing some way that this significantly increases any risks we care about.
I did some experimenting on a VM that actually matches our CI machines. zstd is way faster at decompressing than xz and somewhat faster at compressing (for a full iree build dir). I'm currently doing some quick benchmarking to determine the best compression level (we're still talking about minutes on the CI, so seems worth getting right) and then I'll push a new image with zstd installed and switch to that.
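For picking the level, zstd has a built-in benchmark mode that reports compression ratio plus compress/decompress speed per level (a sketch, assuming `zstd` is installed; the sample input here is a stand-in for the real build-dir tarball):

```shell
# Make a compressible stand-in input (in practice: the real tarball),
# then run zstd's built-in benchmark across levels 3..12. Each output
# line reports the ratio and speeds for one level; -T0 uses all cores.
yes "iree build artifact" | head -c 4000000 > sample.bin
zstd -T0 -b3 -e12 sample.bin
```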
Drive by: We set up a Google Filestore NFS for our cloud infra and just pass pointers to the data, which is fronted by Jenkins on-demand VMs.
I think that's pretty similar to what we're doing. It looks like Filestore may be faster, but a disadvantage is that I don't think we're going to be able to do things like only allowing creating an object and not editing it. I'll take a look to see if it's better for our needs, thanks
Yeah, we use it for distributed synchronization of jobs etc., and then when done, the artifacts we care about get `gsutil cp`'d into wherever they have to go. Filestore has two tiers and you can get the faster tier if you have a lot of nodes accessing it. You can also have a git mirror, and all CI jobs can clone with reference so all the checkout time is cut short.
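The clone-with-reference trick works roughly like this (a self-contained demo in a temp dir; in CI the mirror would live on shared storage such as Filestore, and the clone URL would be the real remote — paths here are throwaway stand-ins):

```shell
# Demo: a bare "mirror" plus a clone that references it.
cd "$(mktemp -d)"
git init -q upstream
git -C upstream -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m "initial commit"

# One-time: maintain a mirror (in CI: on NFS / a shared disk).
git clone -q --mirror upstream mirror.git

# Per job: clone with --reference; objects are borrowed from the mirror
# via .git/objects/info/alternates instead of being re-transferred.
git clone -q --reference "$PWD/mirror.git" upstream checkout
cat checkout/.git/objects/info/alternates
```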
I'm going to stick with what we have for now and revisit Filestore after the migration. Filestore also looks like it might be promising for a shared ccache that could be read-only for the presubmit runners and read-write for the postsubmit. For that we really would want something that behaves like a file system. Any NFS file system may be too slow to be worth it for that though. I had been thinking about copying the cache over to local storage (or even a ramdisk) when the runner starts up. There are also multi-writer persistent disks: https://cloud.google.com/compute/docs/disks/sharing-disks-between-vms, although those look like they're a bit fiddly and they're only available for N2 instances for some reason. I think this is all something we can address as optimizations post migration though. What we've got now is sufficiently fast to get us there.
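The copy-to-local-storage idea could be sketched as a runner-startup step (the paths, tmpfs size, and NFS mount point are all assumptions for illustration):

```shell
# Mount a ramdisk, seed it from the shared read-only cache, and point
# ccache at the local copy for the duration of the job.
sudo mount -t tmpfs -o size=8G tmpfs /mnt/ccache-local
rsync -a /mnt/nfs/ccache/ /mnt/ccache-local/
export CCACHE_DIR=/mnt/ccache-local
```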
@GMNGeoffrey Is this still open?
Sort of... I'm still not totally satisfied with what we have and there are some clear optimizations, but I'm not actively working on improving it right now. I'm not sure whether it's useful to keep this open or not
Suggest we probably downgrade in priority or move status to the backlog. Thoughts?
Renamed, dropped priority, and moved to backlog
Unassigning myself from issues that I'm not actively working on
Migrating more build/test jobs to the pkgci style and using whatever magic Stella set up with local git mirrors (discord discussion here, workflow file here) should both help with the issues here
Closing this old issue. I've migrated most jobs to "pkgci", which passes about 100MB of data (just the python wheels) between jobs using GitHub's artifacts. Storage operations are fast enough with that little data that we don't need a custom cloud solution or other optimization.
We've got a workflow that builds and tests the IREE runtime in two different jobs. One job builds and uploads the build directory as an artifact and another fetches the artifact and runs the tests. This adds some overhead, but basically works.
I've tried to do the same thing for the full build and the build directory is prohibitively large (a bit over 5G). It takes over half an hour to compress and upload. We need to come up with another solution here. Ideas:
- Make passing the whole build dir work:
- Make it so we don't have to pass the whole build dir around:

We could try opportunistically removing things like parts of the `third_party/llvm-project` directory from the archive. With a quick test, there's stuff we need under `llvm`, but we can remove several gigs by dropping `llvm/lib`. This is pretty ad hoc, so it would be nice to have a more systematic way to do this. @stellaraccident any ideas? Especially on the second class of options.