iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

Improve efficiency of sharing artifacts between multiple CI machines #9881

Closed GMNGeoffrey closed 1 month ago

GMNGeoffrey commented 2 years ago

We've got a workflow that builds and tests the IREE runtime in two different jobs. One job builds and uploads the build directory as an artifact and another fetches the artifact and runs the tests. This adds some overhead, but basically works.

I've tried to do the same thing for the full build, but that build directory is prohibitively large (a bit over 5G); it takes over half an hour to compress and upload. We need to come up with another solution here. Ideas:

Make passing the whole build dir work:

  1. Switching to GCS storage might give us better upload performance, since it will have a better network connection to the VM (see the sketch after this list). That doesn't help with creating the archive, though, which takes 20 minutes if we do the compression locally (https://github.com/iree-org/iree/runs/7476324873). If we don't compress locally, then creating the archive is fast, but I anticipate that even GCS won't save us there. The GitHub artifact upload took almost half an hour before the MIG killed the runner because it wasn't using enough CPU to look in use (https://github.com/iree-org/iree/runs/7476136990).
  2. We could try using a shared multi-writer disk (https://cloud.google.com/compute/docs/disks/sharing-disks-between-vms) between VMs. That's pretty problematic on presubmit, since we want those runners to be ephemeral and not allowed to leave stuff around. It could potentially be achieved with some shenanigans where the disk lives only as long as the runners and they somehow coordinate, but that's pretty fiddly.
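
A minimal sketch of what option 1 might look like, assuming a hypothetical `iree-ci-artifacts` bucket and a runner that already has `gsutil` credentials; none of these names come from our current setup:

```shell
# Archive without local compression and let gsutil parallelize the upload.
# Bucket name and paths are placeholders for illustration only.
tar -cf build_all.tar build_all
gsutil -o GSUtil:parallel_composite_upload_threshold=150M \
  cp build_all.tar "gs://iree-ci-artifacts/${GITHUB_SHA}/build_all.tar"
```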

Make it so we don't have to pass the whole build dir around:

  1. At some point we had an experiment with installable tests. It would be really nice to have that back and cut things down to only the dependencies actually needed for tests.
  2. More than 80% of the build directory is third_party/llvm-project. We could try opportunistically removing parts of this directory from the archive (e.g. something like the sketch below). From a quick test, there's stuff we need under llvm, but we can remove several gigs by dropping llvm/lib. This is pretty ad hoc, so it would be nice to have a more systematic way to do it.
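
As a rough sketch of that second idea (the exact paths are guesses based on the layout described above and would need to be checked against a real build tree):

```shell
# Drop the bulkiest LLVM outputs from the archive while keeping the rest of
# the build dir; adjust the exclude patterns to whatever the tests need.
tar --exclude 'third_party/llvm-project/llvm/lib' \
  -czf build_all.tar.gz build_all
```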

@stellaraccident any ideas? Especially on the second class of options.

pzread commented 2 years ago

For the first class, maybe we can try other compression formats, like LZMA2 (.xz)?

I did some quick tests of parallel LZMA2 compression (`XZ_DEFAULTS="-T 32" tar --xz -cf a.tar.xz iree-build`) with 32, 96, and 128 threads:

For comparison, archiving without compression (`tar -cf a.tar iree-build`) took 50s on my machine.

The uncompressed archive is 20G and the LZMA2 compressed archive is 3G.

GMNGeoffrey commented 2 years ago

Ah that will help with archive creation time at least. I foolishly assumed that compression would be parallel by default. Network transfer is still going to be an issue. I'd like to defer moving to a different storage solution because then we need to figure out how to ensure that a presubmit runner can't leave any artifacts that could pollute future runs.

It looks like xz gives better compression but is actually much slower to compress than other methods (https://linuxreviews.org/Comparison_of_Compression_Algorithms). I'll do some experimenting. Thanks for the suggestion.

GMNGeoffrey commented 2 years ago

Well, my IO sucks, so I'm getting way longer times. I tried things out on a ramdisk, and based on that I think zstd at a middling compression level is probably the best option.

```shell
$ du -sh ~/build/iree/build_all
17G     /usr/local/google/home/gcmn/build/iree/build_all
$ sync; sudo su -c 'echo 3 > /proc/sys/vm/drop_caches'
$ time tar -c -I 'xz -9 -T0' -f /tmp/build_all.tar.xz ~/build/iree/build_all
tar: Removing leading `/' from member names

real    8m38.933s
user    147m50.837s
sys     2m27.689s
$ sync; sudo su -c 'echo 3 > /proc/sys/vm/drop_caches'
$ time tar -c -I 'pigz' -f /tmp/build_all.tar.gz ~/build/iree/build_all
tar: Removing leading `/' from member names

real    6m30.903s
user    14m49.111s
sys     4m17.362s
$ sync; sudo su -c 'echo 3 > /proc/sys/vm/drop_caches'
$ time tar -I"zstd -19 -T0" -cf /tmp/build_all.tar.zstd ~/build/iree/build_all
tar: Removing leading `/' from member names

real    7m6.323s
user    147m52.863s
sys     1m8.628s
$ ls -lht /tmp/build_all.tar.*
-rw-r----- 1 gcmn primarygroup 2.4G Jul 25 09:41 /tmp/build_all.tar.zst
-rw-r----- 1 gcmn primarygroup 3.8G Jul 25 09:17 /tmp/build_all.tar.gz
-rw-r----- 1 gcmn primarygroup 2.1G Jul 25 09:00 /tmp/build_all.tar.xz
```

This looks IO bound to me. Let's put everything on a ramdisk and find out:

```shell
$ time tar -c -I 'pigz' -f build_all.tar.gz build_all

real    0m31.429s
user    13m48.482s
sys     1m50.006s
$ time tar -c -I 'xz -9 -T0' -f build_all.tar.xz build_all

real    3m9.907s
user    206m25.682s
sys     6m5.922s
# Ok we don't actually need max compression
$ time tar -c -I 'xz -T0' -f build_all.tar.xz build_all

real    2m5.667s
user    184m1.213s
sys     1m26.608s
$ du -h build_all.tar.xz
2.3G    build_all.tar.xz
$ time tar -c -I 'zstd -T0' -f build_all.tar.zst build_all

real    0m17.669s
user    1m31.609s
sys     0m22.931s
$ du -h build_all.tar.zst
3.4G    build_all.tar.zst
$ time tar -c -I 'zstd -10 -T0' -f build_all.tar.zst build_all

real    0m23.273s
user    11m23.647s
sys     0m29.309s
$ du -h build_all.tar.zst
3.0G    build_all.tar.zst
$ time tar -c -I 'zstd -15 -T0' -f build_all.tar.zstd build_all

real    1m14.890s
user    44m12.739s
sys     0m32.693s
$ du -h build_all.tar.zst
2.9G    build_all.tar.zst
$ time tar -c -I 'zstd -19 -T0' -f build_all.tar.zst build_all

real    4m20.369s
user    151m39.581s
sys     0m38.468s
$ du -h build_all.tar.zst
2.4G    build_all.tar.zst
```

Indeed. Stupid HDD.

I actually don't quite understand how the build dir got to be 17G either. I swear it was 5G before. It's also unclear why, even on a ramdisk, xz is slower for me than it is for you. I guess I really should've tested this on the actual VMs we're using instead of my personal one... Anyway, I think we're probably going to want some combination of deleting unnecessary stuff from the build dir, compressing in parallel, and doing as much work as possible in a ramdisk on the VM. Then we can figure out how to use our own storage for artifacts later.

GMNGeoffrey commented 2 years ago

Ok, for a first pass at removing unnecessary stuff, we can delete all static libraries and object files. That drops 10G:

```shell
$ find ~/build/iree/build_all \( -name "*.a" -o -name "*.o" \) -delete
```

That makes the biggest directories tools/ and compilers/, which seems more like how it should be. All tests pass. Does anyone foresee any issues with deleting all these files?

benvanik commented 2 years ago

shouldn't cause issues AFAIK

GMNGeoffrey commented 2 years ago

Of course, IRL we'd just not archive them:

```shell
$ tar --exclude '*.a' --exclude '*.o' -c -I 'zstd -10 -T0' -f build_all.tar.zst build_all
```

GMNGeoffrey commented 2 years ago

Ok, this at least got things to the point of making progress at all, but transfer speeds are dirt slow: https://github.com/iree-org/iree/runs/7508575022. Maybe we will need to roll our own storage already. Another place GitHub Actions loses to Buildkite: the latter lets you specify your own backing store :-(

GMNGeoffrey commented 2 years ago

Lol, 17 minutes to upload. Yeah, that's going to be a problem. Even if we could somehow compress by another factor of two, that's not going to fly. Testing was fast (22s), but all the Vulkan tests failed because I ran them in the base container without SwiftShader. Ideas for how we can segregate artifact upload ACLs to our own storage by commit or workflow run would be most welcome.

GMNGeoffrey commented 2 years ago

Ah, how about including the artifact digest in the job outputs and in the artifact file name? Then a runner could write some other file, but nothing would ever want to download it. We can also give the runners only "create" permissions, so that they can't overwrite any existing artifacts. We'd still want to use the digest, since it means that the path where an artifact ends up won't be predictable.

GMNGeoffrey commented 2 years ago

Eh, actually I think I take back that last bit. If something has already put an artifact there, the job will fail to upload since it doesn't have overwrite permissions. So we can just use "folders" (GCS doesn't really have folders) by commit hash.
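
A sketch of that scheme, with made-up names; the key point is that the runner's service account would only hold an object-create role (e.g. roles/storage.objectCreator), so re-uploading to an existing path fails instead of overwriting:

```shell
# Upload the archive into a per-commit "folder" in a hypothetical bucket.
ARTIFACT=build_all.tar.zst
gsutil cp "${ARTIFACT}" "gs://iree-ci-artifacts/${GITHUB_SHA}/${ARTIFACT}"
```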

GMNGeoffrey commented 2 years ago

That's more like it: https://github.com/iree-org/iree/runs/7512948238. 93s to compress the archive, 15s to upload it, 5s to download it, and 238s to extract it. So decompression is pretty slow here. These times suggest we probably want to do less compression.

Dialing xz back to compression level 3 (the default is 6): https://github.com/iree-org/iree/runs/7513234765. 35s to compress, 21s to upload, 4s to download, and 255s to extract. :-/ No help for decompression, but a bit faster overall. I think we've got something workable here, though. The whole thing completes in under 15 minutes, which seems OK for starting out. It's faster than our slowest Kokoro build (which usually takes around 20 minutes). We can optimize after we kill Kokoro.

stellaraccident commented 2 years ago

Have you tried stripping all executables? Some kind of find piped to xargs strip.
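
One possible incarnation of that, as an untested sketch (a real version would probably want to filter out non-ELF files such as shell scripts, which `strip` will complain about):

```shell
# Strip debug info from every executable file under the build dir, in place.
find build_all -type f -executable -print0 | xargs -0 strip --strip-debug
```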

GMNGeoffrey commented 2 years ago

> Have you tried stripping all executables? Some kind of find piped to xargs strip.

You mean of debug info? I think we want to keep that for running the tests.

stellaraccident commented 2 years ago

Well, then it is going to be huge and there isn't much to be done about it. Although some savings can be had by tweaking flags to only have line numbers (gmlt).

stellaraccident commented 2 years ago

Switching the compiler to shared library builds would help with the big, duplicated binary problem a lot. It does come with a measurable performance cost though and should not be done for runtime code.

Then also tweaking debug flags down to just gmlt.

Finally, enabling split DWARF separates debug symbols into their own files and, if you get the flags right, will effectively dedup them across executables/archives/etc. There are also likely some compression-time benefits from not having the highly compressible debug data intermixed with large binaries.

All of those options except gmlt are likely useful for regular dev flows as well.
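
For reference, a hedged sketch of what those tweaks might look like with a plain Clang + CMake configuration; whether IREE's build exposes dedicated options for any of this isn't verified here:

```shell
# -gmlt keeps only line-table debug info; -gsplit-dwarf moves debug info into
# separate .dwo files so it isn't duplicated inside every binary/archive.
cmake -G Ninja -B build_all -S . \
  -DBUILD_SHARED_LIBS=ON \
  -DCMAKE_C_FLAGS="-gmlt -gsplit-dwarf" \
  -DCMAKE_CXX_FLAGS="-gmlt -gsplit-dwarf"
```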

GMNGeoffrey commented 2 years ago

> then, it is going to be huge and there isn't much to be done about it

Yeah, I was more looking for ways to isolate only the files we need for testing (e.g. installable tests). I think we get most of that savings by excluding the intermediate compilation artifacts, as I've done. I tried out using GCS and I think it's workable, though it could still use some optimization (decompression is still pretty slow and looks IO-bottlenecked). I'll run my idea past security to see if I'm missing some way that this significantly increases any risks we care about.

GMNGeoffrey commented 2 years ago

I did some experimenting on a VM that actually matches our CI machines. zstd is way faster at decompressing than xz and somewhat faster at compressing (for a full iree build dir). I'm currently doing some quick benchmarking to determine the best compression level (we're still talking about minutes on the CI, so seems worth getting right) and then I'll push a new image with zstd installed and switch to that.
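
Roughly the kind of sweep I mean (paths and compression levels are illustrative; timings obviously depend on the machine):

```shell
# Compare a few zstd levels on the same build dir: wall-clock time vs. size.
for level in 3 6 10 15; do
  /usr/bin/time -f "zstd -${level}: %e s" \
    tar -c -I "zstd -${level} -T0" -f "build_all.${level}.tar.zst" build_all
  du -h "build_all.${level}.tar.zst"
done
```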

powderluv commented 2 years ago

Drive-by: we set up a Google Filestore NFS share for our cloud infra and just pass pointers to the data, fronted by on-demand Jenkins VMs.

GMNGeoffrey commented 2 years ago

I think that's pretty similar to what we're doing. It looks like Filestore may be faster, but a disadvantage is that I don't think we'd be able to do things like only allowing an object to be created and not edited. I'll take a look to see if it's better for our needs, thanks.

powderluv commented 2 years ago

Yeah, we use it for distributed synchronization of jobs and such, and then, when a job is done, the artifacts we care about get gsutil cp'd to wherever they need to go. Filestore has two tiers, and you can get the faster tier if you have a lot of nodes accessing it. You can also host a git mirror, and all CI jobs can clone with reference so the checkout time is cut way down.
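
A sketch of the clone-with-reference part, with a made-up mirror path on the Filestore mount:

```shell
# Maintain a bare mirror on the shared mount (done once / refreshed periodically).
git clone --mirror https://github.com/iree-org/iree /mnt/filestore/iree.git
# Per CI job: borrow objects from the mirror so the clone only fetches what's missing.
git clone --reference /mnt/filestore/iree.git https://github.com/iree-org/iree iree
```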

GMNGeoffrey commented 2 years ago

I'm going to stick with what we have for now and revisit Filestore after the migration. Filestore also looks promising for a shared ccache that could be read-only for the presubmit runners and read-write for postsubmit; for that we really would want something that behaves like a file system, though any NFS file system may be too slow to be worth it. I had been thinking about copying the cache over to local storage (or even a ramdisk) when the runner starts up. There are also multi-writer persistent disks (https://cloud.google.com/compute/docs/disks/sharing-disks-between-vms), although those look like they're a bit fiddly and they're only available for N2 instances for some reason. I think this is all something we can address as optimizations post-migration, though. What we've got now is sufficiently fast to get us there.
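
A sketch of the "copy the cache to local storage at startup" idea, with hypothetical mount and cache paths:

```shell
# Warm a local ccache from the shared (read-only for presubmit) Filestore copy,
# then point ccache at the local directory for the rest of the job.
mkdir -p /tmp/ccache
rsync -a /mnt/filestore/ccache/ /tmp/ccache/
export CCACHE_DIR=/tmp/ccache
```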

allieculp commented 1 year ago

@GMNGeoffrey Is this still open?

GMNGeoffrey commented 1 year ago

Sort of... I'm still not totally satisfied with what we have and there are some clear optimizations, but I'm not actively working on improving it right now. I'm not sure whether it's useful to keep this open or not

allieculp commented 1 year ago

Suggest we downgrade the priority or move this to the backlog. Thoughts?

GMNGeoffrey commented 1 year ago

Renamed, dropped priority, and moved to backlog

GMNGeoffrey commented 1 year ago

Unassigning myself from issues that I'm not actively working on

ScottTodd commented 9 months ago

Migrating more build/test jobs to the pkgci style and using whatever magic Stella set up with local git mirrors (discord discussion here, workflow file here) should both help with the issues here.

ScottTodd commented 1 month ago

Closing this old issue. I've migrated most jobs to "pkgci", which passes about 100MB of data (just the python wheels) between jobs using GitHub's artifacts. Storage operations are fast enough with that little data that we don't need a custom cloud solution or other optimization.