iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

CI runner for test_nvidia_a100 is running out of disk space #17677

Closed by ScottTodd 1 week ago

ScottTodd commented 1 week ago

Following up on https://github.com/iree-org/iree/pull/17661#issuecomment-2165987433

The test_nvidia_a100 CI job has been failing to download the docker image with this error:

docker: failed to register layer: write /var/cuda-repo-ubuntu2004-12-2-local/nsight-systems-2023.2.3_2023.2.3.1001-1_amd64.deb: no space left on device.

Sample logs: https://github.com/iree-org/iree/actions/runs/9519620228/job/26243514700#step:8:60

Debugging shows that the postsubmit runner (iree-persistent-a100-2) has a 100GB disk (https://github.com/iree-org/iree/actions/runs/9520428877/job/26245718633#step:4:11), compared to a 1TB disk for the presubmit runner (https://github.com/iree-org/iree/actions/runs/9520428877/job/26245718166#step:4:11).

Can we recreate the runner with a larger disk? 100GB will be tight to fit a 12GB docker image, build artifacts, and other files.
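For reference, a pre-flight check along these lines could surface the problem before the docker pull fails. This is only a sketch: the function names, the default path "/", and the 20 GiB safety margin are assumptions made for illustration and are not part of the IREE CI scripts. The 12 GiB default mirrors the image size mentioned above.

```python
import shutil

GIB = 1024 ** 3  # bytes per GiB


def free_gib(path="/"):
    """Return the free space, in GiB, on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / GIB


def has_headroom(path="/", required_gib=12.0, margin_gib=20.0):
    """Return True if `path` can hold the image plus a safety margin.

    `required_gib` defaults to the 12GB image size from this issue;
    `margin_gib` is a hypothetical allowance for build artifacts.
    """
    return free_gib(path) >= required_gib + margin_gib
```

A CI step could call has_headroom() before pulling the image and fail early with a clear message, or trigger a cleanup, when it returns False.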

yuennancy commented 1 week ago

I can try, but it might take time. For some reason these are created with a single disk that is also the boot disk, so I have to shut down the instance to change the disk, and it may take time to get another A100.

ScottTodd commented 1 week ago

Postsubmit tests started passing again as of https://github.com/iree-org/iree/commit/1ea21d1acf22b090126eb3d297b875f9072dfefe (2/2 postsubmit runs have passed so far).

Was the runner updated? Even if not, things might be fine for now?

ScottTodd commented 1 week ago

Seems to be stable now.

yuennancy commented 1 week ago

No, I haven't had a chance to redo the runner yet. I was hoping to do it this afternoon. Should I just leave it alone?