Closed by ScottTodd 1 week ago
I can try, but it might take time. For some reason these are created with a single disk that is also the boot disk, so I have to shut down the instance to change the disk, and it may take time to get another A100.
Postsubmit tests started passing again as of https://github.com/iree-org/iree/commit/1ea21d1acf22b090126eb3d297b875f9072dfefe (2/2 postsubmit runs passed so far)
Was the runner updated? Even if not, things might be fine for now?
Seems to be stable now.
No, I haven't had a chance to redo the runner yet. I was hoping to do it this afternoon. Should I just leave it alone?
Following up on https://github.com/iree-org/iree/pull/17661#issuecomment-2165987433
The test_nvidia_a100 CI job has been failing to download the docker image. Sample logs: https://github.com/iree-org/iree/actions/runs/9519620228/job/26243514700#step:8:60
Debugging shows that the postsubmit runner (iree-persistent-a100-2) has a 100GB disk: https://github.com/iree-org/iree/actions/runs/9520428877/job/26245718633#step:4:11 compared to a 1TB disk for the presubmit runner: https://github.com/iree-org/iree/actions/runs/9520428877/job/26245718166#step:4:11
Can we recreate the runner with a larger disk? 100GB will be tight to fit a 12GB docker image, build artifacts, and other files.
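For a rough sanity check of that disk budget on a runner, something like the Python sketch below could be run before the image pull. It is only an illustration: the helper name `has_headroom` and the 50GB allowance for build artifacts and other files are assumptions made for this example, not values taken from the CI configuration. Only the 12GB image size comes from this issue.

```python
import shutil

GB = 10**9  # decimal gigabytes, matching how disk sizes are usually quoted


def has_headroom(free_bytes, image_gb=12, other_gb=50):
    """Return True if free space covers the image plus a budget for everything else.

    image_gb: the 12GB docker image size mentioned in this issue.
    other_gb: hypothetical allowance for build artifacts and other files.
    """
    return free_bytes >= (image_gb + other_gb) * GB


if __name__ == "__main__":
    # Inspect the root filesystem; on these runners the single disk is also the boot disk.
    usage = shutil.disk_usage("/")
    print(f"total={usage.total / GB:.0f}GB free={usage.free / GB:.0f}GB")
    print("enough headroom:", has_headroom(usage.free))
```

Note that a pull can temporarily need more space than the final image size, since compressed layers and their extracted contents may coexist on disk, so the allowance is best set generously.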