mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

Running Stable Diffusion fails with a "no space left on device" error #694

Open gaowayne opened 6 months ago

gaowayne commented 6 months ago

Please see the error below. I have a 28 TB disk, so how can I work around this problem?

 => => sha256:f0d07955f406a98d2a3ef5cd3cb1c559339dc6cbbbe209914ad33ad7141df159 54.92MB / 54.92MB                                                                                                                             58.7s
 => => sha256:63edc579ff7da32c44f2ecbb8cebdb8c0bc75afea5c55bfc40ca0faa1f25b971 15.71kB / 15.71kB                                                                                                                             58.9s
 => => sha256:8abf5ec0bb1316cac46570b2bc2d49b8628ce20f17ca13f7e77b260e5fc3da6c 512B / 512B                                                                                                                                   59.0s
 => => extracting sha256:a0e57127409620ffbd134fed398941297a33e6ac6666f11a8112b9912fa9c134                                                                                                                                    65.1s
 => => extracting sha256:03a691138ef8873ae8f244bfae84a8f425d6d970d9eb9e02025f09cf3e6ff73e                                                                                                                                     0.0s
 => => extracting sha256:9684ed3c71177bfaf2b3dd14a7f4f396e574be10c175a8eaafeddf71db897ca6                                                                                                                                     0.0s
 => => extracting sha256:c79b46181ee684152af9c05fbf145fd65a879155efe3656904c6782f738ce5a2                                                                                                                                     0.0s
 => => extracting sha256:91b6a7918caa109af3526868ecd34dd58368c1612f97b09c87d547fc550670c2                                                                                                                                     0.0s
 => => extracting sha256:d3ed2fbe7c334026828edfa722cc573ae82bf47aa8f3796d048a4e45c0021743                                                                                                                                     2.2s
 => => extracting sha256:79a28dd493566ad8f77a68bdb8827f450b8e08f6e8cbc0a97fd07b0d9f3e5f59                                                                                                                                     3.9s
 => => extracting sha256:02148fc97997080a40525c593dd6412858d7daf2ebc3fcfd31d113f3c2ce9fff                                                                                                                                     0.0s
 => => extracting sha256:6471a4765a47fc71de42baa86e1da4c9c6cfbd2cd60d683175b12602ae342061                                                                                                                                    17.4s
 => => extracting sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1                                                                                                                                     0.0s
 => => extracting sha256:824e6993c2a6f049af207439226b8f6cea693d6ef5f8751e52dcff31997451c0                                                                                                                                     4.0s
 => => extracting sha256:bcd2dfdccd094718505d87d069d9c0682d8d6285ba39bc5ab7b72a1281d63075                                                                                                                                     0.8s
 => => extracting sha256:499d104c2b1eb9e3bf33835377ec125a657b5f55e42f0ea88d25a92426dff428                                                                                                                                     1.4s
------
 > [1/5] FROM nvcr.io/nvidia/pytorch:22.12-py3@sha256:09a80f272dd173c9d8f28c23a1985aebe2bd3edd41a184ee9634f6e3f8a1f63d:
------
Dockerfile:2
--------------------
   1 |     ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:22.12-py3
   2 | >>> FROM ${FROM_IMAGE_NAME}
   3 |     
   4 |     ENV DEBIAN_FRONTEND=noninteractive
--------------------
ERROR: failed to solve: failed to register layer: write /usr/local/lib/python3.8/dist-packages/cmake/data/bin/ctest: no space left on device
dcg@oq1:/mnt/nvme1n1/mlperf/ubuntu/training/stable_diffusion$ df -h
Filesystem                         Size  Used Avail Use% Mounted on
tmpfs                               13G  3.0M   13G   1% /run
/dev/mapper/ubuntu--vg-ubuntu--lv   98G   84G  9.7G  90% /
tmpfs                               63G     0   63G   0% /dev/shm
tmpfs                              5.0M     0  5.0M   0% /run/lock
/dev/sda2                          2.0G  217M  1.6G  12% /boot
/dev/sda1                          1.1G  6.1M  1.1G   1% /boot/efi
tmpfs                               13G  8.0K   13G   1% /run/user/1000
/dev/nvme1n1                        28T  288G   28T   2% /mnt/nvme1n1
dcg@oq1:/mnt/nvme1n1/mlperf/ubuntu/training/stable_diffusion$ 
ahmadki commented 6 months ago

The base Docker image nvcr.io/nvidia/pytorch:22.12-py3 is over 18 GB. You can use docker info to check where Docker stores its images (/var/lib/docker/overlay2 on Debian-based systems). From your df -h output, the root filesystem that holds that directory has only 9.7 GB available, so the layers cannot be extracted there, even though /mnt/nvme1n1 has 28 TB free.
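One possible workaround is to point Docker's storage at the large disk. The sketch below is only an illustration under a few assumptions: /var/lib/docker is the default storage directory (confirm the real one with docker info), /mnt/nvme1n1/docker is a hypothetical target directory on the 28 TB disk, and 18 GiB is used as a lower bound taken from the image size mentioned above (extraction may need more). It checks free space on both locations and prints a daemon.json body that uses Docker's data-root setting.

```python
import json
import os
import shutil

GIB = 1024 ** 3


def nearest_existing(path: str) -> str:
    """Walk up from `path` until a directory that exists is found."""
    path = os.path.abspath(path)
    while not os.path.exists(path):
        path = os.path.dirname(path)
    return path


def free_gib(path: str) -> float:
    """Free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(nearest_existing(path)).free / GIB


def has_room(path: str, required_gib: float) -> bool:
    """True if the filesystem holding `path` has at least `required_gib` free."""
    return free_gib(path) >= required_gib


def daemon_config(data_root: str) -> str:
    """Render a daemon.json body that sets Docker's `data-root`."""
    return json.dumps({"data-root": data_root}, indent=2)


if __name__ == "__main__":
    docker_root = "/var/lib/docker"   # assumed default; check `docker info`
    new_root = "/mnt/nvme1n1/docker"  # hypothetical directory on the big disk
    required_gib = 18                 # lower bound from the image size above

    for label, path in (("current", docker_root), ("proposed", new_root)):
        verdict = "enough" if has_room(path, required_gib) else "NOT enough"
        print(f"{label}: {path} has {free_gib(path):.1f} GiB free ({verdict})")

    print("Candidate /etc/docker/daemon.json:")
    print(daemon_config(new_root))
```

If the proposed location has enough room, the printed JSON can be merged into /etc/docker/daemon.json (keeping any settings already there), and the Docker daemon then needs a restart, for example with sudo systemctl restart docker on systemd-based systems. Images already pulled under the old directory are not moved automatically.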

nv-rborkar commented 4 months ago

@gaowayne if your issue is resolved, can you please close this?