hollowstrawberry / kohya-colab

Accessible Google Colab notebooks for Stable Diffusion Lora training, based on the work of kohya-ss and Linaqruf
GNU General Public License v3.0
599 stars 86 forks

Still getting broken/fried Loras #109

Closed: heartbreakergaming closed this issue 1 month ago

heartbreakergaming commented 6 months ago

While not as bad as before, when they would come out a random color, anything past about 500 steps becomes super saturated and overtrained. And I'm using the same settings I've used for about the past year, which are mainly the default settings, and I try to get my repeats around 200-300. I even retrained LoRAs I've already done in the past with the same settings I used then, and the newer ones are super saturated by epoch 3 at 200 steps per epoch.

heartbreakergaming commented 6 months ago

Another thing to note is that installation used to take ~300 seconds and now it takes 900, and I get this error: "failed to initialize: Found CUDA version 12010, but JAX was built against version 12020, which is newer. The copy of CUDA that is installed must be at least as new as the version against which JAX was built. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info)"
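For anyone decoding that message: the integers are CUDA's packed version format (major * 1000 + minor * 10), so the runtime had CUDA 12.1 installed while JAX was built against 12.2. A minimal illustration of the check the error describes (the helper names here are mine, not JAX's API):

```python
def decode_cuda_version(packed: int) -> str:
    """CUDA packs versions as major*1000 + minor*10, so 12010 -> '12.1'."""
    major, rest = divmod(packed, 1000)
    return f"{major}.{rest // 10}"

def jax_cuda_ok(installed: int, built_against: int) -> bool:
    """JAX requires the installed CUDA to be at least as new as its build version."""
    return installed >= built_against

print(decode_cuda_version(12010))   # '12.1' (what the Colab runtime had)
print(decode_cuda_version(12020))   # '12.2' (what JAX was built against)
print(jax_cuda_ok(12010, 12020))    # False -> exactly the error above
```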

githubnoot commented 6 months ago

I've trained a few LoRAs over the past few days, but now I'm getting hit with the CalledProcessError.

And yeah! Installation takes more time to get settled, which is a huge bummer, since it eats up GPU time and then ends in an error.

hollowstrawberry commented 6 months ago

> another thing to note is that installation used to take 300~ seconds and now it takes 900, and i get this error "failed to initialize: Found CUDA version 12010, but JAX was built against version 12020, which is newer. The copy of CUDA that is installed must be at least as new as the version against which JAX was built. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info"

Installation time sadly fluctuates based on Colab's download speed while it fetches the necessary resources.

This is from just now:

(screenshot of installation time)

ArmyOfPun1776 commented 6 months ago

900 seconds... sheesh. I just had one take almost 15 minutes. It always bottlenecks for me at the CUDA stuff.

Thinking about letting the setup run on a CPU-only runtime, letting it fail, then switching to a GPU runtime... I don't know if that would kill the install, though. It probably will.

githubnoot commented 6 months ago

Here be my time.

(screenshot of installation time)

There are times when it'll take around 13 minutes or so. I used to be able to throw it all together within a minute and happily run off to the kitchen for some snackies. But now I gotta babysit it more so I can shut it down after it errors out.

I went back and tweaked the trainer to 5 epochs and also adjusted the unet / tenc learning rates, but all tinkering is slamming me back into the CalledProcessError (returned non-zero exit status 1) again. Glad it worked for me the other day so I could have some dopamine with some LoRAs prevailing. But I am back to wallowing in sadness noises.
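CalledProcessError is raised by Python's subprocess module whenever a child command exits non-zero, and by itself it hides the command's own error output. A small sketch of wrapping a command so the real stderr is surfaced (the command here is a stand-in, not the notebook's actual training call):

```python
import subprocess
import sys

def run_and_report(cmd: list[str]) -> str:
    """Run a command; on failure, raise with the captured stderr instead of
    the bare 'returned non-zero exit status' message."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{cmd[0]} failed with status {result.returncode}:\n{result.stderr.strip()}"
        )
    return result.stdout

# A stand-in command that fails, to show the underlying error surfacing:
try:
    run_and_report([sys.executable, "-c", "raise SystemExit('boom')"])
except RuntimeError as err:
    print(err)  # the message now includes 'boom' from the child's stderr
```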

ArmyOfPun1776 commented 6 months ago

> Here be my time.
>
> There are times where it'll take around 13 minutes or so. I used to be able to throw it all together within a minute and happily run off to the kitchen for some snackies. But now I gotta babysit it more so I can shut it down after getting errors.
>
> I went back and tweaked the trainer to having 5 epochs and also adjusting the unet / tenc, but all tinkering is slamming me back to the CalledProcessError / returned non-zero exit status 1 again.

Huh... I just, like, just finished training on 10 epochs, 40 imgs, 2 repeats, 2e-4, 1e-4, 768 res, 64:32 nets, and I'm getting this result (settings and result grid screenshots). Those are bare gens, just the LoRA at weight 0.7 and the activator.
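For comparing the runs in this thread: total optimizer steps in kohya-style training come out to roughly images x repeats x epochs / batch size (assuming no regularization images), so the run above is 800 steps at batch size 1. A quick sketch:

```python
import math

def total_steps(num_images: int, repeats: int, epochs: int, batch_size: int = 1) -> int:
    """Optimizer steps for a kohya-style run: ceil(images*repeats/batch) per epoch."""
    return math.ceil(num_images * repeats / batch_size) * epochs

print(total_steps(40, 2, 10))                # 800 steps at batch size 1
print(total_steps(40, 2, 10, batch_size=2))  # 400 steps at batch size 2
```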

githubnoot commented 6 months ago

> Huh... I just, like just finished training on 10 Epochs, 40 imgs, 2 repeats, 2e-4, 1e-4, 768Res 64:32 nets.

You lucky duck! From my experience, it seems like it can just... vary. There was a point, when it was actively broken for everyone, where I'd run the same settings about 10 times over several days and one would succeed.

I'll try your settings and report back if it still snags on my end or not. (Assuming I didn't run out of GPU for the day.)

githubnoot commented 6 months ago

Same error for me. I'll have to keep looking into it.

Regardless, thanks for posting your info! I know it will help others and is always nice to compare settings.

heartbreakergaming commented 6 months ago

> Here be my time. There are times where it'll take around 13 minutes or so. [...] I went back and tweaked the trainer to having 5 epochs and also adjusting the unet / tenc, but all tinkering is slamming me back to the CalledProcessError / returned non-zero exit status 1 again.
>
> Huh... I just, like just finished training on 10 Epochs, 40 imgs, 2 repeats, 2e-4, 1e-4, 768Res 64:32 nets. [...] Those are bare gens. just Lora at weight .7 and the activator.

I'll try your settings and see if they work better! I prefer not to have to lower the weight, though, since in the past I haven't had to unless I was trying to combine with other LoRAs.

heartbreakergaming commented 6 months ago

> another thing to note is that installation used to take 300~ seconds and now it takes 900, and i get this error "failed to initialize: Found CUDA version 12010, but JAX was built against version 12020, which is newer. [...]"
>
> Installation time sadly fluctuates based on the download speed of colab while it gets the necessary resources
>
> This is from just now: (screenshot of installation time)

Yeah, it's definitely not always like that. I've tried a few more times since posting this issue; sometimes it's in the 200s, and the longest I've seen was 1500 seconds. I figured I'd mention it regardless. I'm more concerned about the images still being oversaturated than about the install time.

dmikey commented 6 months ago

Training rates are weird. Still pulling data and will report back, but running the same datasets through results in very, very ugly weights.

heartbreakergaming commented 6 months ago

I tried lowering the unet and honestly the difference is minimal; it's a little better. I tried your settings with 40 imgs, and it's still pretty rough past step 500, though not super saturated. Honestly, I wonder if something changed in kohya's scripts in general and it now simply takes fewer steps to train than before. How would I go about transferring these settings to kohya's scripts so I can try training locally?

ArmyOfPun1776 commented 6 months ago

> Training rates are weird, still pulling data will report back, but running the same data sets through results in very very ugly weights.

This is a weird one. I trained another LoRA on the same settings I used before (10 epochs, 2 repeats, 40 images, 64:32 nets, 2e-4, 1e-4) and got fantastic results. I'm wondering what variable could possibly be giving y'all this issue. I have to assume it's a setting, seeing as I can train and you can't; the Colab should be a static runtime environment.

What resolution are you training at, what runtime are you using, and how many images are in your dataset? I train at 768-1024 on the T4 with 40 images.

ArmyOfPun1776 commented 6 months ago

> I tried lowering the unet and honestly the difference is minimal, its a little better, but i tried your settings with 40 imgs, and its still pretty rough past step 500, though not super saturated. Honestly i wonder if something changed with kohya's scripts in general, and now it simply takes less steps to train, compared to before. How would i go about transferring these settings to kohya's scripts so i cant try training locally?

You could grab the kohya_ss extension if you use A1111. The process is a little more involved and I'd recommend using a guide to set up the first training run, but all the settings used in the Colab are there, and then some. I never messed with the raw scripts, so I'm useless to you there, sorry.
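For the raw-scripts route: these Colabs ultimately drive kohya-ss's train_network.py, so the notebook fields map onto its CLI flags. A rough sketch of that mapping using the settings from this thread (flag names follow sd-scripts, but the model filename and directory names are placeholders; verify everything against your local checkout):

```python
# Sketch: Colab settings from this thread expressed as train_network.py flags.
# Values mirror "10 epochs, 40 imgs, 2 repeats, 2e-4 / 1e-4, 768 res, 64:32 nets".
settings = {
    "pretrained_model_name_or_path": "anylora.safetensors",  # placeholder filename
    "train_data_dir": "dataset",   # sd-scripts encodes repeats in folder names like "2_concept"
    "resolution": "768,768",
    "max_train_epochs": 10,
    "train_batch_size": 2,
    "unet_lr": 2e-4,
    "text_encoder_lr": 1e-4,
    "network_module": "networks.lora",
    "network_dim": 64,             # "64:32 nets" = network dim 64, alpha 32
    "network_alpha": 32,
    "output_dir": "output",
}

args = [f"--{key}={value}" for key, value in settings.items()]
print("accelerate launch train_network.py " + " ".join(args))
```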

heartbreakergaming commented 6 months ago

> > Training rates are weird, still pulling data will report back, but running the same data sets through results in very very ugly weights.
>
> This is a weird one. I trained another LoRa on the same settings I used before (10 Epochs 2 Repeats 40 images at 64:32 nets 2e-4, 1e-4) Got fantastic results. [...]
>
> What resolutions are you training at, what Runtime are you using and how many images in your dataset? I train at 768-1024 on the T4 with 40 images.

The one I just did was 40 images, 3 repeats, 10 epochs, and the rest of the settings were the same as yours, though the resolution was 640. I'm training anime, by the way; I tried with both the anime and AnyLoRA models. But yeah, I'll try copying the settings into the kohya_ss extension and see if it works.

hollowstrawberry commented 6 months ago

I've been training loras for a few days now, both 1.5 and XL, and it seems to work fine with the default settings.

I won't say you're doing something wrong if you encounter this issue, but there may be some unknown factor at play here.

ArmyOfPun1776 commented 6 months ago

I figured out my issue with the XL LoRA.
My LoRA models have sequential names, so I usually "hot swap" them in the prompt. This was doing something funny in A1111, where it would try to load the new LoRA but keep the data from the previous one, making a mess of the output. If I create a whole new prompt with the LoRA included, it works just fine. It's a strange issue, since this isn't a problem I've had with 1.x LoRA hot swapping, but it is what it is, I guess. Just another tick in the case I'm making with myself to name LoRA models independently instead of sequentially.

Diablomarv commented 6 months ago

For some reason the NVIDIA downloads are stalling at around 1 MB/s, so it takes forever to finish the install, and I get this: (screenshot). Do you guys install everything with no GPU, let it fail due to the missing GPU, and then switch to a GPU runtime, or do you just let it waste 20-40 minutes on the first run? Also, is there any way to keep the downloaded files so you don't have to re-download them every time you start a new session?

🏭 Installing dependencies...

```
Cloning into '/content/kohya-trainer'...
Receiving objects: 100% (6262/6262), 9.38 MiB | 15.74 MiB/s, done.
HEAD is now at 9a67e0d Merge pull request #610 from lubobill1990/patch-1
The following NEW packages will be installed: aria2 libaria2-0 libc-ares2
Setting up aria2 (1.36.0-1) ...
/sbin/ldconfig.real: /usr/local/lib/libtbb.so.12 is not a symbolic link
/sbin/ldconfig.real: /usr/local/lib/libtbbbind_2_5.so.3 is not a symbolic link
/sbin/ldconfig.real: /usr/local/lib/libtbbbind_2_0.so.3 is not a symbolic link
/sbin/ldconfig.real: /usr/local/lib/libtbbmalloc.so.2 is not a symbolic link
/sbin/ldconfig.real: /usr/local/lib/libtbbmalloc_proxy.so.2 is not a symbolic link
/sbin/ldconfig.real: /usr/local/lib/libtbbbind.so.3 is not a symbolic link
Collecting accelerate==0.15.0
Collecting diffusers==0.10.2
Collecting transformers==4.26.0
Collecting bitsandbytes==0.41.3.post2
Collecting torchvision==0.16.0
Collecting torchtext==0.16.0
Collecting torchaudio==2.1.0
Collecting torch>=1.4.0 (from accelerate==0.15.0)
  Downloading torch-2.1.0-cp310-cp310-manylinux1_x86_64.whl (670.2 MB)
     670.2/670.2 MB 827.5 kB/s eta 0:00:00
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=1.4.0->accelerate==0.15.0)
  Downloading nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
     731.7/731.7 MB 1.0 MB/s eta 0:00:00
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch>=1.4.0->accelerate==0.15.0)
  Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)
     410.6/410.6 MB 973.8 kB/s eta 0:00:00
[...]
Successfully installed accelerate-0.15.0 bitsandbytes-0.41.3.post2 diffusers-0.10.2
nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105
nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-8.9.2.26 nvidia-cufft-cu12-11.0.2.54
nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106
nvidia-nccl-cu12-2.18.1 nvidia-nvjitlink-cu12-12.4.99 nvidia-nvtx-cu12-12.1.105
tokenizers-0.13.3 torch-2.1.0 torchaudio-2.1.0 torchdata-0.7.0 torchtext-0.16.0
torchvision-0.16.0 transformers-4.26.0 triton-2.1.0
Collecting ftfy==6.1.1
Collecting einops==0.6.0
Collecting timm==0.6.12
Collecting fairscale==0.4.13
Installing build dependencies ... done
```

After this point it goes back to high download speeds again, but I've had this happen many times lately.
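On keeping downloads across sessions: Colab's VM disk is wiped when the session ends, so one common workaround is pointing pip's wheel cache at a mounted Google Drive folder via the `PIP_CACHE_DIR` environment variable. A minimal sketch (the directory path is a stand-in; in Colab you would first mount Drive with `from google.colab import drive; drive.mount('/content/drive')`, which this notebook does not do for you):

```python
import os

# Stand-in for a persistent location, e.g. a folder under /content/drive/MyDrive
CACHE_DIR = "/tmp/pip-cache-demo"

os.makedirs(CACHE_DIR, exist_ok=True)
os.environ["PIP_CACHE_DIR"] = CACHE_DIR  # pip reads this env var for its wheel cache

# Subsequent pip installs in this process tree now reuse cached wheels, e.g.:
# subprocess.run([sys.executable, "-m", "pip", "install", "torch==2.1.0"])
print(os.environ["PIP_CACHE_DIR"])
```

Note this only caches pip wheels; apt packages and model downloads would still be re-fetched each session.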

ArmyOfPun1776 commented 6 months ago

For some reason the nvidia downloads are stalling around 1mb/s so it takes forever to finish the install and I get this: Screenshot 2024-04-02 154450 Do you guys use no GPU to install everything, let it fail due to no GPU and then set GPU or do you just let it waste 20-40 minutes the first run? Also, is there any way to keep the downloaded files so you don't have to re-download them every time you start a new session?

🏭 Installing dependencies...

Cloning into '/content/kohya-trainer'... remote: Enumerating objects: 6262, done. remote: Counting objects: 100% (3063/3063), done. remote: Compressing objects: 100% (485/485), done. remote: Total 6262 (delta 2819), reused 2666 (delta 2577), pack-reused 3199 Receiving objects: 100% (6262/6262), 9.38 MiB | 15.74 MiB/s, done. Resolving deltas: 100% (4455/4455), done. HEAD is now at 9a67e0d Merge pull request #610 from lubobill1990/patch-1 45 packages can be upgraded. Run 'apt list --upgradable' to see them. The following additional packages will be installed: libaria2-0 libc-ares2 The following NEW packages will be installed: aria2 libaria2-0 libc-ares2 0 upgraded, 3 newly installed, 0 to remove and 45 not upgraded. Need to get 1,513 kB of archives. After this operation, 5,441 kB of additional disk space will be used. Selecting previously unselected package libc-ares2:amd64. (Reading database ... 121753 files and directories currently installed.) Preparing to unpack .../libc-ares2_1.18.1-1ubuntu0.22.04.3_amd64.deb ... Unpacking libc-ares2:amd64 (1.18.1-1ubuntu0.22.04.3) ... Selecting previously unselected package libaria2-0:amd64. Preparing to unpack .../libaria2-0_1.36.0-1_amd64.deb ... Unpacking libaria2-0:amd64 (1.36.0-1) ... Selecting previously unselected package aria2. Preparing to unpack .../aria2_1.36.0-1_amd64.deb ... Unpacking aria2 (1.36.0-1) ... Setting up libc-ares2:amd64 (1.18.1-1ubuntu0.22.04.3) ... Setting up libaria2-0:amd64 (1.36.0-1) ... Setting up aria2 (1.36.0-1) ... Processing triggers for man-db (2.10.2-1) ... Processing triggers for libc-bin (2.35-0ubuntu3.4) ... /sbin/ldconfig.real: /usr/local/lib/libtbb.so.12 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libtbbbind_2_5.so.3 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libtbbbind_2_0.so.3 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libtbbmalloc.so.2 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libtbbmalloc_proxy.so.2 is not a symbolic link

/sbin/ldconfig.real: /usr/local/lib/libtbbbind.so.3 is not a symbolic link

```
Collecting accelerate==0.15.0
  Downloading accelerate-0.15.0-py3-none-any.whl (191 kB)
     ━━━━━━━━━━━━━━━━━━━━ 191.5/191.5 kB 1.8 MB/s eta 0:00:00
Collecting diffusers==0.10.2
  Downloading diffusers-0.10.2-py3-none-any.whl (503 kB)
     ━━━━━━━━━━━━━━━━━━━━ 503.1/503.1 kB 8.6 MB/s eta 0:00:00
Collecting bitsandbytes==0.41.3.post2
  Downloading bitsandbytes-0.41.3.post2-py3-none-any.whl (92.6 MB)
     ━━━━━━━━━━━━━━━━━━━━ 92.6/92.6 MB 949.9 kB/s eta 0:00:00
[...]
Collecting torch>=1.4.0 (from accelerate==0.15.0)
  Downloading torch-2.1.0-cp310-cp310-manylinux1_x86_64.whl (670.2 MB)
     ━━━━━━━━━━━━━━━━━━━━ 670.2/670.2 MB 827.5 kB/s eta 0:00:00
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=1.4.0->accelerate==0.15.0)
  Downloading nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
     ━━━━━━━━━━━━━━━━━━━━ 731.7/731.7 MB 1.0 MB/s eta 0:00:00
Collecting nvidia-cublas-cu12==12.1.3.1 (from torch>=1.4.0->accelerate==0.15.0)
  Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)
     ━━━━━━━━━━━━━━━━━━━━ 410.6/410.6 MB 973.8 kB/s eta 0:00:00
Collecting nvidia-nccl-cu12==2.18.1 (from torch>=1.4.0->accelerate==0.15.0)
  Downloading nvidia_nccl_cu12-2.18.1-py3-none-manylinux1_x86_64.whl (209.8 MB)
     ━━━━━━━━━━━━━━━━━━━━ 209.8/209.8 MB 1.5 MB/s eta 0:00:00
[... many "Requirement already satisfied" lines and smaller downloads omitted ...]
Successfully installed accelerate-0.15.0 bitsandbytes-0.41.3.post2 diffusers-0.10.2 ... torch-2.1.0 torchaudio-2.1.0 torchdata-0.7.0 torchtext-0.16.0 torchvision-0.16.0 transformers-4.26.0 triton-2.1.0
[...]
Installing build dependencies ... done
```

After this point it goes back to high download speeds again, but I've had this happen many times lately.

I thought about doing that, but I'm not sure if it'll totally kill the runtime since you're disconnecting and reconnecting with a different runtime module.

I've noticed that at certain points in the day Colab is slow as ballz. I usually wait until late night here in the Philippines, when everyone in the US is at work, or super early morning, when everyone in the US is going to bed. Friday nights and Saturday and Sunday mornings seem to be pretty fast as well.

Good luck

heartbreakergaming commented 6 months ago

I've been training loras for a few days now, both 1.5 and XL, and it seems to work fine with the default settings.

I won't say you're doing something wrong if you encounter this issue, but there may be some unknown factor at play here.

Yeah, I'm not sure to be honest. I'm not changing any of the settings except the repeats; I keep my images multiplied by repeats between 200-300. I haven't tried any more lately as I've been working overtime, but one thing that was mentioned, which I also noticed, is that the loss rate is higher than it used to be. It wasn't something I ever paid that much attention to, but I'm not sure how much that affects things.

lucaswalkeryoung commented 6 months ago

Sorry - 200 or 300 repeats? With how many images? Unless my (admittedly intermediate) understanding is completely incorrect, that's at least an order of magnitude too high. More images > More epochs > More Repeats. Given 1 image = 1 step, more or less, 200-300 repeats means, what, 30 images at 300 repeats over 10 epochs, totaling 90,000 steps?

Try reducing your repeats to 5, your batch size to 1, your learning rate down by an order of magnitude, and then upping the epochs to meet somewhere around 4000-6000 steps. And, as per my own experience, use more images and not more repeats.
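To sanity-check that arithmetic, here's a minimal sketch of how these notebooks count steps (assuming the rough "1 image = 1 step" rule from the comment above; the function name is just for illustration):

```python
def total_steps(images, repeats, epochs, batch_size=1):
    # Each epoch passes over every image `repeats` times;
    # steps are counted per batch, so batch size divides the total.
    return images * repeats * epochs // batch_size

# 30 images at 300 repeats over 10 epochs, batch size 1:
print(total_steps(30, 300, 10))   # 90000 -- an order of magnitude too many
# The suggested alternative: more images, few repeats, more epochs:
print(total_steps(100, 5, 10))    # 5000 -- inside the 4000-6000 target range
```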

dmikey commented 6 months ago

Did 20-40 images with 10 repeats and 10 epochs (default) and it came out perfect. Probably training settings.

heartbreakergaming commented 5 months ago

Sorry - 200 or 300 repeats? With how many images? Unless my (admittedly intermediate) understanding is completely incorrect, that's at least an order of magnitude too high. More images > More epochs > More Repeats. Given 1 image = 1 step, more or less, 200-300 repeats means, what, 30 images at 300 repeats over 10 epochs, totaling 90,000 steps?

Try reducing your repeats to 5, your batch size to 1, your learning rate down by an order of magnitude, and then upping the epochs to meet somewhere around 4000-6000 steps. And, as per my own experience, use more images and not more repeats.

That's my bad, I mistyped. I meant that the images multiplied by repeats is between 200-300, so if I have 20 images I do around 10 repeats.

heartbreakergaming commented 5 months ago

Did 20-40 images with 10 repeats and 10 epochs (default) and it came out perfect. Probably training settings.

Yeah, a few people have reported it works fine for them, but I'm not sure what I'm doing wrong. I'm leaving everything at default settings. I did 30 images with 10 repeats and 10 epochs recently, and by epoch 4 the training started coming out fried, on AnyLoRA. I'm going to try animefull and see if that does any better.

ArmyOfPun1776 commented 5 months ago

Did 20-40 images with 10 repeats and 10 epochs (default) and it came out perfect. Probably training settings.

Yeah, a few people have reported it works fine for them, but I'm not sure what I'm doing wrong. I'm leaving everything at default settings. I did 30 images with 10 repeats and 10 epochs recently, and by epoch 4 the training started coming out fried, on AnyLoRA. I'm going to try animefull and see if that does any better.

Even if you don't have a lot of images: Try reducing the learning rate to 2e-4 and keep the text encoder at 1e-4. I train 40 images consistently with that learning rate and always get great results.

My settings if you want to give them a go:

Also, what kind of model are you trying to train? Anime, I'm assuming? Maybe try using a custom model to train on. I doubt that's the issue, but it's worth a try to test all variables. I haven't used the default models in forever; nothing against them, there are just better models out there. Then again, it seems like @hollowstrawberry has been training with the defaults, so that really couldn't be the issue.

heartbreakergaming commented 5 months ago

Did 20-40 images with 10 repeats and 10 epochs (default) and it came out perfect. Probably training settings.

Yeah, a few people have reported it works fine for them, but I'm not sure what I'm doing wrong. I'm leaving everything at default settings. I did 30 images with 10 repeats and 10 epochs recently, and by epoch 4 the training started coming out fried, on AnyLoRA. I'm going to try animefull and see if that does any better.

Even if you don't have a lot of images: Try reducing the learning rate to 2e-4 and keep the text encoder at 1e-4. I train 40 images consistently with that learning rate and always get great results.

My settings if you want to give them a go:

  • 40 images
  • 10 Repeats
  • 10 Epochs
  • 768 resolution (1024 if you're using the A100, though with only 20 images you might be able to get away with the T4 at 1024)
  • 2 Batch size
  • 2e-4 Learning Rate
  • 1e-4 Text Encoder learning rate
  • 64:32 for Network Dim and Alpha respectively
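For reference, a sketch of how settings like those above would map onto the kohya-ss sd-scripts trainer that these notebooks wrap (the paths and base model file here are placeholders, not from this thread; double-check flag names against your version of the scripts):

```bash
# Hypothetical invocation of kohya-ss sd-scripts' LoRA trainer.
# Paths and the model file are placeholders for illustration only.
accelerate launch train_network.py \
  --pretrained_model_name_or_path="/content/model.safetensors" \
  --train_data_dir="/content/dataset" \
  --output_dir="/content/output" \
  --resolution="768,768" \
  --train_batch_size=2 \
  --max_train_epochs=10 \
  --unet_lr=2e-4 \
  --text_encoder_lr=1e-4 \
  --network_module="networks.lora" \
  --network_dim=64 \
  --network_alpha=32
# The image count and repeats come from the dataset folder name,
# e.g. "10_charactername" for 10 repeats over the images inside it.
```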

Also, what kind of model are you trying to train? Anime, I'm assuming? Maybe try using a custom model to train on. I doubt that's the issue, but it's worth a try to test all variables. I haven't used the default models in forever; nothing against them, there are just better models out there. Then again, it seems like @hollowstrawberry has been training with the defaults, so that really couldn't be the issue.

So: default settings, but instead of using AnyLoRA as I have in the past, I used animefull, and it came out great. So maybe AnyLoRA itself has an issue? Another thing to note is that the loss the last few times I've trained was around 0.1; this time it was at 0.04-0.06.

heartbreakergaming commented 5 months ago

Did 20-40 images with 10 repeats and 10 epochs (default) and it came out perfect. Probably training settings.

Yeah, a few people have reported it works fine for them, but I'm not sure what I'm doing wrong. I'm leaving everything at default settings. I did 30 images with 10 repeats and 10 epochs recently, and by epoch 4 the training started coming out fried, on AnyLoRA. I'm going to try animefull and see if that does any better.

Even if you don't have a lot of images: Try reducing the learning rate to 2e-4 and keep the text encoder at 1e-4. I train 40 images consistently with that learning rate and always get great results.

My settings if you want to give them a go:

  • 40 images
  • 10 Repeats
  • 10 Epochs
  • 768 resolution (1024 if you're using the A100, though with only 20 images you might be able to get away with the T4 at 1024)
  • 2 Batch size
  • 2e-4 Learning Rate
  • 1e-4 Text Encoder learning rate
  • 64:32 for Network Dim and Alpha respectively

Also, what kind of model are you trying to train? Anime, I'm assuming? Maybe try using a custom model to train on. I doubt that's the issue, but it's worth a try to test all variables. I haven't used the default models in forever; nothing against them, there are just better models out there. Then again, it seems like @hollowstrawberry has been training with the defaults, so that really couldn't be the issue.

I've tried these settings and I didn't see a difference in terms of being deepfried; instead it simply didn't really learn the character and outfit that well.

hollowstrawberry commented 5 months ago

Are people still facing this issue as of today?