kohya-ss / sd-scripts

Apache License 2.0

TPU support #229

Open sitatec opened 1 year ago

sitatec commented 1 year ago

Hi, first of all thanks for this amazing work 👍. Is it possible to run the train_network.py script on a TPU?

I actually tried, but it's not working. I even removed the <xx>.to("cuda") calls; now I'm not seeing any error, but the training gets stuck after it finishes caching the latents.
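
For reference, the device-agnostic form with Accelerate that I assume the script would need looks roughly like this (a sketch only; unet here is just a stand-in for whichever module the script was placing on the GPU):

import torch
from accelerate import Accelerator

accelerator = Accelerator()
device = accelerator.device   # XLA device under `accelerate launch --tpu`, "cuda" on a GPU machine, "cpu" otherwise

unet = torch.nn.Linear(4, 4)  # placeholder for the real module the script moves
unet.to(device)               # instead of hard-coding unet.to("cuda")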

How can I make this work on TPU?

PS: I'm a software engineer, but I'm new to the machine learning world.

kohya-ss commented 1 year ago

Thank you!

The DreamBooth training example in Diffusers seems to have a Flax version. My training scripts are based on train_dreambooth.py, so I think you can compare train_dreambooth.py and train_dreambooth_flax.py and find the changes you need. https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth_flax.py

sitatec commented 1 year ago

OK, thanks for your fast answer. I will check the Diffusers scripts and update here if I manage to run it with Flax.

Isotr0py commented 1 year ago

Since train_network.py uses accelerate to construct the training loop, it seems that adding --tpu and --main_training_function to accelerate launch can start a TPU training run. The launch command is shown in the screenshot.

QQ image 20230228205538 (screenshot of the launch command)

To launch the TPU training, some changes seem to be necessary in train_network.py:

if __name__ == '__main__':
  parser = argparse.ArgumentParser()
  ...
  args = parser.parse_args()
  train(args)

↓ ↓ ↓

def main():
  parser = argparse.ArgumentParser()
  ...
  args = parser.parse_args()
  train(args)

if __name__ == '__main__':
  main()

However, TPU training is too slow compared with GPU training, because many optimizations are unavailable in the TPU environment. 😅
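
As a quick sanity check that the launcher really picked up the TPU backend, something like this can be run under the same accelerate launch --tpu command (a minimal sketch, independent of the training scripts):

from accelerate import Accelerator

accelerator = Accelerator()
print(accelerator.device)            # expected to be an XLA device such as xla:0 on TPU
print(accelerator.distributed_type)  # expected to report TPU (called XLA in newer Accelerate versions)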

sitatec commented 1 year ago

Thanks for your suggestion @Isotr0py, I already tried this; the only difference is that I used the accelerate config command and selected TPU as the device and main as the main function, and I also added the main function. I was able to successfully generate the captions, and the latent caching step went well, but it got stuck when training started. Maybe I was passing some parameters to the training script that are not available on TPUs. Since you were able to train on a TPU, that gives me hope; I will retry and let you know the result.

sitatec commented 1 year ago

I tried with only the parameters I saw in your screenshot, but it is not working. The script has been stuck here for almost an hour:

Screenshot 2023-03-01 at 20 21 38

Here is the command I used. NOTE: the training data directory contains caption files with the .txt extension.

accelerate launch --tpu --main_training_function="main" "train_network_tpu.py" \
  --enable_bucket \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --train_data_dir="/root/training_data/subject1" \
  --resolution="512,512" \
  --output_dir="/root/models/subject1" \
  --logging_dir="/root/models/log" \
  --network_alpha=128 \
  --save_model_as="safetensors" \
  --network_module="networks.lora" \
  --text_encoder_lr="5e-5" \
  --unet_lr="0.0001" \
  --network_dim=128 \
  --output_name="Arikytsya" \
  --learning_rate="0.0001" \
  --lr_scheduler="constant" \
  --train_batch_size=2 \
  --max_train_steps=1600 \
  --save_every_n_epochs="1" \
  --mixed_precision="bf16" \
  --save_precision="bf16" \
  --seed=42 \
  --clip_skip=2 

Isotr0py commented 1 year ago

@sitatec I used your command and the TPU training works normally, so the launch command isn't the cause. Anyway, I tested TPU training successfully on commit f0ae7eea95 with a TPU VM v3-8. It seems that some newer commits may cause different errors in TPU training.

sitatec commented 1 year ago

It still didn't work for me. Did you train this on Colab or GCP? Maybe there is a difference in the TPU drivers or something else related to the environment.

If it worked for you on Colab, can you send me the link to the notebook?

Isotr0py commented 1 year ago

I trained this on Kaggle instead, since I can't access Colab's TPU most of the time. My training notebook is at lora-train-tpu. I hope it can help you.

sitatec commented 1 year ago

Hmm, that's what I thought: it was related to the environment. I tested on Kaggle and it worked, but I couldn't find out which TPU software and version Kaggle uses. And as you said, training on a TPU is too slow, so I'm not bothering with TPUs for LoRA anymore. On a GPU, I was able to train in less than 5 minutes, but on a TPU it takes almost an hour:

Screenshot 2023-03-05 at 11 29 11

sitatec commented 1 year ago

Thank you very much for the effort @Isotr0py.

sitatec commented 1 year ago

I was able to make the training a little bit faster (~20 mins) by setting the train batch size to 4, since TPUs have a lot of memory. Training went well, but it got stuck when it started saving the model, even on Kaggle. Currently, the code that saves the model looks like this for the ckpt format:

torch.save(state_dict, file)

I even tried this:

import torch_xla.core.xla_model as xm
...
xm.save(state_dict, file)

but it still didn't work. It shows the message save trained model to /kaggle/working/<<output_name>>.ckpt but gets stuck there.

Screenshot 2023-03-08 at 18 44 38
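
A pattern that is often suggested for this kind of hang, which I have not verified on this script, is to synchronize every process first and write only from the main process, with the tensors pulled back to CPU (a sketch reusing the accelerator, state_dict, and file variables from the training script):

import torch

# Make sure all TPU processes reach the same point before anyone tries to save.
accelerator.wait_for_everyone()

if accelerator.is_main_process:
  # Pull tensors back to CPU so torch.save never serializes XLA tensors directly.
  cpu_state_dict = {k: v.detach().to("cpu") for k, v in state_dict.items()}
  torch.save(cpu_state_dict, file)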

drimeF0 commented 5 months ago

What about accelerator.save_model? https://huggingface.co/docs/accelerate/package_reference/accelerator#saving-and-loading
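
A hedged sketch of how that call could look, assuming the accelerator object and the trained network module from the training script (the output directory is just an example, and a recent Accelerate release with save_model is assumed):

# save_model() unwraps the module, gathers its state dict, and writes it
# (safetensors by default in recent Accelerate versions).
accelerator.wait_for_everyone()
accelerator.save_model(network, "/kaggle/working/lora_out")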

domochevisk commented 4 months ago

I've been trying to run sdxl_train.py on a TPU, but I keep getting this error:

Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 914, in launch_command
    tpu_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 659, in tpu_launcher
    import torch_xla.distributed.xla_multiprocessing as xmp
  File "/usr/local/lib/python3.10/dist-packages/torch_xla/__init__.py", line 142, in <module>
    import _XLAC
ImportError: /usr/local/lib/python3.10/dist-packages/_XLAC.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c109TupleTypeC1ESt6vectorINS_4Type24SingletonOrSharedTypePtrIS2_EESaIS4_EESt8optionalINS_13QualifiedNameEESt10shared_ptrINS_14FunctionSchemaEE

Any idea how to fix this? This was using the native trainer Colab.