sitatec opened this issue 1 year ago
Thank you!
The DreamBooth training example in Diffusers seems to have a Flax version. My training scripts are based on `train_dreambooth.py`, so I think you can compare `train_dreambooth.py` and `train_dreambooth_flax.py` to find the changes you need.
https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth_flax.py
Ok, thanks for your fast answer. I will check the Diffusers scripts and update here if I manage to run it with Flax.
Since `train_network.py` uses `accelerate` to construct the training loop, it seems that adding `--tpu` and `--main_training_function` to `accelerate launch` can start a TPU training run. The launch command is shown in the screenshot.

To launch the TPU training, a small change seems to be necessary in `train_network.py`:
```python
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    ...
    args = parser.parse_args()
    train(args)
```

↓ ↓ ↓

```python
def main():
    parser = argparse.ArgumentParser()
    ...
    args = parser.parse_args()
    train(args)

if __name__ == '__main__':
    main()
```
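The launch invocation then looks roughly like this (a sketch; the script arguments are omitted here, see the full command later in the thread):

```bash
accelerate launch --tpu --main_training_function="main" train_network.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  ...
```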
However, TPU training is too slow compared with GPU, because many optimizations are unavailable in the TPU environment. 😅
Thanks for your suggestion @Isotr0py,
I already tried this. The only difference is that I used the `accelerate config` command and selected `TPU` as the device and `main` as the main function; I also added the main function. I was able to successfully generate the captions, and the latent caching step went well, but it got stuck when training started. Maybe I was passing some parameters to the training script that are not available on TPUs.
Since you were able to train on TPU, that gives me hope. I will retry and let you know the result.
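For reference, `accelerate config` writes those answers to a YAML file; for this setup it should contain something along these lines (a trimmed sketch, and the exact field values are assumptions):

```yaml
# ~/.cache/huggingface/accelerate/default_config.yaml (sketch)
compute_environment: LOCAL_MACHINE
distributed_type: TPU          # "TPU" selected as the device
main_training_function: main   # "main" selected as the main function
num_machines: 1
num_processes: 8               # one process per core on a v3-8
```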
I tried with only the parameters I saw in your screenshot, but it is not working. The script has been blocked here for almost an hour.
Here is the command I used (note: the training data dir contains caption files with the .txt extension):
```bash
accelerate launch --tpu --main_training_function="main" "train_network_tpu.py" \
  --enable_bucket \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --train_data_dir="/root/training_data/subject1" \
  --resolution="512,512" \
  --output_dir="/root/models/subject1" \
  --logging_dir="/root/models/log" \
  --network_alpha=128 \
  --save_model_as="safetensors" \
  --network_module="networks.lora" \
  --text_encoder_lr="5e-5" \
  --unet_lr="0.0001" \
  --network_dim=128 \
  --output_name="Arikytsya" \
  --learning_rate="0.0001" \
  --lr_scheduler="constant" \
  --train_batch_size=2 \
  --max_train_steps=1600 \
  --save_every_n_epochs="1" \
  --mixed_precision="bf16" \
  --save_precision="bf16" \
  --seed=42 \
  --clip_skip=2
```
@sitatec I used your command and TPU training works normally, so the launch command isn't the cause. Anyway, I tested TPU training successfully on commit f0ae7eea95 with a TPU VM v3-8. It seems that some newer commits may cause different errors in TPU training.
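If you want to reproduce that run, pinning the repo to the same commit before installing dependencies is straightforward:

```bash
# Check out the commit verified to train on a TPU VM v3-8
git checkout f0ae7eea95
```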
It still didn't work for me. Did you train this on Colab or GCP? Maybe there is a difference in the TPU drivers or something else related to the env.
If it worked for you on Colab, can you send me the link to the notebook?
I trained this on Kaggle instead, since I can't access Colab's TPU most of the time. My training notebook is at lora-train-tpu. I hope it can help you.
Hmm, that's what I thought, it was related to the env. I tested on Kaggle and it worked, but I couldn't find out which TPU software and version Kaggle uses. And as you said, training on TPU is too slow, so I'm not bothering with TPU for LoRA anymore. On GPU, I was able to train in less than 5 mins, but on TPU it's almost an hour.
Thank you very much for the effort @Isotr0py.
I was able to make the training a bit faster (~20 mins) by setting the train batch size to 4, since TPUs have a lot of memory. Training went well, but it got stuck when it started saving the model, even on Kaggle. Currently the code that saves the model in `ckpt` format looks like this:

```python
torch.save(state_dict, file)
```

I even tried this:

```python
import torch_xla.core.xla_model as xm
...
xm.save(state_dict, file)
```

but it still didn't work. It shows the message `save trained model to /kaggle/working/<<output_name>>.ckpt` but gets stuck there.
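One workaround that might be worth trying (a sketch, not verified against this codebase): copy every tensor in the state dict to host memory before calling `torch.save`, since serializing live XLA tensors from inside the training process is a common cause of this kind of hang.

```python
import torch

# Assumption: `state_dict` and `file` are the same objects as in the
# snippet above, with values that are XLA tensors on the TPU device.
# Detaching and moving them to CPU first means torch.save only has to
# serialize plain CPU tensors.
cpu_state_dict = {k: v.detach().cpu() for k, v in state_dict.items()}
torch.save(cpu_state_dict, file)
```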
What about `accelerator.save_model`?
https://huggingface.co/docs/accelerate/package_reference/accelerator#saving-and-loading
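A sketch of what that could look like, based on the linked docs (`network` is a stand-in for the trained LoRA module, the output paths are hypothetical, and `save_model` requires a reasonably recent Accelerate):

```python
# Block until all processes reach this point; relevant on TPU, where
# Accelerate spawns one process per core.
accelerator.wait_for_everyone()

# Option 1: let Accelerate unwrap and serialize the model for you.
accelerator.save_model(network, "/kaggle/working/lora_out")

# Option 2: unwrap manually and save the raw state dict.
unwrapped = accelerator.unwrap_model(network)
accelerator.save(unwrapped.state_dict(), "/kaggle/working/lora_out.ckpt")
```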
I've been trying to run sdxl_train.py on TPU but I keep getting this error:

```
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /usr/local/bin/accelerate:8 in
```

Any idea how to fix this? It was using the native trainer colab.
Hi, first of all thanks for this amazing work 👍. Is it possible to run the `train_network.py` script on a TPU? I actually tried, but it's not working. I even removed the `<xx>.to("cuda")` calls; now I'm not seeing any error, but the training gets stuck after it finishes caching the latents. How can I make this work on TPU?
PS: I'm a software engineer, but I'm new to the machine learning world.
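In place of the hard-coded `.to("cuda")` calls, a device-agnostic pattern is to ask Accelerate for the current device (a minimal sketch; `model` stands in for whichever module was being moved):

```python
import torch.nn as nn
from accelerate import Accelerator

# Accelerator picks the right device automatically:
# CUDA on a GPU machine, XLA on a TPU, CPU otherwise.
accelerator = Accelerator()

# `model` is a stand-in for whichever module was hard-coded to "cuda".
model = nn.Linear(4, 4).to(accelerator.device)
print(accelerator.device)
```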