Closed MariemOualha closed 1 year ago
Hi, I think a $ symbol is missing in line 13. With the $ prefix, the shell substitutes the value of the defined environment variable. Here, ENT stands for python train_dist.py --num_process_per_node $NGPU, as in the original file.
If this is not clear, please post your full train_vae.sh here.
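To illustrate the point about the $ prefix, here is a minimal sketch of how the shell treats a variable name with and without it. The value NGPU=4 is only an assumption for the example, and the names expanded and literal are hypothetical; the real train_vae.sh may define things differently.

```shell
#!/bin/sh
# Assumed value for illustration only; the real script sets its own GPU count.
NGPU=4

# ENT holds the launcher command string, as described in the reply above.
ENT="python train_dist.py --num_process_per_node $NGPU"

# With the $ prefix, the shell substitutes the variable's value.
expanded="$ENT"

# Without the $ prefix, the shell sees only the literal word ENT.
literal="ENT"

echo "with \$:    $expanded"
echo "without \$: $literal"
```

If a line in the script starts with ENT instead of $ENT, the shell will try to run a command literally named ENT, which usually fails with a "command not found" error.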
@ZENGXH How long did you train it? Until epoch 8000?
On 4 A100 GPUs, it takes ~10 hours to train on the airplane category for 8000 epochs, with a batch size of 32 per GPU.
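For a rough sense of scale, the numbers above can be turned into a per-epoch time and a combined batch size. This is only back-of-the-envelope arithmetic: it assumes each GPU processes its own batch of 32 at every step, and the variable names are hypothetical, not taken from the training scripts.

```shell
#!/bin/sh
# Figures quoted in the reply above.
NGPU=4
BATCH_PER_GPU=32
EPOCHS=8000
TOTAL_HOURS=10   # approximate

# Assumption: each GPU handles its own batch per step, so batches add up.
GLOBAL_BATCH=$((NGPU * BATCH_PER_GPU))

# Average wall-clock time per epoch, in milliseconds (integer arithmetic).
MS_PER_EPOCH=$((TOTAL_HOURS * 3600 * 1000 / EPOCHS))

echo "combined batch size: $GLOBAL_BATCH"    # 128
echo "ms per epoch:        $MS_PER_EPOCH"    # 4500
```

So under these assumptions the run averages about 4.5 seconds per epoch; other hardware or batch sizes will differ.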
@MariemOualha for reference https://github.com/nv-tlabs/LION/issues/31
Hello @ZENGXH, I'm trying to train the VAE, but I get errors when I run train_vae.sh. Can you help me?