Closed: osushilover closed this issue 2 years ago
You should have four files, for example:
Then you need to add to the command:
`--which_epoch 61000`
(61000 is from my example; add your own epoch number)
`--continue_train True`
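Put together, the resume command would look roughly like this (a minimal sketch - the experiment name, dataset path, checkpoint directory, and batch size are placeholders; substitute your own values, and 61000 is just my example epoch):

```
# Resume training from a saved checkpoint (all paths/names below are examples - replace with yours)
%cd /content/SimSwap
!python -W ignore train.py \
    --name simswap_224_test \
    --gpu_ids 0 \
    --dataset /content/SimSwap/datasets/vggface2_crop_arcfacealign_224 \
    --checkpoints_dir /content/drive/MyDrive/SimSwap/checkpoints \
    --load_pretrain /content/drive/MyDrive/SimSwap/checkpoints/simswap_224_test \
    --which_epoch 61000 \
    --continue_train True \
    --batchSize 4
```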
Thanks for your very clear guidance! I will add the option and resume training.
However, it took me about 10 hours to iterate to 60,000 epochs, even using V100 on colab. If anyone is willing to share their trained models on model zoo, I think it would be faster for me to resume training with them. Or I can take that role. Please let me know what you think, @netrunner-exe .
What batch_size did you use for training? If you have Colab Pro and a Tesla V100 you can try a higher batch_size. If you want, you can mail me at netrunner.exe@gmail.com and I'll write you a couple of good tricks for training in Google Colab :)
Training under Colab Pro's high-memory runtime only worked with a batch size of 4, even on the Tesla V100. Whenever I increased the batch size I got "RuntimeError: CUDA out of memory...", so training with the recommended batch size of 16 is impossible.
I hope I can be of some help to you guys. I will get back to you soon, @netrunner-exe . Thank you.
What dataset do you use and what size do you train? It's strange - I managed to run the code with batchSize 22 on a Tesla T4 in Google Colab and 17 on a Tesla K80. Are you sure you have a GPU enabled in the settings? Try running `!nvidia-smi --query-gpu=gpu_name,driver_version,memory.total --format=csv` and post the output.
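If it helps, here is a quick sanity-check cell (a small sketch; it assumes only that PyTorch is installed, which SimSwap needs anyway):

```
# Check which GPU Colab assigned and whether PyTorch can see it
!nvidia-smi --query-gpu=gpu_name,driver_version,memory.total --format=csv

import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```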
That's great! But here's where I stand.
> What dataset do you use and what size do you train?
Indeed! It's 512! It's possible that the dataset I'm dealing with is simply too large - that must be it. And thank you for sharing the assets.
Maybe training 512 with a batch size larger than 8 is not possible even on a V100. For serious training this dataset is small; it is best to train on the full VGGFace2, of course. Unfortunately I don't have a cropped and aligned 512 VGGFace2. Have you tried training on the 224 dataset provided in the readme?
I didn't know that you don't have a cropped and aligned 512 VGGFace2. I haven't tried 224, since I expect a high-quality face swap, but shall I make it in 224 for you all?
You probably misunderstood me. The datasets that I posted are cropped and aligned to 512 and 224, but I now realize that their size (around 16,000 images vs. around 600,000 in the full VGGFace2) is very small for normal training. Maybe I'll delete them later, because I don't see much point in training with them now. Perhaps someone here will share their experience of training 512, or the full version of a cropped and aligned 512 VGGFace2.
I want to share my experience of training in Google Colab. Some tips that I used:
- You must have at least 10 GB of free space on Google Drive to save checkpoints.
- Open this link (it's the official 224 dataset from @neuralchen) and create a shortcut to it on your Google Drive - this will not take up any space, but it will let you copy the dataset from Gdrive to Colab without downloading it with wget.
- Then, after installing dependencies, I use this code:
```
# Mount Gdrive
from google.colab import drive
drive.mount('/content/drive')

# Copy from Gdrive and unpack the dataset
%cd /content/SimSwap
!mkdir /content/SimSwap/datasets &> /dev/null
!tar -xzf "/content/drive/MyDrive/vggface2_crop_arcfacealign_224.tar" --directory ./datasets
```
New training 224:
```
# Train 224 model
# name - Name of the experiment. It decides where to store samples and models
# dataset - Path to the face swapping dataset
# sample_freq - Frequency for sampling
# model_freq - Frequency for saving the model
# checkpoints_dir - Models are saved here
%cd /content/SimSwap
name = "simswap_224_test"
dataset = "/content/SimSwap/datasets/vggface2_crop_arcfacealign_224"
sample_freq = 1000
model_freq = 1000
batchSize = 17
checkpoints_dir = "/content/drive/MyDrive/SimSwap/checkpoints"

!python -W ignore train.py --name {name} --gpu_ids 0 --use_tensorboard False --batchSize {batchSize} --model_freq {model_freq} --sample_freq {sample_freq} --checkpoints_dir {checkpoints_dir} --dataset {dataset} --Gdeep False
```
Continue train 224:
```
# Continue train 224 model
# name - Name of the experiment. It decides where to store samples and models
# dataset - Path to the face swapping dataset
# load_pretrain - Load the pretrained model from the specified location
# sample_freq - Frequency for sampling
# model_freq - Frequency for saving the model
# checkpoints_dir - Models are saved here
# continue_train - Continue training: load the latest model
%cd /content/SimSwap
name = "simswap_224_test"
dataset = "/content/SimSwap/datasets/vggface2_crop_arcfacealign_224"
batchSize = 17
which_epoch = 57000
sample_freq = 1000
model_freq = 1000
checkpoints_dir = "/content/drive/MyDrive/SimSwap/checkpoints"
load_pretrain = "/content/drive/MyDrive/SimSwap/checkpoints/simswap_224_test"

!python -W ignore train.py --name {name} --gpu_ids 0 --which_epoch {which_epoch} --batchSize {batchSize} --load_pretrain {load_pretrain} --use_tensorboard False --continue_train True --model_freq {model_freq} --sample_freq {sample_freq} --checkpoints_dir {checkpoints_dir} --dataset {dataset} --Gdeep False
```
If you hit `Cuda out of memory`, try a lower `batchSize` (20, 15, 8, and so on) - you can experiment with the value. On a Tesla T4, 224 training works with `batchSize` 22, and on a K80 with 17. If you use a V100 you can try a higher value.
All checkpoints will be stored on Google Drive, so even if the session ends they will remain and you can continue training.
When you continue training, you need to change the `which_epoch` value to yours, and the file `iter.txt` must also exist in the checkpoint folder - the last saved epoch is written to it.
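To check what the last saved step was before resuming, you can simply print `iter.txt` (the path below assumes the `checkpoints_dir` and experiment `name` from the cells above - adjust it to yours):

```
# Show the last saved iteration recorded by the training script
# (path assumes the checkpoints_dir/name used above)
!cat /content/drive/MyDrive/SimSwap/checkpoints/simswap_224_test/iter.txt
```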
And most importantly, do not forget to keep track of your Google Drive so that it does not run out of space! Periodically delete old checkpoints once you are sure the new ones have been saved, and empty the trash. If you don't, Google Drive runs out of space and checkpoints will no longer be saved to it.
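A rough sketch of the kind of cleanup I mean (the folder path and the `<step>_*.pth` naming are assumptions - check what your run actually saves, and make sure the newest checkpoints are complete before deleting anything):

```
# Delete all but the newest few numbered checkpoints to free Google Drive space.
# The directory and "57000_net_G.pth"-style file naming are assumptions - verify first!
import os, re, glob

ckpt_dir = "/content/drive/MyDrive/SimSwap/checkpoints/simswap_224_test"  # example path
keep_last = 2  # number of most recent saved steps to keep

# Collect the step numbers found in numbered checkpoint files
steps = set()
for f in glob.glob(os.path.join(ckpt_dir, "*.pth")):
    m = re.match(r"(\d+)_", os.path.basename(f))
    if m:
        steps.add(int(m.group(1)))

# Remove every file belonging to the older steps
for step in sorted(steps)[:-keep_last]:
    for f in glob.glob(os.path.join(ckpt_dir, f"{step}_*")):
        os.remove(f)
        print("removed", f)
```

Remember that Drive moves deleted files to the trash, so empty it afterwards or the space is not actually freed.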
If anyone needs it, here's the [Colab notebook](https://colab.research.google.com/drive/19JbqRGDa4iqBuwl20pnXzlp0ZiXcAkwM?usp=sharing) that I use.
I admire you, @netrunner-exe.
Thank you! It would be really great if you shared your result when you finish training :) Unfortunately, on free Google Colab with three accounts, I only reached 60,000 iterations of 224 training in about a week.
Is there a link for the 512 .tar file? I'm confused about which files are required. The Google Drive link for 512 is broken up into about 5 files: 4 sequentially numbered zip files and one standalone zip file. Can you explain how to train for 512 given the exact same setup you gave for 224? I get everything except what to switch out for the 512 equivalent of the "official 224 dataset from @neuralchen".
What link are you talking about? In the answer that you reposted there is no mention of training 512, etc. If you want a proper answer, please formulate the question properly.
@netrunner-exe There is some confusion on my part, which is probably why you are confused.
Try adding a shortcut to this folder on your Google Drive, mount it, and unpack the archive. If you don't have at least 100 GB of free space in Colab, I don't see any point in trying to unpack them (<1,000,000 files) on the Colab disk.
The full version of the 512 dataset that could be used for training in Colab has not yet been posted. Are you using Colab Pro or the free tier?
I'm using Colab Pro. I'm going to let it sit and train the 224 model for now, but I was curious about how to handle 512. I'm guessing that if I get it to 600,000 iterations on 224, that will be more than the checkpoint they publicly released? @netrunner-exe
What batchSize are you using?
23, sitting at 15K right now.
Did you try a higher batchSize, or do you get CUDA out of memory when you set it higher than 23?
It's redlining as-is in terms of GPU memory. This is my first pass at it.
If you want, write your email and I will share my checkpoint - I'm at 81,000 with batch size 22. You can continue from it. To be honest, I'm already sick of training with three accounts on free Colab :)
EDIT: I just made an unofficial Discord for this: https://discord.gg/6aXkTUqr3B You can add me on Discord (handle removed).
```
( step: 266199, ) G_Loss: 0.798 G_ID: 0.149 G_Rec: 1.728 G_feat_match: 1.402 D_fake: 0.469 D_real: 0.214 D_loss: 0.683
( step: 266399, ) G_Loss: 1.620 G_ID: 0.148 G_Rec: 1.570 G_feat_match: 1.389 D_fake: 0.152 D_real: 0.330 D_loss: 0.482
( step: 266599, ) G_Loss: 1.035 G_ID: 0.158 G_Rec: 1.300 G_feat_match: 1.395 D_fake: 0.237 D_real: 0.688 D_loss: 0.926
( step: 266799, ) G_Loss: 0.986 G_ID: 0.169 G_Rec: 1.261 G_feat_match: 1.311 D_fake: 0.144 D_real: 0.351 D_loss: 0.494
( step: 266999, ) G_Loss: 1.058 G_ID: 0.171 G_Rec: 1.434 G_feat_match: 1.441 D_fake: 0.332 D_real: 1.001 D_loss: 1.334
Save test data
saving the latest model (steps 267000)
```
Hello. In your situation I would save a backup of the checkpoint in a separate folder when it reaches 300,000 - 320,000 steps, then (based on the developers' recommendations) continue training up to 500,000 steps and check the results.
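For example, something along these lines in a Colab cell (both paths are just placeholders based on the checkpoint layout used earlier in this thread - replace them with your own experiment name and directories):

```
# Copy the current checkpoint folder to a separate backup folder on Google Drive
# (both paths are examples - substitute your own)
!cp -r "/content/drive/MyDrive/SimSwap/checkpoints/simswap_224_test" \
       "/content/drive/MyDrive/SimSwap/checkpoints/simswap_224_test_backup_320k"
```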
@osushilover Were you able to successfully train the 512 model? Could you share your results with us please?
I haven't trained a 512 model. I have trained a 224 model, though, and the result was quite bad. Please see my results in #242.
I wrapped up the training I was running on Colab, and I have a backup of the .pth files generated at that time. How can I resume the training from where it left off?