neuralchen / SimSwap

An arbitrary face-swapping framework on images and videos with one single trained model!

How to RESUME training #252

Closed · osushilover closed this issue 2 years ago

osushilover commented 2 years ago

I stopped the training I was running on Colab, and I have a backup of the .pth files generated at that point. How can I resume training from where it left off?

netrunner-exe commented 2 years ago

You should have four files in your checkpoint folder, for example: *(screenshot of the checkpoint files)*

Then you need to add to the command: `--which_epoch 61000` (that is my example - add your own number) and `--continue_train True`.
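
For reference, here is a minimal sketch of a full resume command, assuming the backed-up checkpoints were copied back into the experiment folder. The flags mirror the continue-train command shared later in this thread; the experiment name, paths, batch size and epoch number are placeholders to replace with your own:

```
# Sketch only - adjust name, paths, batchSize and epoch number to your setup.
%cd /content/SimSwap
!python -W ignore train.py --name simswap_224_test --gpu_ids 0 \
    --dataset /content/SimSwap/datasets/vggface2_crop_arcfacealign_224 \
    --checkpoints_dir /content/drive/MyDrive/SimSwap/checkpoints \
    --load_pretrain /content/drive/MyDrive/SimSwap/checkpoints/simswap_224_test \
    --which_epoch 61000 --continue_train True \
    --batchSize 4 --use_tensorboard False
```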

osushilover commented 2 years ago

Thanks for your very clear guidance! I will add the option and resume training.

However, it took me about 10 hours to reach 60,000 iterations, even using a V100 on Colab. If anyone is willing to share their trained models on the model zoo, I think it would be faster for me to resume training from those. Or I can take on that role myself. Please let me know what you think, @netrunner-exe.

netrunner-exe commented 2 years ago

What batch size did you use for training? If you have Colab Pro and a Tesla V100 you can try a higher batch size. If you want, you can mail me at netrunner.exe@gmail.com - I'll send you a couple of good tricks for training in Google Colab :)

osushilover commented 2 years ago

Training under Colab Pro's high-memory runtime was limited to a batch size of 4, even with the Tesla V100. Whenever I increased the batch size I got "RuntimeError: CUDA out of memory...", so training with the recommended batch size of 16 is impossible.

I hope I can be of some help to you guys. I will get back to you soon, @netrunner-exe . Thank you.

netrunner-exe commented 2 years ago

What dataset are you using, and at what size are you training? It's strange - I managed to run the code with batchSize 22 on a Tesla T4 in Google Colab and 17 on a Tesla K80. Are you sure you have a GPU enabled in the settings? Try running `!nvidia-smi --query-gpu=gpu_name,driver_version,memory.total --format=csv` and post the output.
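
If reading nvidia-smi output isn't convenient, a quick check from Python should report the same thing; this only assumes the PyTorch install that SimSwap already requires:

```
import torch

# Prints the detected GPU and its total memory. If no CUDA device is visible,
# the Colab runtime is not actually using a GPU - check Runtime > Change runtime type.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name, f"{props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA device visible")
```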

osushilover commented 2 years ago

That's great! But this is where I stand: *(screenshot: V100, high-RAM runtime)*

netrunner-exe commented 2 years ago

What dataset are you using, and at what resolution are you training?

osushilover commented 2 years ago

Indeed, it's 512! It's possible that the dataset I'm dealing with is simply too large. This is it. And thank you for sharing the assets.

netrunner-exe commented 2 years ago

Maybe training 512 with a batch size larger than 8 is not possible even on a V100. For serious training this dataset is small; it is best to train on the full VGGFace2, of course. Unfortunately I don't have a cropped and aligned 512 VGGFace2. Have you tried training on the 224 dataset provided in the readme?

osushilover commented 2 years ago

I didn't know that you don't have a cropped and aligned 512 VGGFace2. I haven't tried 224, since I'm hoping for a high-quality face swap, but shall I make a 224 version for you all?

netrunner-exe commented 2 years ago

Probably you misunderstood me. The datasets that I posted are cropped and aligned to 512 and 224, but I now realize that their size (around 16,000 images versus roughly 600,000 in the full VGGFace2) is far too small for proper training. Maybe I'll delete them later, because now I don't see much point in training with them. Perhaps someone here will share their experience of training 512, or the full cropped and aligned 512 VGGFace2.

netrunner-exe commented 2 years ago

I want to share my experience of training in Google Colab. Some tips that I used:

  1. You must have at least 10 GB of free space on Google Drive to save checkpoints.

  2. Open this link (it's the official 224 dataset from @neuralchen) and create a shortcut to it on your Google Drive - this takes up no space on the drive but lets you copy the dataset from Drive into Colab instead of downloading it with wget. *(screenshot)*

  3. Then, after installing the dependencies, I use this code:

```
# Mount Gdrive
from google.colab import drive
drive.mount('/content/drive')

# Copy from Gdrive
%cd /content/SimSwap

!mkdir /content/SimSwap/datasets &> /dev/null
!tar -xzf "/content/drive/MyDrive/vggface2_crop_arcfacealign_224.tar" --directory ./datasets
```

New training 224:

```
# Train 224 model
# name - Name of the experiment. It decides where to store samples and models
# dataset - Path to the face swapping dataset
# sample_freq - Frequency for sampling
# model_freq - Frequency for saving the model
# checkpoints_dir - Models are saved here
%cd /content/SimSwap

name = "simswap_224_test"
dataset = "/content/SimSwap/datasets/vggface2_crop_arcfacealign_224"
sample_freq = 1000
model_freq = 1000
batchSize = 17

checkpoints_dir = "/content/drive/MyDrive/SimSwap/checkpoints"

!python -W ignore train.py --name {name} --gpu_ids 0 --use_tensorboard False --batchSize {batchSize} --model_freq {model_freq} --sample_freq {sample_freq} --checkpoints_dir {checkpoints_dir} --dataset {dataset} --Gdeep False
```

Continue train 224:

```
# Continue train 224 model
# name - Name of the experiment. It decides where to store samples and models
# dataset - Path to the face swapping dataset
# load_pretrain - Load the pretrained model from the specified location
# sample_freq - Frequency for sampling
# model_freq - Frequency for saving the model
# checkpoints_dir - Models are saved here
# continue_train - Continue training: load the latest model
%cd /content/SimSwap

name = "simswap_224_test"
dataset = "/content/SimSwap/datasets/vggface2_crop_arcfacealign_224"
batchSize = 17
which_epoch = 57000
sample_freq = 1000
model_freq = 1000
checkpoints_dir = "/content/drive/MyDrive/SimSwap/checkpoints"
load_pretrain = "/content/drive/MyDrive/SimSwap/checkpoints/simswap_224_test"

!python -W ignore train.py --name {name} --gpu_ids 0 --which_epoch {which_epoch} --batchSize {batchSize} --load_pretrain {load_pretrain} --use_tensorboard False --continue_train True --model_freq {model_freq} --sample_freq {sample_freq} --checkpoints_dir {checkpoints_dir} --dataset {dataset} --Gdeep False
```

If you catch `CUDA out of memory`, try a lower `batchSize` (20, 15, 8, and so on); you can experiment with the value. Training 224 works with `batchSize` 22 on a Tesla T4 and 17 on a K80. If you use a V100 you can try a higher value.

All checkpoints will be stored in Google Drive, so even if the session ends they will remain and you can continue training.
When you continue training, you need to change the `which_epoch` value to your latest saved step, and the file `iter.txt` must also exist in the checkpoint folder - the last saved epoch is written to it.

And most importantly - keep an eye on Google Drive so that it does not run out of space! Periodically delete old checkpoints once you are sure that new ones have been saved, and empty the trash (see the cleanup sketch below). If you don't, Google Drive will run out of space and checkpoints will no longer be saved to it.
If anyone needs it, here's the [Colab notebook](https://colab.research.google.com/drive/19JbqRGDa4iqBuwl20pnXzlp0ZiXcAkwM?usp=sharing) I use.
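
For the Drive-space tip above, here is a minimal cleanup sketch. The folder path assumes the `checkpoints_dir`/`name` combination used in the commands above, and it simply keeps the most recently written .pth files, so run it with `dry_run = True` first and check what it would delete:

```
import os, glob

# Hypothetical path - adjust to your own checkpoints_dir/name.
ckpt_dir = "/content/drive/MyDrive/SimSwap/checkpoints/simswap_224_test"
keep = 4        # how many of the newest .pth files to keep
dry_run = True  # set to False to actually delete

pths = sorted(glob.glob(os.path.join(ckpt_dir, "*.pth")), key=os.path.getmtime)
for old in pths[:-keep]:
    print("would remove" if dry_run else "removing", old)
    if not dry_run:
        os.remove(old)

# Drive moves deleted files to the trash, so empty the trash afterwards,
# otherwise the space is not actually freed.
```
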
osushilover commented 2 years ago

I admire you, @netrunner-exe.

netrunner-exe commented 2 years ago

Thank you! It would be really great if you shared your result when you finish the training :) Unfortunately, on free Google Colab with three accounts, I only reached about 60,000 iterations of 224 training in roughly a week.

nonlin commented 2 years ago

Is there a link for the 512 .tar file? I'm confused about which files are required. The Google Drive link for 512 is broken up into about 5 files: 4 sequentially numbered zip files and one standalone zip file. Can you explain how to train for 512 given the exact same setup you gave for 224? I understand everything except what to switch in as the 512 equivalent of "the official 224 dataset from @neuralchen".

netrunner-exe commented 2 years ago

What link are you talking about? The answer of mine that you reposted says nothing about training 512. If you want a proper answer, please phrase the question more precisely.

nonlin commented 2 years ago

@netrunner-exe There is some confusion on my part, which is probably why you are confused.

  1. There is a .tar file for 224: https://drive.google.com/file/d/19pWvdEHS-CEG6tW3PdxdtZ5QEymVjImc/view
  2. I can't find a 512 version, but I think it is this: https://drive.google.com/drive/folders/1ZHy7jrd6cGb2lUa4qYugXe41G_Ef9Ibw
  3. If that is the 512 equivalent, do you know how to combine or extract those files, or copy them into a Colab project such as the one you provided, so that one can train for 512 on Google Colab? If that isn't it, where would it be, or how is it created?
netrunner-exe commented 2 years ago

Try adding a shortcut to this folder on your Google Drive, mount it, and unpack the archive. If you don't have at least 100 GB of free space in Colab, I don't see much point in trying to unpack them (under 1,000,000 files) onto the Colab disk.
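
If those five files turn out to be a standard split zip archive (that is, numbered .z01 - .z04 parts plus one main .zip; the exact filenames below are assumptions), one way to reassemble and extract them after adding the Drive shortcut would be:

```
%cd /content/SimSwap/datasets

# Merge the split archive into a single zip, then extract it.
# Filenames are placeholders - use whatever the shared Drive folder actually contains.
!zip -s 0 "/content/drive/MyDrive/vggface2_crop_512.zip" --out vggface2_crop_512_full.zip
!unzip -q vggface2_crop_512_full.zip -d ./vggface2_crop_512
```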

netrunner-exe commented 2 years ago

The full version of the 512 dataset that could be used for training in Colab has not yet been posted. Do you use Colab Pro or the free tier?

nonlin commented 2 years ago

I'm using Colab Pro. I'm going to let it sit and train 224 for now, but I was curious how to handle 512. I'm guessing that if I get it to 600,000 iterations on 224, that will be further along than the checkpoint they publicly released? @netrunner-exe

netrunner-exe commented 2 years ago

What batch size are you using?

nonlin commented 2 years ago

23, sitting at 15K iterations right now.

netrunner-exe commented 2 years ago

Did you try a higher batch size, or do you get CUDA out of memory when you set it higher than 23?

nonlin commented 2 years ago

It's redlining GPU memory as it is. This is my first pass at it.

netrunner-exe commented 2 years ago

If you want, share your email and I will send you my checkpoint - I'm at 81,000 iterations with batch size 22, and you can continue from there. To be honest, I'm already sick of training with three accounts on free Colab :)

nonlin commented 2 years ago

EDIT: I just made an unofficial Discord for this: https://discord.gg/6aXkTUqr3B. You can add me on Discord (removed).

osushilover commented 2 years ago

I have been training for a few days on Colab Pro+ using a Tesla V100 with a batch size of 23. I want to create the highest quality face swap possible, but when should I stop the training? Would these loss values and preview images be helpful?

```
( step: 266199, ) G_Loss: 0.798 G_ID: 0.149 G_Rec: 1.728 G_feat_match: 1.402 D_fake: 0.469 D_real: 0.214 D_loss: 0.683
( step: 266399, ) G_Loss: 1.620 G_ID: 0.148 G_Rec: 1.570 G_feat_match: 1.389 D_fake: 0.152 D_real: 0.330 D_loss: 0.482
( step: 266599, ) G_Loss: 1.035 G_ID: 0.158 G_Rec: 1.300 G_feat_match: 1.395 D_fake: 0.237 D_real: 0.688 D_loss: 0.926
( step: 266799, ) G_Loss: 0.986 G_ID: 0.169 G_Rec: 1.261 G_feat_match: 1.311 D_fake: 0.144 D_real: 0.351 D_loss: 0.494
( step: 266999, ) G_Loss: 1.058 G_ID: 0.171 G_Rec: 1.434 G_feat_match: 1.441 D_fake: 0.332 D_real: 1.001 D_loss: 1.334
Save test data
saving the latest model (steps 267000)
```
*(preview image: step_267000)*

netrunner-exe commented 2 years ago

Hello. In your situation I would save a backup of the checkpoint in a separate folder when it reaches 300,000 - 320,000 steps, then (based on the developers' recommendations) continue training up to 500,000 steps and compare the results.
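
A minimal way to snapshot the checkpoint at that point (the folder path is the one used earlier in this thread - adjust it to your own experiment):

```
# Copy the whole experiment folder to a separate backup folder on Drive.
!cp -r "/content/drive/MyDrive/SimSwap/checkpoints/simswap_224_test" \
       "/content/drive/MyDrive/SimSwap/checkpoints/simswap_224_test_step300k_backup"
```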

PeterLorre commented 2 years ago

@osushilover Were you able to successfully train the 512 model? Could you share your results with us please?

osushilover commented 2 years ago

I haven't trained the 512 model. I did train the 224 model, but the results were quite bad. Please see my results in #242.