szh-bash closed this issue 1 year ago.
@szh-bash Please refer to the log and check if there is a message as follows: "No checkpoint found at iteration 2000." I suspect you did not load the weights correctly. If there are any other questions, please let me know.
Yes, there is. This is a screenshot of the file information in the folder. What should I do?
After I printed (opt, model_path, latest_checkpoint_path) in load_checkpoint() from "trainers/base.py" (line 242), I got the following output:
snapshot_save_iter: 100
snapshot_save_epoch: 5
snapshot_save_start_iter: 100
snapshot_save_start_epoch: 0
image_save_iter: 100
eval_epoch: 1000000000
start_eval_epoch: 1000000000
max_epoch: 2000
max_iter: 1000000000
logging_iter: 10
image_to_tensorboard: True
which_iter: 2000
resume: False
checkpoints_dir: result
name: config/otavatar.yaml
phase: test
gen:
type: models.triplane::TriPlaneGenerator
param:
z_dim: 512
w_dim: 512
c_dim: 25
channel_base: 32768
channel_max: 512
mapping_kwargs:
num_layers: 2
rendering_kwargs:
depth_resolution: 48
depth_resolution_importance: 48
ray_start: 2.25
ray_end: 3.3
box_warp: 1
avg_camera_radius: 2.7
avg_camera_pivot: [0, 0, 0.2]
image_resolution: 512
disparity_space_sampling: False
clamp_mode: softplus
superresolution_module: models.superresolution.SuperresolutionHybrid8XDC
c_gen_conditioning_zero: False
c_scale: 1.0
superresolution_noise_mode: none
density_reg: 0.25
density_reg_p_dist: 0.004
reg_type: l1
decoder_lr_mul: 1.0
sr_antialias: True
num_fp16_res: 0
sr_num_fp16_res: 4
sr_kwargs:
channel_base: 32768
channel_max: 512
fused_modconv_default: inference_only
conv_clamp: None
img_resolution: 512
img_channels: 3
inference:
depth_resolution: 48
depth_resolution_importance: 48
ray_start: 2.25
ray_end: 3.3
box_warp: 1
image_resolution: 512
disparity_space_sampling: False
clamp_mode: softplus
superresolution_module: training.superresolution.SuperresolutionHybrid8XDC
c_gen_conditioning_zero: False
c_scale: 1.0
superresolution_noise_mode: none
density_reg: 0.25
density_reg_p_dis: 0.004
reg_type: l1
decoder_lr_mul: 1.0
sr_antialias: True
checkpoint: pretrained/ffhqrebalanced512-64.pth
dis:
type: discriminators.dummy
data:
name: dummy
type: data.dataset::HDTFDataset
num_workers: 1
path: ./datasets/hdtf_lmdb_inv
resolution: 512
semantic_radius: 13
frames_each_video: 2
train:
batch_size: 4
distributed: True
prefetch_factor: 1
val:
batch_size: 4
distributed: True
prefetch_factor: 1
cross_id: True
cross_id_target: WRA_EricCantor_000
test_data:
name: dummy
type: datasets.images
num_workers: 0
test:
is_lmdb: False
roots:
batch_size: 1
trainer:
model_average: False
model_average_beta: 0.9999
model_average_start_iteration: 1000
model_average_batch_norm_estimation_iteration: 30
model_average_remove_sn: True
image_to_tensorboard: False
hparam_to_tensorboard: False
distributed_data_parallel: pytorch
delay_allreduce: True
gan_relativistic: False
gen_step: 1
dis_step: 1
type: trainers.decouple_by_invert::FaceTrainer
use_sr: True
sr_iters: 10
accum_ratio:
G: 0.95
Warp: 0.95
inversion:
iterations: 100
warp_lr_mult: 100
asynchronous_update: successively
warp_update_iters: 10
loss_weight:
mask_rate: 1
inverse: 1.0
refine: 1.0
local: 10.0
TV: 1.0
monotonic: 1.0
pixel: 1
id: 1.0
p_norm: 0.0
a_norm: 0.0
a_mutual: 0.0
vgg_param_lr:
network: vgg19
layers: ['relu_1_1', 'relu_2_1', 'relu_3_1', 'relu_4_1', 'relu_5_1']
use_style_loss: True
num_scales: 2
style_to_perceptual: 250
vgg_param_sr:
network: vgg19
layers: ['relu_1_1', 'relu_2_1', 'relu_3_1', 'relu_4_1', 'relu_5_1']
use_style_loss: True
num_scales: 4
style_to_perceptual: 250
init:
type: xavier
gain: 0.02
cudnn:
deterministic: False
benchmark: True
pretrained_weight:
inference_args:
distributed: True
results_dir: ./eval_results
w_samples: 600
warp_optimizer:
type: adamw
lr: 0.001
refine_only: False
adam_beta1: 0.5
adam_beta2: 0.999
lr_policy:
iteration_mode: True
type: step
step_size: 10000
gamma: 0.2
weight_decay: 1
inverse_optimizer:
type: adam
lr: 0.01
adam_beta1: 0.9
adam_beta2: 0.999
gen_optimizer:
type: adam
lr: 0.0001
adam_beta1: 0.9
adam_beta2: 0.9999
sr_only: False
lr_policy:
iteration_mode: True
type: step
step_size: 10000
gamma: 0.2
camera_optimizer:
type: adamw
lr: 0.01
adam_beta1: 0.9
adam_beta2: 0.9999
warp:
type: models.controller::VideoCodeBook
param:
descriptor_nc: 256
mapping_layers: 3
mlp_layers: 5
if_use_pose: True
if_plus_scaling: False
if_short_cut: True
directions: 20
checkpoint: None
local_rank: 0
device: 0
logdir: result/config/otavatar.yaml
result/config/otavatar.yaml/*_iteration_000002000_checkpoint.pt
[]
No checkpoint found at iteration 2000.
It appears there is an issue with the "model_path" variable: I found that a folder exists at "./result/cofng/otavatar.yaml/". What could be causing this problem, and how can it be thoroughly resolved?
After moving epoch_00005_iteration_000002000_checkpoint.pt to "./result/config/otavatar.yaml/", the "No checkpoint found at iteration 2000." message disappeared.
So is the logic that inference.py creates the folder "./result/$name$", and we need to make sure the checkpoint "*_iteration_$which_iter$_checkpoint.pt" is placed there?
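For reference, here is a minimal sketch of what I assume the lookup does, based only on the printed model_path, the empty list, and the error message above (the variable names are guesses, not the repository's actual code in trainers/base.py):

```python
import glob
import os

# Guessed reconstruction of the checkpoint lookup, based on the printed output above;
# variable names are illustrative, not copied from trainers/base.py.
checkpoints_dir = "result"
name = "config/otavatar.yaml"   # values taken from the printed config
which_iter = 2000

logdir = os.path.join(checkpoints_dir, name)
pattern = os.path.join(logdir, f"*_iteration_{which_iter:09d}_checkpoint.pt")
matches = glob.glob(pattern)    # an empty list here would trigger "No checkpoint found"
print(pattern)
print(matches)
```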
Now it seems to be working normally.
BTW, is it normal for the eyes to be out of sync? And how can I speed up training? There seems to be no speed-up after changing "--nproc_per_node=1" to "--nproc_per_node=2" (both are 57s/iter on 3090 x1 or x2).
export CUDA_VISIBLE_DEVICES=0
python -m torch.distributed.launch --nproc_per_node=1 --master_port 12346 train_inversion.py \
--config ./config/otavatar.yaml \
--name otavatar
export CUDA_VISIBLE_DEVICES=0,1
python -m torch.distributed.launch --nproc_per_node=2 --master_port 12346 train_inversion.py \
--config ./config/otavatar.yaml \
--name otavatar
@szh-bash 1) Thanks for pointing out the typo! I will fix it in the readme. 2) According to our experiments, eye synchronization is non-trivial to achieve. Our explanation is: first, EG3D (like other GANs) is trained on FFHQ, which does not contain enough eye-closing faces, so closing the eyes is not feasible with latent manipulation alone; second, as you may see in the pseudo-code in our paper, we only finetune EG3D with a 1e-4 learning rate for 2000 iterations, so the model weights are only slightly modified. Other mechanisms, e.g. dense image warping, might achieve natural eye synchronization, but that is beyond the scope of our paper. 3) "--nproc_per_node=2" makes the 2 GPUs train the model with the same per-GPU batch size, therefore the total batch size is doubled.
@theEricMa Thank you for the detailed explanation, best wishes!
@theEricMa Hi, I found that the model I trained behaves worse than the pretrained model. Is the reason that 6 epochs (pretrained model) is better than 1 epoch?
But I got NaN after 2290 iterations in epoch 0. How can I train for 5 epochs?
Hi, may I know how many GPUs you are using for training? Mine is 4 A100s (80 GB memory), so the batch size is 8 (per GPU) * 4 (GPUs) = 32; therefore the 2000 iterations cover more than 1 epoch. If you cannot fit a batch size of 8 per GPU, please try more GPUs. A larger batch size leads to more stable training.
By the way, training with limited GPU resources would be feasible with gradient accumulation. If you prefer this way, please modify the code here by first making this function also take the iteration number as input, and then have both optimizers call .step() only when iteration number % <accumulate_size> == 0.
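A minimal toy sketch of this gradient-accumulation pattern (the toy model, the single optimizer, and accumulate_size are illustrative assumptions, not the repository's actual trainer code):

```python
import torch

# Toy illustration of gradient accumulation; in the real trainer the same
# `iteration % accumulate_size == 0` check would gate both optimizers.
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
accumulate_size = 4  # effective batch size = per-iteration batch size * accumulate_size

for iteration in range(1, 101):
    x = torch.randn(2, 8)                              # small per-GPU micro-batch
    loss = model(x).pow(2).mean() / accumulate_size    # scale so accumulated grads average out
    loss.backward()                                    # gradients keep accumulating
    if iteration % accumulate_size == 0:               # update weights only every N iterations
        optimizer.step()
        optimizer.zero_grad()
```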
How can I check the current batch size (per GPU)? Does this log mean batch size = 4 per GPU? Training on 3090s (24 GB memory) x2|x4|x8 showed the same batch size here.
Yes, the screenshot means batch size = 4 per GPU for sure. The training is configurable in config/otavatar.yaml; you can change the batch size there.
I also ran into this problem. I have run the training experiment several times, and each time the loss becomes NaN in epoch 0 or epoch 1. 4x A100, batch_size = 8 or 4. How can I fix this?
You can try the following modifications:
1) Decrease style_to_perceptual for both vgg_param_lr and vgg_param_sr. It supervises the 'style' consistency using Gram matrices, but it is not the most important loss term. You can even choose not to use it at all by setting both use_style_loss options to False.
2) In the optimize_parameters() function from trainers/decouple_by_invert.py, change the summation of all the loss terms into the following, so as to isolate the NaN term.
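The exact snippet referenced as "the following" is not reproduced in this thread; a minimal sketch of the idea, with an assumed dictionary of named loss tensors, could look like this:

```python
import torch

# Illustrative sketch: sum the loss terms one at a time and report any term that is NaN,
# instead of adding them all in a single expression. Names are assumptions, not the
# actual variables in trainers/decouple_by_invert.py.
def sum_and_check_losses(loss_terms):
    total = 0.0
    for name, value in loss_terms.items():
        if torch.isnan(value).any():
            print(f"loss term '{name}' became NaN")  # isolates the offending term
            continue  # or raise here, depending on how you want to debug
        total = total + value
    return total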
Please contact me if neither operation works.
After running this command, I got these videos. It is obvious that only the pose has been transferred from the target image. How can I fix this?