universome / stylegan-v

[CVPR 2022] StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2
https://universome.github.io/stylegan-v

Training diverges #7

Closed torxxtorxx closed 2 years ago

torxxtorxx commented 2 years ago

I am trying to reproduce the FaceForensics results. I train with all the default settings, and after 1k kimgs the training diverges while FVD16 is still at around 130. What could be the reason? I tried different seeds; same problem.

universome commented 2 years ago

Hi! Could you please give some details on your issue? In which metric does it diverge? What happens to the video quality upon manual inspection? Here is our typical plot of training on FaceForensics 256x256. I suspect that there might also be an issue with dataset preprocessing.

[Screenshot: typical training curves on FaceForensics 256x256]
torxxtorxx commented 2 years ago

Hi, thanks. It diverges in both FID and FVD16; I do not track the other metrics. Video quality is also not great. Did you do anything special when preprocessing FaceForensics? FVD16 first goes down to about 130, then after 1k kimgs it goes up to 400-500 and does not come down anymore. But I stopped after 5-6k kimgs, so maybe I also have to train longer.
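
For context on the metrics discussed here: FID and FVD are both Fréchet distances between Gaussians fitted to feature statistics (image-network features for FID, video-network features for FVD; FVD16 scores 16-frame clips). A minimal sketch of the distance itself, assuming the feature matrices have already been extracted (feature extraction is omitted):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two feature sets.

    feats_* are (num_samples, feature_dim) arrays, e.g. features from a
    pretrained video network for FVD or an image network for FID.
    """
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_fake, rowvar=False)
    # sqrtm of the covariance product can pick up a tiny imaginary part numerically
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

rng = np.random.default_rng(0)
print(frechet_distance(rng.normal(size=(512, 64)), rng.normal(size=(512, 64))))
```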

universome commented 2 years ago

If the training diverges at 1k kimgs and does not improve for another 4k kimgs, I think you are right to stop it there. My first guess would be a dataset issue. We preprocess FaceForensics with the src/scripts/preprocess_ffs.py script, which crops faces from the videos. Did you use it? Just in case, here is our preprocessed dataset: https://disk.yandex.ru/d/wlWUPKgDZO7WWg (it might be the case that you are only allowed to download it if you have received access to the original FaceForensics).
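
For illustration only, here is a minimal sketch of the kind of face-crop preprocessing such a script performs (detect a face per frame, crop, resize). The OpenCV Haar-cascade detector and all names here are stand-ins, not the actual `preprocess_ffs.py` implementation:

```python
import cv2  # OpenCV; the detector here is a stand-in, not what preprocess_ffs.py uses

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_faces(video_path: str, out_dir: str, size: int = 256) -> None:
    """Detect the largest face in each frame, crop it, and save resized frames."""
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            continue  # skip frames without a detection
        x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # largest detection
        crop = cv2.resize(frame[y:y + h, x:x + w], (size, size))
        cv2.imwrite(f"{out_dir}/{idx:06d}.jpg", crop)
        idx += 1
    cap.release()
```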

Also, did you change any hyperparameters (e.g. batch size, learning rate, etc.)?

For example, here is the experiment config (saved as `experiment_config.yaml` by the launching script) for our experiment (but note that it comes from a slightly different repo/code structure):

```yaml
model:
  generator:
    source: networks
    use_noise: false
    w_dim: 512
    c_dim: ${dataset.c_dim}
    input:
      type: temporal
      resolution: 4
      has_const: true
    motion:
      time_enc: {}
      z_dim: 512
      v_dim: 512
      motion_z_distance: 32
      gen_strategy: conv
      kernel_size: 11
      long_history: true
      use_fractional_t: true
      fourier: true
    sampling: ${sampling}
    z_dim: 512
    time_enc:
      cond_type: concat_const
      num_freqs: 256
      min_period_len: 16
      max_period_len: 1024
      num_opened_dims: ${model.generator.time_enc.num_freqs}
      phase_dropout_std: 1.0
  discriminator:
    source: networks
    mbstd_group_size: 4
    concat_res: 16
    sampling: ${sampling}
    num_frames_div_factor: 2
    dummy_c: false
    dummy_synth_cfg:
      use_noise: false
      hyper_type: no_hyper
  loss_kwargs:
    source: StyleGAN2Loss
    type: non_saturating
    style_mixing_prob: 0.0
    pl_weight: 0.0
    motion_reg:
      coef: 0.0
    video_consistent_aug: true
  optim:
    generator: {}
    discriminator: {}
  name: stylegan-v
dataset:
  path: data/${dataset.name}.zip
  sampling: ${sampling}
  c_dim: 0
  max_num_frames: 1024
  fps: 30
  resolution: 256
  path_for_slurm_job: ${env.datasets_dir}/${dataset.name}.zip
  name: ffs_${dataset.resolution}_unstable
sampling:
  name: random${sampling.num_frames_per_video}_max${sampling.max_dist}
  fps: ${dataset.fps}
  max_num_frames: ${dataset.max_num_frames}
  num_frames_per_video: 3
  type: random
  total_dists: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]
  max_dist: 32
env:
  python_bin: ${env.project_path}/env-ampere/bin/python
  before_train_commands:
    - module unload cuda
    - module load cuda
    - module load cmake
    - module load gcc/8.2.0
    - conda init bash
    - source $HOME/.bashrc
    - conda activate /home/skoroki/rnf/env-ampere
  torch_extensions_dir: /tmp/torch_extensions
  objects_to_copy:
    - ${env.project_path}/src
    - ${env.project_path}/configs
  symlinks_to_create:
    - ${env.project_path}/data
  tmp_dir: /tmp
  datasets_dir: null
  slurm_constraint: a100
  project_path: /home/skoroki/video-sg
  symlink_output: /ibex/ai/home/skoroki/rnf/${experiments_dir}/${experiment_name_with_hash}/output
env_args:
  project_dir: ${project_release_dir}
  python_bin: ${env.python_bin}
  python_script: ${project_release_dir}/src/infra/slurm_job.py
num_gpus: 4
print_only: false
use_qos: false
git_hash: 6b8cd4a
exp_suffix: default_hp
experiment_name: ${dataset.name}_${model.name}_${sampling.name}_${exp_suffix}
experiment_name_with_hash: ${experiment_name}-${git_hash}
experiments_dir: experiments
project_release_dir: ${env.project_path}/${experiments_dir}/${experiment_name_with_hash}
job_sequence_length: 1
slurm_log_dir: ${project_release_dir}
sbatch_args:
  constraint: ${env.slurm_constraint}
  time: 1-0
  gres: gpu:${num_gpus}
  cpus-per-task: 5
  mem: 256G
  cpus-per-gpu: 5
  comment: ${experiment_name}
sbatch_args_str: --constraint=a100 --time=1-0 --gres=gpu:4 --cpus-per-task=5 --mem=256G --cpus-per-gpu=5 --comment=ffs_256_unstable_stylegan-v_random3_max32_default_hp
env_args_str: --project_dir=/home/skoroki/video-sg/experiments/ffs_256_unstable_stylegan-v_random3_max32_default_hp-6b8cd4a --python_bin=/home/skoroki/video-sg/env-ampere/bin/python --python_script=/home/skoroki/video-sg/experiments/ffs_256_unstable_stylegan-v_random3_max32_default_hp-6b8cd4a/src/infra/slurm_job.py
training:
  outdir: ${project_release_dir}
  data: ${dataset.path}
  gpus: ${num_gpus}
  cfg: auto
  snap: 200
  kimg: 25000
  metrics: [fvd2048_16f, fvd2048_128f, fvd2048_128f_subsample8f, fid50k_full]
  aug: ada
  mirror: true
  batch_size: 64
  resume: null
  seed: 0
  dry_run: false
  cond: false
  subset: null
  p: null
  target: 0.6
  augpipe: bgc
  freezed: 0
  fp32: false
  nhwc: false
  nobench: false
  allow_tf32: false
  num_workers: 3
```
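
As an aside for readers decoding the `sampling` block above (3 random frames per video, at most 32 frames apart): here is a rough sketch of what such a sampler could look like. This is a paraphrase of the config, not the repo's actual implementation:

```python
import random

def sample_frame_indices(video_len: int, num_frames: int = 3, max_dist: int = 32):
    """Pick `num_frames` sorted frame indices that all fall inside one random
    window of `max_dist` frames, mirroring the config's
    `sampling: {type: random, num_frames_per_video: 3, max_dist: 32}`.
    Assumes video_len >= num_frames."""
    window = min(max_dist, video_len)
    start = random.randint(0, video_len - window)           # place the window
    offsets = sorted(random.sample(range(window), num_frames))  # distinct offsets
    return [start + o for o in offsets]

print(sample_frame_indices(video_len=300))  # e.g. [141, 150, 166]
```
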
torxxtorxx commented 2 years ago

Thank you for the great help! I will try your dataset to make sure. I already had a preprocessed FaceForensics dataset that worked for reproducing other video generators. But I forgot to mention that I used a batch size of 32 and 2 instead of 3 frames during training. The difference between 2 frames and your optimal 3 wasn't significant, so I thought this should be fine. I can also try batch size 64 via gradient accumulation, but I did not expect this to lead to a non-converging result.
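
For reference, gradient accumulation emulates batch size 64 by summing gradients over two micro-batches of 32 before each optimizer step. A minimal, generic PyTorch sketch (the toy model and data are stand-ins, not this repo's training loop):

```python
import torch
import torch.nn as nn

# Toy stand-ins; in practice these come from the real training setup.
model = nn.Linear(16, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
loader = [(torch.randn(32, 16), torch.randn(32, 1)) for _ in range(4)]

accum_steps = 2  # two micro-batches of 32 -> effective batch size 64

opt.zero_grad(set_to_none=True)
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y)
    (loss / accum_steps).backward()  # scale so gradients average across micro-batches
    if (step + 1) % accum_steps == 0:
        opt.step()
        opt.zero_grad(set_to_none=True)
```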

universome commented 2 years ago

Hmm, using a batch size of 32 instead of 64 and 2 instead of 3 frames shouldn't lead to such problems. I've relaunched training on our version of the dataset from the current repo state and will report the results tomorrow.

universome commented 2 years ago

So, I've just launched it from the current git hash (bfaad07) and here are the training curves:

[Screenshot: training curves for the relaunched experiment]

It seems to converge fine, which is why I suspect that you might have a different dataset or be using different hyperparameters.

Here is the `experiment_config.yaml` of that experiment:

```yaml
model:
  generator:
    source: networks
    use_noise: false
    w_dim: 512
    c_dim: ${dataset.c_dim}
    input:
      type: temporal
    motion:
      time_enc: {}
      z_dim: 512
      v_dim: 512
      motion_z_distance: ${model.generator.time_enc.min_period_len}
      gen_strategy: conv
      kernel_size: 11
      use_fractional_t: true
      fourier: true
    sampling: ${sampling}
    z_dim: 512
    time_enc:
      cond_type: concat_const
      dim: 256
      min_period_len: 16
      max_period_len: 1024
      phase_dropout_std: 1.0
  discriminator:
    source: networks
    mbstd_group_size: 4
    sampling: ${sampling}
    concat_res: 16
    num_frames_div_factor: 2
    dummy_c: false
  loss_kwargs:
    source: StyleGAN2Loss
    style_mixing_prob: 0.0
    pl_weight: 0.0
    motion_reg:
      coef: 0.0
    video_consistent_aug: true
  optim:
    generator: {}
    discriminator: {}
  name: stylegan-v
dataset:
  path: data/${dataset.name}.zip
  sampling: ${sampling}
  c_dim: 0
  max_num_frames: 1024
  fps: 30
  resolution: 256
  path_for_slurm_job: ${env.datasets_dir}/${dataset.name}.zip
  name: ffs_${dataset.resolution}_unstable
sampling:
  name: random${sampling.num_frames_per_video}_max${sampling.max_dist}
  fps: ${dataset.fps}
  max_num_frames: ${dataset.max_num_frames}
  num_frames_per_video: 3
  type: random
  total_dists: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]
  max_dist: 32
env:
  python_bin: ${env.project_path}/env-ampere/bin/python
  before_train_commands:
    - module unload cuda
    - module load cuda
    - module load cmake
    - module load gcc/8.2.0
    - conda init bash
    - source $HOME/.bashrc
    - conda activate /home/skoroki/stylegan-v/env-ampere
  torch_extensions_dir: /tmp/torch_extensions
  objects_to_copy:
    - ${env.project_path}/src
    - ${env.project_path}/configs
  symlinks_to_create:
    - ${env.project_path}/data
  tmp_dir: /tmp
  datasets_dir: null
  slurm_constraint: v100
  project_path: /home/skoroki/video-sg
  symlink_output: /ibex/ai/home/skoroki/stylegan-v/${experiments_dir}/${experiment_name_with_hash}/output
num_gpus: 4
print_only: false
git_hash: bfaad07
exp_suffix: default-hp
experiment_name: ${dataset.name}_${model.name}_${sampling.name}_${exp_suffix}
experiment_name_with_hash: ${experiment_name}-${git_hash}
experiments_dir: experiments
project_release_dir: ${env.project_path}/${experiments_dir}/${experiment_name_with_hash}
slurm: false
job_sequence_length: 1
slurm_log_dir: ${project_release_dir}
use_qos: false
sbatch_args:
  constraint: ${env.slurm_constraint}
  time: 1-0
  gres: gpu:${num_gpus}
  cpus-per-task: 5
  mem: 256G
  cpus-per-gpu: 5
  comment: ${experiment_name}
sbatch_args_str: --constraint=v100 --time=1-0 --gres=gpu:4 --cpus-per-task=5 --mem=256G --cpus-per-gpu=5 --comment=ffs_256_unstable_stylegan-v_random3_max32_default-hp
env_args:
  project_dir: ${project_release_dir}
  python_bin: ${env.python_bin}
  python_script: ${project_release_dir}/src/infra/slurm_job.py
env_args_str: --project_dir=/home/skoroki/video-sg/experiments/ffs_256_unstable_stylegan-v_random3_max32_default-hp-bfaad07 --python_bin=/home/skoroki/video-sg/env-ampere/bin/python --python_script=/home/skoroki/video-sg/experiments/ffs_256_unstable_stylegan-v_random3_max32_default-hp-bfaad07/src/infra/slurm_job.py
training:
  outdir: ${project_release_dir}
  data: ${dataset.path}
  gpus: ${num_gpus}
  cfg: auto
  snap: 200
  kimg: 25000
  metrics: [fvd2048_16f, fvd2048_128f, fvd2048_128f_subsample8f, fid50k_full]
  aug: ada
  mirror: true
  batch_size: 64
  resume: null
  seed: 0
  dry_run: false
  cond: false
  subset: null
  p: null
  target: 0.6
  augpipe: bgc
  freezed: 0
  fp32: false
  nhwc: false
  nobench: false
  allow_tf32: false
  num_workers: 3
```
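
As a side note on the `time_enc` block above (periods between `min_period_len: 16` and `max_period_len: 1024` frames): here is a minimal sketch of such a sinusoidal time encoding, with log-spaced periods assumed for illustration. The repo's actual positional encoding (with `phase_dropout_std`, conditioning, etc.) is more involved:

```python
import numpy as np

def time_encoding(t: np.ndarray, num_freqs: int = 256,
                  min_period: float = 16.0, max_period: float = 1024.0) -> np.ndarray:
    """Sinusoidal encoding of frame timestamps `t` with `num_freqs` periods
    log-spaced between min_period and max_period, loosely mirroring the
    `time_enc` config above. Illustrative only, not the repo's code."""
    periods = np.geomspace(min_period, max_period, num_freqs)      # (F,)
    angles = 2.0 * np.pi * t[:, None] / periods[None, :]           # (T, F)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)  # (T, 2F)

enc = time_encoding(np.arange(8, dtype=np.float64))
print(enc.shape)  # (8, 512)
```
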
torxxtorxx commented 2 years ago

Thank you, the problem is fixed now. Sorry for causing the extra work!

universome commented 2 years ago

No worries, feel free to ask if you have any further questions!