kkkls / FFTformer

[CVPR 2023] Efficient Frequency Domain-based Transformer for High-Quality Image Deblurring
MIT License

Setting of GoPro training parameters #8

Open plusgood-steven opened 1 year ago

plusgood-steven commented 1 year ago

Hello, I noticed that the parameter settings in the paper differ from the ones in the YAML file. For example, the optimizer is Adam in the former and AdamW in the latter. Additionally, the number of training iterations is 600,000 in the former and 300,000 in the latter. The patch size is 256 in the former and 128 in the latter. Which settings should I follow in practice?

I look forward to hearing back from you.

kkkls commented 1 year ago

Hello, it takes 600,000 iterations to achieve the results in the paper. The patch size should be set to 256; if set to 128, performance will decrease. I hope this helps you.

plusgood-steven commented 1 year ago

Hello, thank you for responding to my question. I truly appreciate your prompt and helpful reply.

sjxAndy commented 1 year ago

> Hello, it takes 600,000 iterations to achieve the results in the paper. The patch size should be set to 256; if set to 128, performance will decrease. I hope this helps you.

Hello, thank you for your fantastic work and code! I encountered some problems while reproducing the results in the paper. I used the same training settings on GoPro, but only got a PSNR of around 30.7 after 600K iterations. Do you know which setting is wrong? The config file I used is pasted below (I changed the dataset config to fit my training environment).

name: GoPro_fftformer3
model_type: ImageRestorationModel
scale: 1
num_gpu: 8
manual_seed: 42

datasets:
  train:
    name: gopro-train
    type: PairedImageDataset
    sdk_gt: s3://Deblur/GoPro/GOPRO_Large/train/
    sdk_lq: s3://Deblur/GoPro/GOPRO_Large/train/
    filename_tmpl: '{}'
    io_backend:
      type: lmdb

    gt_size: 256
    use_flip: true
    use_rot: true

    # data loader
    use_shuffle: true
    num_worker_per_gpu: 4
    batch_size_per_gpu: 1
    dataset_enlarge_ratio: 1
    prefetch_mode: ~

  val:
    name: gopro-test
    type: PairedImageDataset
    sdk_gt: s3://Deblur/GoPro/GOPRO_Large/test/
    sdk_lq: s3://Deblur/GoPro/GOPRO_Large/test/
    io_backend:
      type: lmdb

network_g:
  type: fftformer
  inp_channels: 3
  out_channels: 3
  dim: 48
  num_blocks: [6,6,12]
  num_refinement_blocks: 4
  ffn_expansion_factor: 3
  bias: False

# path
path:
  pretrain_network_g:
  strict_load_g:
  resume_state: 

# training settings
train:
  optim_g:
    type: AdamW
    lr: !!float 1e-3
    weight_decay: !!float 1e-3
    betas: [0.9, 0.9]

  scheduler:
    type: TrueCosineAnnealingLR
    T_max: 600000
    eta_min: !!float 1e-7

  total_iter: 600000
  warmup_iter: -1 # no warm up

  # losses
  pixel_opt:
    type: L1Loss
    loss_weight: 1.0
    reduction: mean

  fft_loss_opt:
    type: FFTLoss
    loss_weight: 0.1
    reduction: mean

# validation settings
val:
  val_freq: !!float 1e4
  save_img: false

  metrics:
    psnr: # metric name, can be arbitrary
      type: calculate_psnr
      crop_border: 0
      test_y_channel: false
    ssim:
      type: calculate_ssim
      crop_border: 0
      test_y_channel: false

# logging settings
logger:
  print_freq: 100
  save_checkpoint_freq: !!float 1e4
  use_tb_logger: true
  wandb:
    project: ~
    resume_id: ~

# dist training settings
dist_params:
  backend: nccl
  port: 29500

GuuJi-cj commented 1 year ago

> Hello, it takes 600,000 iterations to achieve the results in the paper. The patch size should be set to 256; if set to 128, performance will decrease. I hope this helps you.
>
> Hello, thank you for your fantastic work and code! I encountered some problems while reproducing the results in the paper. I used the same training settings on GoPro, but only got a PSNR of around 30.7 after 600K iterations. Do you know which setting is wrong? The config file I used is pasted below (I changed the dataset config to fit my training environment).

> [training config omitted; identical to the config posted above]

I encountered the same problem: I have already run 400K iterations but only got about 29.7 dB, and my config is the same as yours. Did you solve it?

HanzhouLiu commented 1 year ago

> Hello, it takes 600,000 iterations to achieve the results in the paper. The patch size should be set to 256; if set to 128, performance will decrease. I hope this helps you.
>
> Hello, thank you for your fantastic work and code! I encountered some problems while reproducing the results in the paper. I used the same training settings on GoPro, but only got a PSNR of around 30.7 after 600K iterations. Do you know which setting is wrong? The config file I used is pasted below (I changed the dataset config to fit my training environment).

> [training config omitted; identical to the config posted above]

> I encountered the same problem: I have already run 400K iterations but only got about 29.7 dB, and my config is the same as yours. Did you solve it?

Did you manage to reproduce the experimental results reported in the paper?

kkkls commented 1 year ago

I apologize for not replying promptly. I have been busy with other tasks during this period. We pre-sliced the GoPro dataset into patches of size 512x512 during training. Our subsequent work based on this paper was able to achieve the results mentioned in the paper. We hope this answer is helpful to you.
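
For anyone reproducing that preprocessing step, below is a minimal sketch of how such pre-slicing could be done. It is not the authors' actual script: the folder paths, the `crop_patches` helper, and the non-overlapping stride are assumptions made for illustration.

```python
# Hypothetical pre-slicing sketch: cut each GoPro training image into
# 512x512 patches before training (paths, stride, and file layout are
# assumptions, not the authors' actual preprocessing script).
import os
from glob import glob

import cv2  # pip install opencv-python


def crop_patches(img, patch=512, stride=512):
    """Yield patch-sized crops of `img`, scanning top-to-bottom, left-to-right."""
    h, w = img.shape[:2]
    for top in range(0, max(h - patch + 1, 1), stride):
        for left in range(0, max(w - patch + 1, 1), stride):
            yield img[top:top + patch, left:left + patch].copy()


def slice_folder(src_dir, dst_dir, patch=512, stride=512):
    os.makedirs(dst_dir, exist_ok=True)
    for path in sorted(glob(os.path.join(src_dir, "*.png"))):
        img = cv2.imread(path, cv2.IMREAD_COLOR)
        name = os.path.splitext(os.path.basename(path))[0]
        for idx, crop in enumerate(crop_patches(img, patch, stride)):
            cv2.imwrite(os.path.join(dst_dir, f"{name}_{idx:03d}.png"), crop)


if __name__ == "__main__":
    # Blur and sharp folders must be sliced identically so pairs stay aligned.
    slice_folder("GoPro/train/blur", "GoPro/train/blur_crops")
    slice_folder("GoPro/train/sharp", "GoPro/train/sharp_crops")
```

A smaller stride (for example 256) would give overlapping patches and cover the image borders; the exact slicing the authors used is not described in this thread.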

Calvin11311 commented 11 months ago

> Hello, it takes 600,000 iterations to achieve the results in the paper. The patch size should be set to 256; if set to 128, performance will decrease. I hope this helps you.

May I ask whether the whole training process uses only 600,000 iterations? Is there progressive training or any other supplementary training strategy?

kkkls commented 11 months ago

> Hello, it takes 600,000 iterations to achieve the results in the paper. The patch size should be set to 256; if set to 128, performance will decrease. I hope this helps you.
>
> May I ask whether the whole training process uses only 600,000 iterations? Is there progressive training or any other supplementary training strategy?

Hello, the CVPR version was trained for 600,000 iterations in total. Concretely, we first train for 300,000 iterations with a 128×128 patch size and a batch size of 64, with the learning rate following CosineAnnealingLR from 1e-3 down to 1e-7; we then train for another 300,000 iterations with a 256×256 patch size and a batch size of 16, with the learning rate following CosineAnnealingLR from 5e-4 down to 1e-7.
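
As a rough illustration of that two-stage schedule, here is a minimal, self-contained PyTorch sketch of the optimizer and learning-rate setup, assuming the AdamW settings from the config above (weight_decay 1e-3, betas [0.9, 0.9]). It is not the repository's training loop; the forward/backward step is omitted.

```python
# Sketch of the two-stage cosine schedule described above (illustration only).
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

dummy = torch.nn.Linear(4, 4)  # stand-in for the FFTformer parameters

stages = [
    # (patch_size, batch_size, iterations, initial_lr)
    (128, 64, 300_000, 1e-3),
    (256, 16, 300_000, 5e-4),
]

for patch, batch, iters, lr in stages:
    optimizer = AdamW(dummy.parameters(), lr=lr,
                      weight_decay=1e-3, betas=(0.9, 0.9))
    scheduler = CosineAnnealingLR(optimizer, T_max=iters, eta_min=1e-7)
    for _ in range(iters):
        # ...forward/backward and optimizer.step() on `batch` crops of
        # size patch x patch would go here...
        scheduler.step()  # cosine decay from `lr` down to 1e-7 over `iters` steps
    print(patch, batch, optimizer.param_groups[0]["lr"])  # ends near 1e-7
```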

kkkls commented 11 months ago

> Hello, it takes 600,000 iterations to achieve the results in the paper. The patch size should be set to 256; if set to 128, performance will decrease. I hope this helps you.
>
> Hello, thank you for your fantastic work and code! I encountered some problems while reproducing the results in the paper. I used the same training settings on GoPro, but only got a PSNR of around 30.7 after 600K iterations. Do you know which setting is wrong? The config file I used is pasted below (I changed the dataset config to fit my training environment).

> [training config omitted; identical to the config posted above]

A possible reason for this issue is that your learning rate does not match your batch size. To resolve it, you can decrease the learning rate. You can use our updated GoPro.yml for training.
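
A common way to match learning rate to batch size is the linear scaling heuristic, sketched below. Whether the updated GoPro.yml follows this exact rule is not stated here, so treat the numbers as an illustration only.

```python
# Linear LR scaling heuristic (illustration only, not necessarily what the
# authors used): scale the learning rate with the total (effective) batch size.
def scaled_lr(base_lr, base_batch, new_batch):
    return base_lr * new_batch / base_batch

# Example: if lr 1e-3 was tuned for an effective batch of 64, then an
# effective batch of 8 (8 GPUs x batch_size_per_gpu: 1, as in the config
# above) would suggest a considerably smaller learning rate.
print(scaled_lr(1e-3, 64, 8))  # 1.25e-04
```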

Calvin11311 commented 11 months ago

Thank you for your answer. After training finishes, how do you select the best checkpoint? Also, the test result I get with the latest checkpoint differs considerably from the values in the training log: testing gives a PSNR of only 22, while the log shows about 32 at the corresponding iteration. Is there a difference between test.sh and the evaluation performed during training?

kkkls commented 11 months ago

> Thank you for your answer. After training finishes, how do you select the best checkpoint? Also, the test result I get with the latest checkpoint differs considerably from the values in the training log: testing gives a PSNR of only 22, while the log shows about 32 at the corresponding iteration. Is there a difference between test.sh and the evaluation performed during training?

We chose the latest model. Did you evaluate with our test code? The PSNR from our test code differs only very slightly from the PSNR reported during training.
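
If a large gap remains, it is worth confirming that the stand-alone evaluation and the in-training validation use the same PSNR conventions (data range, crop_border, RGB vs. Y channel). Below is a minimal reference PSNR computation for 8-bit images as a sanity check; it is not the repository's calculate_psnr implementation.

```python
# Reference PSNR for 8-bit images (sanity check only; the repository's metric
# may additionally apply options such as crop_border or test_y_channel).
import numpy as np

def psnr(img1, img2, data_range=255.0):
    img1 = img1.astype(np.float64)
    img2 = img2.astype(np.float64)
    mse = np.mean((img1 - img2) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)

# The same data range, border cropping, and channel space must be used when
# comparing training-log PSNR against offline evaluation results.
```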

kkkls commented 11 months ago

> Thank you for your answer. After training finishes, how do you select the best checkpoint? Also, the test result I get with the latest checkpoint differs considerably from the values in the training log: testing gives a PSNR of only 22, while the log shows about 32 at the corresponding iteration. Is there a difference between test.sh and the evaluation performed during training?
>
> We chose the latest model. Did you evaluate with our test code? The PSNR from our test code differs only very slightly from the PSNR reported during training.
>
> My training log is shown below: [screenshot] I tested the latest checkpoint with the following command: [screenshot] Then I computed PSNR with: python scripts/metrics/calculate_psnr_ssim.py --gt /mnt/sda/cds/GoPro/test/groundtruth/ --restored /mnt/sda/cds/FFT_mul/results/fftformer/GoPro_mytrain_30w_128_latestpth and got: [screenshot] The discrepancy is large. If it is convenient, could you add me on QQ? My QQ number is 1694104454. Thank you very much for your previous answers, and I look forward to successfully following your work!

You can add me on WeChat: [WeChat contact image]