sarafridov / K-Planes


Training on Custom Data #26

Closed: LzqInsta closed this issue 1 year ago

LzqInsta commented 1 year ago

Thanks for your great work. I am trying to run K-Planes on my own dataset, namely video sequences from the Neuman dataset. However, I get a NaN loss after about 10,000 steps. Here is my training config, which is modified from the dynerf_hybrid config; I only changed the scene_bbox and the number of training steps. Could you give me some advice on which loss or operation in K-Planes might cause the invalid NaN output?

config = {
 'expname': 'neuman_bike',
 'logdir': './logs_neuman/bike',
 'device': 'cuda:0',

 # Run first for 1 step with data_downsample=4 to generate weights for ray importance sampling
 'data_downsample': 1,
 'data_dirs': ['/opt/data/llff_data/neuman_bike/'],
 'contract': False,
 'ndc': True,
 'ndc_far': 2.6,
 'near_scaling': 0.95,
 'isg': False,
 'isg_step': -1,
 'ist_step': 50000,
 'keyframes': False,
 'scene_bbox': [
  [-10.939920205538575, -2.0469914783289735, -1.0306140184402466],
  [7.077569125469017, 1.5071640571195142, 12.159653639708578]
 ],
 # Optimization settings
 'num_steps': 200001,
 'batch_size': 4096,
 'scheduler_type': 'warmup_cosine',
 'optim_type': 'adam',
 'lr': 0.004,

 # Regularization
 'distortion_loss_weight': 0.001,
 'histogram_loss_weight': 1.0,
 'l1_time_planes': 0.0001,
 'l1_time_planes_proposal_net': 0.0001,
 'plane_tv_weight': 0.0002,
 'plane_tv_weight_proposal_net': 0.0002,
 'time_smoothness_weight': 0.001,
 'time_smoothness_weight_proposal_net': 1e-05,

 # Training settings
 'save_every': 20000,
 'valid_every': 20000,
 'save_outputs': True,
 'train_fp16': False,

 # Raymarching settings
 'single_jitter': False,
 'num_samples': 48,
 'num_proposal_samples': [256, 128],
 'num_proposal_iterations': 2,
 'use_same_proposal_network': False,
 'use_proposal_weight_anneal': True,
 'proposal_net_args_list': [
  {'num_input_coords': 4, 'num_output_coords': 8, 'resolution': [128, 128, 128, 150]},
  {'num_input_coords': 4, 'num_output_coords': 8, 'resolution': [256, 256, 256, 150]}
 ],

 # Model settings
 'concat_features_across_scales': True,
 'density_activation': 'trunc_exp',
 'linear_decoder': False,
 'multiscale_res': [1, 2, 4, 8],
 'grid_config': [{
  'grid_dimensions': 2,
  'input_coordinate_dim': 4,
  'output_coordinate_dim': 16,
  'resolution': [64, 64, 64, 150]
 }],
}
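
One way to narrow down which term produces the NaN is to check every loss component for finiteness right before the backward pass. Below is a minimal sketch of that check; the loss_dict of individual terms and the term names are placeholders for whatever the actual training loop computes, not the K-Planes API:

import torch

def sum_and_check_losses(loss_dict, step):
    """Sum the loss terms, reporting the first one that becomes non-finite."""
    total = 0.0
    for name, value in loss_dict.items():
        if not torch.isfinite(value):
            raise RuntimeError(f"step {step}: loss term '{name}' is {value.item()}")
        total = total + value
    return total

# Illustrative usage inside the training step (names are placeholders):
# loss = sum_and_check_losses({'mse': mse, 'plane_tv': tv, 'time_smoothness': ts}, step)
# loss.backward()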
sarafridov commented 1 year ago

What does the output look like before you start getting NaNs? My first guess would be that the lr is too high, in which case you'd likely see the PSNR diverging before the NaNs appear. If lowering the lr doesn't fix it, looking at the pre-NaN output (rendered train/test views, the PSNR plot, and the feature planes) can help diagnose what is happening.
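
If the goal is to find the exact operation that first produces a NaN, PyTorch's anomaly detection plus a per-step gradient-norm log makes the pre-NaN divergence visible. This is a generic debugging sketch, not part of the K-Planes codebase; model, loss, and step stand in for whatever the training loop actually uses:

import torch

torch.autograd.set_detect_anomaly(True)  # raises at the op that created the NaN (slow; debug runs only)

# inside the training step, after loss.backward() and before optimizer.step():
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1e9)  # huge max_norm: measure only
if step % 100 == 0:
    print(f"step {step}: loss={loss.item():.4f} grad_norm={grad_norm.item():.2f}")

A gradient norm that climbs steadily before the first NaN usually points to the lr (or a regularizer weight) being too high.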

LzqInsta commented 1 year ago

Thanks for your kind reply. The training PSNR keeps increasing right up until the NaN loss appears, but a smaller lr does delay the NaNs, so I will try different lr values later. Just before the NaNs, the plane TV and time-smoothness losses drop rapidly. Do you have any suggestions about this strange behavior? [attached image]

sarafridov commented 1 year ago

Hmm, this is strange (I don't think I've seen behavior like this before). Given the odd behavior of the feature-plane regularizers, you might try adjusting (probably lowering) their weights. It would also be informative to look at the feature planes and renderings at different checkpoints during optimization (before and after this behavior starts) so you can see the effects of the regularizers. If things get very blurred or blocky, that's another indicator that the regularizer weights are too high; if you see ghosting or floaters, that's a sign the TV weight might be too low.
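
To inspect the feature planes from a saved checkpoint, something like the sketch below dumps each 2D plane as a grayscale image. The checkpoint path, the assumption that plane parameters contain 'grids' in their state-dict keys, and the assumed [1, C, H, W] layout are guesses that may need adapting to the actual saved files:

import torch
from PIL import Image

ckpt = torch.load('logs_neuman/bike/model.pth', map_location='cpu')  # path is illustrative
state = ckpt.get('model', ckpt)  # fall back to the raw dict if there is no 'model' key

for name, tensor in state.items():
    if 'grids' in name and torch.is_tensor(tensor) and tensor.dim() == 4:  # assumed layout: [1, C, H, W]
        plane = tensor[0].norm(dim=0)              # collapse feature channels to a magnitude map
        img = (plane / plane.max().clamp(min=1e-8) * 255).byte().numpy()
        Image.fromarray(img).save(f"{name.replace('.', '_')}.png")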

LzqInsta commented 1 year ago

You are correct: the rendered images are very blurred. I will revisit the hyperparameters and check my dataloader code again. Thanks for your kind reply. I will close this issue for now.