princeton-vl / RAFT-Stereo


A lot of NaN values in n_predictions #66

Open rebecca0011 opened 1 year ago

rebecca0011 commented 1 year ago

I used the rectified YAV images to test the model and got this error:

Traceback (most recent call last):
  File "/home/rc/StereoMatching/RAFT-Stereo/train_stereo.py", line 256, in <module>
    train(args)
  File "/home/rc/StereoMatching/RAFT-Stereo/train_stereo.py", line 167, in train
    loss, metrics = sequence_loss(flow_predictions, flow, valid)
  File "/home/rc/StereoMatching/RAFT-Stereo/train_stereo.py", line 50, in sequence_loss
    assert not torch.isnan(flow_preds[i]).any() and not torch.isinf(flow_preds[i]).any()
AssertionError

I debugged the program and found a lot of NaN values in n_predictions. Could you please give me some advice?

lahavlipson commented 1 year ago

This can happen when training with mixed precision. Two solutions I've found to work:

1) Use full precision. This will use ~2x as much memory, though.

2) Clip large gradients midway through the backward pass. You can do this by wrapping convolutions with this function (a usage sketch follows the snippet below):

import torch

# Gradients whose magnitude exceeds this threshold are zeroed in the backward pass
GRAD_CLIP = 0.01

class GradClip(torch.autograd.Function):
    # Identity in the forward pass; filters gradients in the backward pass

    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_x):
        o = torch.zeros_like(grad_x)
        # Zero out gradients whose magnitude exceeds GRAD_CLIP
        grad_x = torch.where(grad_x.abs() > GRAD_CLIP, o, grad_x)
        # Zero out NaN gradients
        grad_x = torch.where(torch.isnan(grad_x), o, grad_x)
        return grad_x
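
For context, here is a minimal sketch of what "wrapping a convolution" could look like, using the GradClip function from the snippet above. The GradClipConv module, its layer sizes, and the quick check below are illustrative assumptions, not code from RAFT-Stereo:

import torch
import torch.nn as nn

class GradClipConv(nn.Module):
    # Hypothetical wrapper: a conv layer whose backward pass is filtered by GradClip
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        # GradClip is the identity in the forward pass; it only alters gradients
        # flowing back through this point
        return GradClip.apply(self.conv(x))

# Quick check: large upstream gradients are zeroed before reaching the conv weights
layer = GradClipConv(3, 8)
x = torch.randn(1, 3, 32, 32, requires_grad=True)
out = layer(x)
out.backward(torch.full_like(out, 1e3))  # upstream gradient far above GRAD_CLIP
print(torch.count_nonzero(x.grad))       # expected: tensor(0)

Note that, as written, the function zeroes out-of-range or NaN gradients rather than scaling them down, so isolated bad gradients are simply dropped instead of propagating through the rest of the backward pass.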