Hi Quang-Hieu, thanks for providing the code.
I managed to train your code on the S3DIS dataset successfully. However, when I switched to my own dataset, NaN losses appear within the first few iterations. Sometimes the loss is correct for 1-3 iterations (in the 1st epoch) before turning NaN; sometimes it is NaN from the very first iteration. I have tested different learning rates and num_points values, with no luck. I then used "torch.autograd.set_detect_anomaly(True)" to track down the abnormal gradients in the network, and received the error message below. I have spent some time on this but still don't have a clue. Could you please give me some pointers/comments? Thanks in advance!
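One thing I ruled out first (this may or may not apply to your pipeline, and `check_batch` is just a throwaway helper, not part of the repo): with random sampling over very large clouds, a corrupt coordinate that block splitting happened to avoid can slip into a batch. A quick sanity check before the forward pass:

```python
import torch

def check_batch(points, name="batch"):
    # Fail fast if the sampled points already contain NaN/Inf, and
    # report the value range so huge unnormalized coordinates stand out.
    assert torch.isfinite(points).all(), f"{name} contains NaN/Inf"
    print(name, "range:", points.min().item(), points.max().item())

check_batch(torch.randn(4096, 3), "points")
```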
Error Message
train.py: loss += criterion['discriminative'](embedded, masks, size)
discriminative.py: norm = torch.norm(mu, 2, dim=1)
RuntimeError: Function 'NormBackward1' returned nan values in its 0th output.
Differences between my dataset and S3DIS:
maximum num_instances per sample: less than 10
num_classes: less than 30
point sampling: randomly sample 4096 or 8192 points (out of 10e4-10e6 points) per sample (rather than splitting into blocks first)
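That last difference is my main suspect, though I haven't confirmed it: with random subsampling, a batch can contain instances with zero sampled points, so the per-instance mean embedding becomes 0/0. A guarded sketch (`instance_means`, `embedded`, and `masks` are placeholder names, not the repo's actual API):

```python
import torch

def instance_means(embedded, masks):
    """Per-instance mean embeddings, guarded against empty instances.

    embedded: (N, D) point embeddings
    masks:    (N, K) one-hot instance masks; columns may be empty
              after random point sampling
    """
    counts = masks.sum(dim=0)              # (K,) points per instance
    sums = embedded.t() @ masks            # (D, K) summed embeddings
    mu = (sums / counts.clamp(min=1)).t()  # clamp avoids 0/0 for empty instances
    valid = counts > 0                     # instances actually present in the batch
    return mu, valid
```

The `valid` mask can then be used to exclude empty instances from the loss terms entirely, rather than letting their zero means leak into the norm.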