tusimple training stuck at epoch #1

harryhan618 / SCNN_Pytorch

Pytorch implementation of "Spatial As Deep: Spatial CNN for Traffic Scene Understanding"

MIT License


tusimple training stuck at epoch #1 #32

Closed alchemz closed 4 years ago

alchemz commented 4 years ago

Hi Harry,

I have checked all the existing issues and found no solution to this problem. When I launch TuSimple training with train.py, it runs train epoch #0 and val epoch #0, and then always gets stuck at train epoch #1. I also wonder whether the results reported in readme.md come from only 1 epoch?

Below is the TensorBoard output; as you can see, there are no logs for the val loss. Is that normal? (Screenshot from 2020-03-04 11-07-44)

alchemz commented 4 years ago

After setting num_workers=0, the hang at train epoch #1 is resolved. But the val_loss plot still does not show up until several epochs later.

@harryhan618 Could you explain why you put tensorboard.scalar_summary("val_loss", val_loss, iter_idx) outside of the for loop at line 184 in train.py?
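For anyone hitting the same hang, here is a minimal sketch of how the workaround could be made switchable. The helper name `make_loader_kwargs` and the batch size are hypothetical and not taken from train.py; the only assumption is that the resulting dictionary is passed to PyTorch's `torch.utils.data.DataLoader`, whose `num_workers=0` setting loads data in the main process instead of in worker subprocesses.

```python
def make_loader_kwargs(batch_size, num_workers=0, shuffle=True):
    """Build keyword arguments for torch.utils.data.DataLoader.

    Hypothetical helper: num_workers=0 loads batches in the main process,
    which avoids the worker subprocesses involved in the hang.
    """
    if num_workers < 0:
        raise ValueError("num_workers must be >= 0")
    return {
        "batch_size": batch_size,
        "shuffle": shuffle,
        "num_workers": num_workers,
    }


# Workaround from this thread: no worker subprocesses.
train_kwargs = make_loader_kwargs(batch_size=8, num_workers=0)

# With PyTorch installed, the dictionary would be used like this
# (train_dataset is a placeholder for the actual dataset object):
# train_loader = torch.utils.data.DataLoader(train_dataset, **train_kwargs)
```

Loading in the main process is usually slower, so this is probably best treated as a workaround rather than a fix for the underlying cause.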

harryhan618 commented 4 years ago

@alchemz Hi! I put the TensorBoard logging outside of the loop during validation because I only want to know the loss over the whole epoch. The loss of a single iteration during validation is meaningless.
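A minimal sketch of that pattern, for illustration only: the function `validate_epoch` and the `record` callback are hypothetical stand-ins, not the code in train.py, and averaging the batch losses is an assumption (the actual script may sum them instead).

```python
def validate_epoch(batch_losses, log_scalar, step):
    """Accumulate per-batch losses and log a single value for the epoch."""
    total = 0.0
    count = 0
    for loss in batch_losses:
        # Inside the loop: only accumulate, do not log per iteration.
        total += loss
        count += 1
    mean_loss = total / count if count else float("nan")
    # Outside the loop: one scalar per validation epoch.
    log_scalar("val_loss", mean_loss, step)
    return mean_loss


logged = []


def record(tag, value, step):
    """Stand-in for a TensorBoard scalar logger; it just stores the call."""
    logged.append((tag, value, step))


epoch_loss = validate_epoch([0.5, 0.25, 0.75], record, step=100)
```

With this layout the val_loss curve gets exactly one point per validation epoch, so it fills in much more slowly than a per-iteration training loss curve.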