megvii-research / CREStereo

Official MegEngine implementation of CREStereo (CVPR 2022 Oral).
Apache License 2.0

nan #23

Open jim88481 opened 2 years ago

jim88481 commented 2 years ago

```
2022/06/01 14:17:17 Model params saved: train_logs/models/epoch-1.mge
2022/06/01 14:17:25 0.66 b/s,passed:00:13:16,eta:21:41:36,data_time:0.16,lr:0.0004,[2/100:5/500] ==> loss:26.19
2022/06/01 14:17:32 0.65 b/s,passed:00:13:24,eta:21:40:40,data_time:0.17,lr:0.0004,[2/100:10/500] ==> loss:6.847
2022/06/01 14:17:40 0.68 b/s,passed:00:13:31,eta:21:39:57,data_time:0.14,lr:0.0004,[2/100:15/500] ==> loss:6.83
2022/06/01 14:17:47 0.67 b/s,passed:00:13:39,eta:21:39:12,data_time:0.16,lr:0.0004,[2/100:20/500] ==> loss:16.89
2022/06/01 14:17:55 0.66 b/s,passed:00:13:46,eta:21:38:28,data_time:0.17,lr:0.0004,[2/100:25/500] ==> loss:43.18
2022/06/01 14:18:02 0.66 b/s,passed:00:13:54,eta:21:37:36,data_time:0.17,lr:0.0004,[2/100:30/500] ==> loss:20.37
2022/06/01 14:18:10 0.65 b/s,passed:00:14:01,eta:21:36:52,data_time:0.18,lr:0.0004,[2/100:35/500] ==> loss:15.24
2022/06/01 14:18:17 0.65 b/s,passed:00:14:09,eta:21:36:18,data_time:0.19,lr:0.0004,[2/100:40/500] ==> loss:9.399
2022/06/01 14:18:25 0.67 b/s,passed:00:14:16,eta:21:35:41,data_time:0.16,lr:0.0004,[2/100:45/500] ==> loss:40.27
2022/06/01 14:18:32 0.68 b/s,passed:00:14:24,eta:21:34:58,data_time:0.14,lr:0.0004,[2/100:50/500] ==> loss:15.02
2022/06/01 14:18:40 0.69 b/s,passed:00:14:31,eta:21:34:14,data_time:0.14,lr:0.0004,[2/100:55/500] ==> loss:32.48
2022/06/01 14:18:47 0.65 b/s,passed:00:14:39,eta:21:33:42,data_time:0.18,lr:0.0004,[2/100:60/500] ==> loss:9.96
2022/06/01 14:18:55 0.65 b/s,passed:00:14:46,eta:21:33:16,data_time:0.18,lr:0.0004,[2/100:65/500] ==> loss:14.69
2022/06/01 14:19:02 0.68 b/s,passed:00:14:54,eta:21:32:35,data_time:0.13,lr:0.0004,[2/100:70/500] ==> loss:nan
2022/06/01 14:19:10 0.65 b/s,passed:00:15:01,eta:21:31:55,data_time:0.19,lr:0.0004,[2/100:75/500] ==> loss:nan
2022/06/01 14:19:17 0.68 b/s,passed:00:15:09,eta:21:31:14,data_time:0.15,lr:0.0004,[2/100:80/500] ==> loss:nan
2022/06/01 14:19:25 0.67 b/s,passed:00:15:16,eta:21:30:34,data_time:0.15,lr:0.0004,[2/100:85/500] ==> loss:nan
2022/06/01 14:19:32 0.67 b/s,passed:00:15:24,eta:21:30:08,data_time:0.17,lr:0.0004,[2/100:90/500] ==> loss:nan
2022/06/01 14:19:40 0.69 b/s,passed:00:15:31,eta:21:29:28,data_time:0.14,lr:0.0004,[2/100:95/500] ==> loss:nan
2022/06/01 14:19:47 0.65 b/s,passed:00:15:39,eta:21:28:54,data_time:0.17,lr:0.0004,[2/100:100/500] ==> loss:nan
2022/06/01 14:19:55 0.68 b/s,passed:00:15:46,eta:21:28:11,data_time:0.14,lr:0.0004,[2/100:105/500] ==> loss:nan
2022/06/01 14:20:02 0.65 b/s,passed:00:15:54,eta:21:27:38,data_time:0.17,lr:0.0004,[2/100:110/500] ==> loss:nan
2022/06/01 14:20:10 0.64 b/s,passed:00:16:01,eta:21:27:04,data_time:0.2,lr:0.0004,[2/100:115/500] ==> loss:nan
2022/06/01 14:20:17 0.67 b/s,passed:00:16:09,eta:21:26:28,data_time:0.16,lr:0.0004,[2/100:120/500] ==> loss:nan
2022/06/01 14:20:25 0.66 b/s,passed:00:16:16,eta:21:26:04,data_time:0.17,lr:0.0004,[2/100:125/500] ==> loss:nan
2022/06/01 14:20:32 0.68 b/s,passed:00:16:24,eta:21:25:20,data_time:0.15,lr:0.0004,[2/100:130/500] ==> loss:nan
```

Hello! These are my training logs. Why does the loss become NaN partway through the epoch?
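For anyone hitting the same thing, one common cause of NaN losses in stereo training is invalid values (NaN/Inf or out-of-range disparities) in the ground truth. A minimal sketch for inspecting a loaded disparity map before training; `check_disparity` and `max_disp` are illustrative names, not part of this repo:

```python
import numpy as np

def check_disparity(disp: np.ndarray, max_disp: float = 256.0) -> dict:
    """Summarize values that commonly lead to NaN losses in stereo training.

    Illustrative helper only; not part of the CREStereo code base.
    """
    return {
        "has_nan": bool(np.isnan(disp).any()),
        "has_inf": bool(np.isinf(disp).any()),
        "min": float(np.nanmin(disp)),
        "max": float(np.nanmax(disp)),
        "num_out_of_range": int((disp > max_disp).sum()),
    }

# Example: disp = <your loaded float32 disparity map>
# print(check_disparity(disp))
```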

WenjiaR commented 2 years ago

Hi, have you solved this issue? I encountered this problem when training on Sceneflow.

jim88481 commented 2 years ago

> Hi, have you solved this issue? I encountered this problem when training on Sceneflow.

Yes, I solved it by switching to https://github.com/ibaiGorordo/CREStereo-Pytorch, and there is not much difference in effectiveness between the two implementations after 500 epochs.

jim88481 commented 2 years ago

> Hi, have you solved this issue? I encountered this problem when training on Sceneflow.

Besides, I think the NaN is caused by running out of memory. You could try to address that.
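If skipping bad updates helps, here is a minimal sketch of a guarded training step in PyTorch (matching the PyTorch port mentioned above); `safe_step` and `max_norm` are illustrative and not part of either repository:

```python
import torch

def safe_step(loss: torch.Tensor, model: torch.nn.Module,
              optimizer: torch.optim.Optimizer, max_norm: float = 1.0) -> bool:
    """Skip the update when the loss is non-finite; otherwise clip gradients and step.

    Illustrative sketch only; not from CREStereo or CREStereo-Pytorch.
    """
    optimizer.zero_grad(set_to_none=True)
    if not torch.isfinite(loss):
        return False  # skip this batch instead of propagating NaN gradients
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return True
```

Skipping a handful of non-finite batches keeps training alive, but if NaN appears on every batch the underlying data or memory issue still needs to be fixed.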

WenjiaR commented 2 years ago

> > Hi, have you solved this issue? I encountered this problem when training on Sceneflow.
>
> Yes, I solved it by switching to https://github.com/ibaiGorordo/CREStereo-Pytorch, and there is not much difference in effectiveness between the two implementations after 500 epochs.

Thank you for your reply! I will try it that way.

deephog commented 2 years ago

> > Hi, have you solved this issue? I encountered this problem when training on Sceneflow.
>
> Besides, I think the NaN is caused by running out of memory. You could try to address that.

Were you able to reproduce their performance with the PyTorch implementation? I tried the repo you mentioned, but I'm still getting NaN losses after some epochs. If you were able to reproduce it, which datasets did you use? Please specify the sub-datasets, such as "monkaa", and whether you used the "clean" or "final" versions. Thanks!

jim88481 commented 2 years ago

@deephog I used https://github.com/ibaiGorordo/CREStereo-Pytorch and solved the NaN issue. I used the dataset from the Baidu Web disk provided by the author (download from BaiduCloud here (extraction code: aa3g) and extract the tar files manually). This is the result after 200 epochs: [image "1"]

deephog commented 2 years ago

> @deephog I used https://github.com/ibaiGorordo/CREStereo-Pytorch and solved the NaN issue. I used the dataset from the Baidu Web disk provided by the author (download from BaiduCloud here (extraction code: aa3g) and extract the tar files manually). This is the result after 200 epochs: [image "1"]

Did you compare your final results to the pre-trained model they provide? I can get similar results, but I can never get results as good as theirs.

jim88481 commented 2 years ago

> > @deephog I used https://github.com/ibaiGorordo/CREStereo-Pytorch and solved the NaN issue. I used the dataset from the Baidu Web disk provided by the author (download from BaiduCloud here (extraction code: aa3g) and extract the tar files manually). This is the result after 200 epochs: [image "1"]
>
> Did you compare your final results to the pre-trained model they provide? I can get similar results, but I can never get results as good as theirs.

I'm sorry, this was something I did a few months ago; I only remember that after 500 epochs the results were more or less adequate for my needs. In general, though, the author's pre-trained model is the best, and it is normal that you cannot match the author's results.