Closed · lxg15066629402 closed this issue 5 years ago
I see your loss is "nan". Try decreasing the learning rate, e.g. 0.00001. What batch size and how many GPUs are you using?
I set batch_size=4 and use 2 GPUs (num_workers=2).
They seem fine. Just try decreasing the learning rate, and make sure the loss value is not nan.
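A minimal way to enforce that check during training might look like the following sketch. `check_loss` is a hypothetical helper for illustration, not part of this repo:

```python
import math

def check_loss(loss_value, step):
    """Abort training as soon as the loss becomes nan or inf.

    Hypothetical helper: `loss_value` is the scalar loss (e.g. from
    loss.item() in PyTorch), `step` is the current iteration index.
    """
    if not math.isfinite(loss_value):
        raise RuntimeError(
            f"loss became {loss_value} at step {step}; "
            "try a lower learning rate or inspect the input data"
        )
    return loss_value
```

Failing fast like this makes it obvious whether the loss is nan from the very first iteration or diverges later, which is useful for debugging.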
ok, thank you!!
I tried decreasing the learning rate, but the loss value is still nan.
Is the loss nan at the first iteration or the first epoch? Or does the loss have a normal value at the beginning? Can you give me your environment details (PyTorch and Python versions, arguments, OS, and which GPU you use)? I will try to reproduce your problem.
Thank you for the reply! My environment is Python == 3.6, PyTorch == 1.1.0, and the loss is nan from the start of training. Running command:

```
python train_res_unet.py -e 100 -b 4 -l 0.00001 -g 2 -s 512 512 --data "data" --logdir "run/ResUNet" --eval_intvl 5 --vis_intvl 0 --num_workers 2
```

Thank you!!
I tried to reproduce your problem. I tested the code on my lab's 3 servers (GTX 1060, GTX 1080Ti, RTX 2080Ti) with different CUDA versions (9.0, 9.1, 10.0) and PyTorch versions (1.1, 1.2), but I got the correct result in every environment. I cannot find the problem.
Thanks. On my lab's server (GTX 1080, CUDA 8.0), I re-tested and found that after the data is fed into the network, it becomes nan. I think there is a problem with how the network processes the data. Thank you for the reply.
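If the values turn into nan somewhere inside the forward pass, one way to narrow it down is to register a forward hook on every module and report the first one whose output contains nan. This is only a sketch against a toy model, not the ResUNet in this repo:

```python
import torch
import torch.nn as nn

def find_first_nan(model):
    """Register forward hooks that raise as soon as any module's
    output contains nan, naming the offending module.

    Sketch only: works on any nn.Module; the module names printed
    come from model.named_modules().
    """
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor) and torch.isnan(output).any():
                print(f"nan first appears in the output of module: {name}")
                raise RuntimeError(name)
        return hook

    for name, module in model.named_modules():
        if name:  # skip the root module itself
            module.register_forward_hook(make_hook(name))
```

Running one batch through the hooked model then points directly at the layer where nan first appears (often a normalization or division step), instead of only seeing nan in the final loss.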
The training output is below, but I can't figure out what the issue is:

```
------------- Epoch 1/100 --------------
Learning rate: 0.0001
train: 100%|###########################################################################################| 8128/8128 [1:33:07<00:00, 1.49it/s, loss=nan]
Best epoch: 1
Best score: 0.00000
------------- Epoch 2/100 --------------
Learning rate: 0.0001
train: 100%|###########################################################################################| 8128/8128 [1:31:55<00:00, 1.48it/s, loss=nan]
Best epoch: 1
Best score: 0.00000
------------- Epoch 3/100 --------------
Learning rate: 0.0001
train: 100%|###########################################################################################| 8128/8128 [1:32:11<00:00, 1.46it/s, loss=nan]
Best epoch: 1
Best score: 0.00000
------------- Epoch 4/100 --------------
Learning rate: 0.0001
train: 100%|###########################################################################################| 8128/8128 [1:32:33<00:00, 1.47it/s, loss=nan]
Best epoch: 1
Best score: 0.00000
------------- Epoch 5/100 --------------
Learning rate: 0.0001
train: 100%|###########################################################################################| 8128/8128 [1:32:54<00:00, 1.46it/s, loss=nan]
eval/train: 100%|####################################################################################################| 147/147 [56:08<00:00, 13.30s/it]
train/dc_global_0: 0.99566
train/dc_global_1: 0.00000
train/dc_per_case_0: 0.99520
train/dc_per_case_1: 0.00000
eval/valid: 100%|######################################################################################################| 63/63 [22:02<00:00, 16.13s/it]
valid/dc_global_0: 0.99598
valid/dc_global_1: 0.00000
valid/dc_per_case_0: 0.99486
valid/dc_per_case_1: 0.00000
Train data score: 0.00000
Valid data score: 0.00000
Best epoch: 1
Best score: 0.00000
```
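Since the loss is nan from epoch 1 and the Dice score for the kidney class stays at 0, it may also be worth auditing the preprocessed input arrays before they ever reach the network. A NumPy sketch (the helper name `audit_array` is made up for illustration, not part of this repo):

```python
import numpy as np

def audit_array(name, arr):
    """Report nan/inf counts for a loaded CT slice or label array.

    Returns True if the array is fully finite, False otherwise.
    """
    nan_count = int(np.isnan(arr).sum())
    inf_count = int(np.isinf(arr).sum())
    print(f"{name}: shape={arr.shape} nan={nan_count} inf={inf_count}")
    return nan_count == 0 and inf_count == 0
```

Running this over each case in the dataset distinguishes between corrupted input data (nan already present in the volumes) and nan introduced by the network itself.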