Open gwang-kim opened 3 years ago
Hi @gwang-kim, have you found a solution to this issue? I am facing a similar issue.
This is the training log for the first few epochs:
Output: ./output/COVIDNet-lr0.0002
Dataset length 15952 13794 2158
Saved baseline checkpoint
Baseline eval:
[[194.   6.]
 [  9. 191.]]
Sens Negative: 0.970, Positive: 0.955
PPV Negative: 0.956, Positive: 0.970
Training started
1725/1725 [==============================] - 2445s 1s/step
Epoch: 0001 Minibatch loss= 370.629089355
[[200.   0.]
 [200.   0.]]
Sens Negative: 1.000, Positive: 0.000
PPV Negative: 0.500, Positive: 0.000
Saving checkpoint at epoch 1
1725/1725 [==============================] - 4858s 3s/step
Epoch: 0002 Minibatch loss= 3805.902343750
[[199.   1.]
 [200.   0.]]
Sens Negative: 0.995, Positive: 0.000
PPV Negative: 0.499, Positive: 0.000
Saving checkpoint at epoch 2
1725/1725 [==============================] - 7348s 4s/step
Epoch: 0003 Minibatch loss= 12214.270507812
[[195.   5.]
 [199.   1.]]
Sens Negative: 0.975, Positive: 0.005
PPV Negative: 0.495, Positive: 0.167
Saving checkpoint at epoch 3
1725/1725 [==============================] - 9727s 6s/step
Epoch: 0004 Minibatch loss= 28461.550781250
[[200.   0.]
 [200.   0.]]
Sens Negative: 1.000, Positive: 0.000
PPV Negative: 0.500, Positive: 0.000
Saving checkpoint at epoch 4
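For anyone reading the log, here is a minimal sketch of how the Sens and PPV lines relate to each confusion matrix. It assumes rows are true labels and columns are predictions (Negative first, then Positive), which is consistent with the numbers printed above. The function name `per_class_metrics` is hypothetical and is not taken from the repository.

```python
def per_class_metrics(cm):
    """Return (sensitivity, ppv) lists, one entry per class.

    Assumes cm[i][j] counts samples with true class i predicted as class j.
    A class with no samples (or no predictions) gets 0.0, matching the
    0.000 values that appear in the log when a column is all zeros.
    """
    n = len(cm)
    sens, ppv = [], []
    for i in range(n):
        row_total = sum(cm[i])                       # all samples whose true class is i
        col_total = sum(cm[r][i] for r in range(n))  # all samples predicted as class i
        sens.append(cm[i][i] / row_total if row_total else 0.0)
        ppv.append(cm[i][i] / col_total if col_total else 0.0)
    return sens, ppv


if __name__ == "__main__":
    baseline = [[194, 6], [9, 191]]    # "Baseline eval" matrix from the log
    collapsed = [[200, 0], [200, 0]]   # epoch 1 matrix from the log
    for name, cm in (("baseline", baseline), ("epoch 1", collapsed)):
        sens, ppv = per_class_metrics(cm)
        print(name,
              "Sens", ", ".join(f"{v:.3f}" for v in sens),
              "PPV", ", ".join(f"{v:.3f}" for v in ppv))
```

The epoch 1 matrix has an empty Positive column, so every test image is being predicted as Negative. That is why Positive sensitivity drops to 0 while Negative PPV sits at 0.500.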
This is the command I used (it is from the training instructions):
python train_tf.py \
    --weightspath models/COVIDNet-CXR-2 \
    --metaname model.meta \
    --ckptname model \
    --n_classes 2 \
    --trainfile labels/train_COVIDx8B.txt \
    --testfile labels/test_COVIDx8B.txt \
    --out_tensorname norm_dense_2/Softmax:0 \
    --logit_tensorname norm_dense_2/MatMul:0
Environment: Ubuntu 20.04 LTS, tensorflow-gpu
I built the dataset by following the dataset generation instructions.
Thank you.
Hi @sabuj7177, oh, that's almost the same situation as mine. Unfortunately, I didn't find any solution. I tuned hyperparameters such as the learning rate, but it didn't help.
If you solve the problem, please let me know!
Thank you
Hi @lindawangg @haydengunraj, can you please suggest a workaround for this issue, or point out what I am doing wrong?
I have the same situation as you. Is the problem solved now?
@SmallFan7 Not yet. I think it's just a limitation of this work.
Description
The model returns only one class when fine-tuning COVIDNet, and the loss explodes.
Steps to Reproduce
I downloaded the COVIDx4 dataset and tried training COVIDNet on it. However, at inference the model returned only one class and the performance was poor. I logged the loss at every step during training, and it exploded to several thousand after 1 epoch. I tried both training from scratch and fine-tuning. How can I train your model stably?
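To pin down when the loss starts to blow up, a small check over the logged loss values can help. This is only a sketch under assumptions: `first_divergence` and its `factor` threshold are hypothetical and not part of the COVID-Net code, and the sample values are the per-epoch minibatch losses another user posted in this thread.

```python
import math


def first_divergence(losses, factor=5.0):
    """Return the index of the first loss that looks divergent, or None.

    A loss counts as divergent if it is NaN/inf, or if it exceeds
    `factor` times the lowest loss seen so far. `factor` is an
    arbitrary illustrative threshold, not a recommended value.
    """
    best = None
    for i, loss in enumerate(losses):
        if not math.isfinite(loss):
            return i
        if best is not None and loss > factor * best:
            return i
        best = loss if best is None else min(best, loss)
    return None


if __name__ == "__main__":
    # Minibatch losses at the end of epochs 1-4, as quoted in this thread.
    logged = [370.629089355, 3805.902343750, 12214.270507812, 28461.550781250]
    print("first divergent index:", first_divergence(logged))
```

Running the same check on per-step losses instead of per-epoch values would narrow down the step at which training becomes unstable. That may help tell a learning-rate problem apart from a data or label problem.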
Expected behavior
The model trains stably.
Actual behavior
The model returned only one class and the performance was poor.
Environment
Ubuntu 18.04, tensorflow-gpu 1.15. I followed requirements.txt.