Only return 1 class when training COVIDNet with COVIDx4 dataset

gwang-kim commented 3 years ago

Description

The model returns only one class when fine-tuning COVIDNet, and the loss explodes.

Steps to Reproduce

I downloaded the COVIDx4 dataset and tried training COVIDNet on it. However, at inference time the model returned only one class and performance was poor. I logged the loss at every step during training, and it exploded to several thousand after one epoch. I tried both training from scratch and fine-tuning. How can I train your model stably?
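To pin down where the loss starts to blow up, a small check over the per-step loss values can help. This is only a minimal sketch: `first_divergence` and its `factor` threshold are hypothetical names, not part of COVID-Net's training script, and the loss values are illustrative rather than taken from a real run.

```python
import math


def first_divergence(losses, factor=10.0):
    """Return the index of the first loss that is non-finite or exceeds
    `factor` times the smallest loss seen so far, or None if none does."""
    best = math.inf
    for step, loss in enumerate(losses):
        # NaN/inf is treated as divergence, as is a sudden jump above the best loss.
        if not math.isfinite(loss) or loss > factor * best:
            return step
        best = min(best, loss)
    return None


# Illustrative values only, not taken from a real run.
losses = [0.9, 0.7, 0.8, 1.2, 15.0, 370.0]
print(first_divergence(losses))
```

Knowing the step at which the jump happens makes it easier to inspect the corresponding batch or checkpoint.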

Expected behavior

The model trains stably.

Actual behavior

The model returned only one class and performance was poor.

Environment

Ubuntu 18.04, tensorflow-gpu 1.15. I also followed requirements.txt.

sabuj7177 commented 3 years ago

Hi @gwang-kim, have you found a solution to this issue? I am facing a similar problem.

This is the training log for the first few epochs:

Output: ./output/COVIDNet-lr0.0002
Dataset length 15952 13794 2158
Saved baseline checkpoint
Baseline eval:
[[194.   6.]
 [  9. 191.]]
Sens Negative: 0.970, Positive: 0.955
PPV Negative: 0.956, Positive: 0.970
Training started
1725/1725 [==============================] - 2445s 1s/step
Epoch: 0001 Minibatch loss= 370.629089355
[[200.   0.]
 [200.   0.]]
Sens Negative: 1.000, Positive: 0.000
PPV Negative: 0.500, Positive: 0.000
Saving checkpoint at epoch 1
1725/1725 [==============================] - 4858s 3s/step
Epoch: 0002 Minibatch loss= 3805.902343750
[[199.   1.]
 [200.   0.]]
Sens Negative: 0.995, Positive: 0.000
PPV Negative: 0.499, Positive: 0.000
Saving checkpoint at epoch 2
1725/1725 [==============================] - 7348s 4s/step
Epoch: 0003 Minibatch loss= 12214.270507812
[[195.   5.]
 [199.   1.]]
Sens Negative: 0.975, Positive: 0.005
PPV Negative: 0.495, Positive: 0.167
Saving checkpoint at epoch 3
1725/1725 [==============================] - 9727s 6s/step
Epoch: 0004 Minibatch loss= 28461.550781250
[[200.   0.]
 [200.   0.]]
Sens Negative: 1.000, Positive: 0.000
PPV Negative: 0.500, Positive: 0.000
Saving checkpoint at epoch 4
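For anyone reading the log, the Sens and PPV lines can be reproduced from the printed confusion matrices. The sketch below assumes rows are true classes and columns are predicted classes, which is consistent with the logged numbers. `per_class_metrics` is a hypothetical helper, not the repository's own evaluation code, and returning 0.0 for an empty column is an assumption chosen to match the logged 0.000.

```python
def per_class_metrics(cm):
    """Return (sensitivity, ppv) lists for a square confusion matrix.

    Assumes cm[i][j] counts samples of true class i predicted as class j.
    """
    n = len(cm)
    sens, ppv = [], []
    for i in range(n):
        row_total = sum(cm[i])                       # samples whose true class is i
        col_total = sum(cm[r][i] for r in range(n))  # samples predicted as class i
        sens.append(cm[i][i] / row_total if row_total else 0.0)
        # Assumption: report 0.0 when a class is never predicted.
        ppv.append(cm[i][i] / col_total if col_total else 0.0)
    return sens, ppv


# Confusion matrices copied from the log: baseline eval and epoch 1.
baseline = [[194, 6], [9, 191]]
epoch1 = [[200, 0], [200, 0]]

for name, cm in (("baseline", baseline), ("epoch 1", epoch1)):
    sens, ppv = per_class_metrics(cm)
    print(name, [f"{v:.3f}" for v in sens], [f"{v:.3f}" for v in ppv])
```

After epoch 1 every test sample falls in the first column, so the model predicts Negative for everything: sensitivity for Positive drops to 0 and PPV for Negative sits at 0.5 on the balanced 200/200 test set.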

This is the command I used (taken from the training instructions):

python train_tf.py \
    --weightspath models/COVIDNet-CXR-2 \
    --metaname model.meta \
    --ckptname model \
    --n_classes 2 \
    --trainfile labels/train_COVIDx8B.txt \
    --testfile labels/test_COVIDx8B.txt \
    --out_tensorname norm_dense_2/Softmax:0 \
    --logit_tensorname norm_dense_2/MatMul:0

Environment: Ubuntu 20.04 LTS, tensorflow-gpu

I built the dataset by following the dataset generation instructions.

Thank you.

gwang-kim commented 3 years ago

Hi @sabuj7177, that is almost the same situation as mine. Unfortunately, I haven't found a solution. I tried adjusting hyperparameters such as the learning rate, but it didn't help.

If you solve the problem, please let me know!

Thank you

sabuj7177 commented 3 years ago

Hi @lindawangg @haydengunraj, could you suggest a workaround for this issue, or point out what I am doing wrong?

SmallFan7 commented 2 years ago

I have the same problem as you. Has it been solved yet?

gwang-kim commented 2 years ago

@SmallFan7 Not yet. I think it's just a limitation of this work.

lindawangg / COVID-Net