Issue on the data weight while training the nodule detector net

juliandewit / kaggle_ndsb2017

Kaggle datascience bowl 2017

MIT License

Issue on the data weight while training the nodule detector net #10

Closed CyranoChen closed 7 years ago

CyranoChen commented 7 years ago

Hello, Julian,

Thank you for offering the code and great work.

I've a question on the data weight of trainset.

I notice that there are several data sources in the trainset such as labels from lidc, v2 from luna16, luna16 false positive, ndsb and non-lung tissue edge. Among them, lidc and nodules of luna16 should be the positive samples, the others are negative samples (the labels for them are 0,0).

But the negative samples far outnumber the positive ones, so the set is unbalanced. What ratio do you use when combining the sources into the trainset? I think 1 (positive) : 1 (false positive) : 2 (non-lung tissue or edge) might make sense, because too many negative samples would dilute the accuracy.

Would you please give me some suggestions on this issue?

CyranoChen commented 7 years ago

I notice the weight column in this table. Could you help me explain this and how to use it?

juliandewit commented 7 years ago

Hello CyranoChen, The csv's are in resources.rar https://retinopaty.blob.core.windows.net:443/ndsb3/resources.rar

In train_nodule_detector.py I upsample positive examples to approximately 1:20 pos:neg. This is still heavily imbalanced, but that is not a problem, for 2 reasons.

1. It's more important that the network "gets" what you are trying to teach it. Once it recognizes pos/neg examples it will not make big mistakes and will therefore take almost no loss. The imbalance is only a problem while the network does not know what it's doing, for instance when you have just started training.
2. The results of the nodule detector are fed to the second phase. If the resulting chances/malignancies are consistently lower (say from 0-0.5 instead of from 0-1), the 2nd-level classifier that needs to predict the chance of cancer within one year will just adjust its parameters to these lower values.
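
For readers who want to try the upsampling idea, here is a minimal sketch. It is not the code from train_nodule_detector.py: the function `upsample_positives`, the `neg_per_pos` parameter and the toy sample lists are all hypothetical, and it only assumes that samples are kept in plain Python lists.

```python
import math
import random


def upsample_positives(positives, negatives, neg_per_pos=20, seed=0):
    """Repeat the positive list until there is roughly one positive
    for every `neg_per_pos` negatives, then shuffle everything.

    Hypothetical helper for illustration only.
    """
    if not positives:
        raise ValueError("need at least one positive sample")
    # Number of positives needed to reach the target ratio.
    target_pos = math.ceil(len(negatives) / neg_per_pos)
    # Whole repetitions of the positive list (never fewer than the original).
    repeats = max(1, math.ceil(target_pos / len(positives)))
    samples = positives * repeats + negatives
    random.Random(seed).shuffle(samples)
    return samples


# Toy data (made up): 10 positives, 2000 negatives, each as (name, label).
positives = [("pos_%d" % i, 1) for i in range(10)]
negatives = [("neg_%d" % i, 0) for i in range(2000)]

train = upsample_positives(positives, negatives)
n_pos = sum(label for _, label in train)
n_neg = len(train) - n_pos
print(n_pos, n_neg)
```

Repeating whole copies of the positive list keeps every positive equally represented; the ratio is only approximate when the counts do not divide evenly.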

CyranoChen commented 7 years ago

Thank you for your reply.

For 1:20 pos:neg, the neg samples are several different types, how about their rate?

juliandewit commented 7 years ago

The rate for negative example types follows the number of records in the different csv files. Some positive example types are boosted a bit. You can see that happening in the code.
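
A small sketch of what "in accordance with the number of records" means in practice: if the negative sources are simply concatenated, each type's share of the negatives equals its share of the records. The source names and counts below are made-up placeholders, not the real csv files or their sizes, and `combine_negatives` is a hypothetical helper.

```python
import random
from collections import Counter


def combine_negatives(sources, seed=0):
    """Concatenate the negatives of every source without rebalancing,
    tagging each item with its source name, then shuffle."""
    combined = [(name, item) for name, items in sources.items() for item in items]
    random.Random(seed).shuffle(combined)
    return combined


# Placeholder sources with made-up record counts.
sources = {
    "false_positives": list(range(300)),
    "non_lung_edge": list(range(600)),
    "manual_negatives": list(range(100)),
}

combined = combine_negatives(sources)
counts = Counter(name for name, _ in combined)
shares = {name: counts[name] / len(combined) for name in sources}
print(shares)
```

To boost one type, as mentioned for some of the positive types, its list could be repeated before concatenation.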

CyranoChen commented 7 years ago

As you said, and according to the code, do you feed all the data in the table below into the training net?

[Image: table of the training data sources; not preserved]

More than 550,000 samples go into the net.

jzhanglab commented 7 years ago

Hi, Julian,

Thank you for your great work and making the code open for the public.

Now I want to estimate the accuracy of the code's nodule predictions on the NDSB3 set, but it seems that there is no such information available to download for this set. Is that true?

Thanks a lot. Your time is greatly appreciated.

Jian

juliandewit commented 7 years ago

Hello, You can download the NDSB3 data from Kaggle. Just register, go to the competition and download the data. Regards, Julian.

CyranoChen commented 7 years ago

Thank you for your reply.

monjoybme commented 5 years ago

@juliandewit do you have the NDSB-2017 label datasets? In your "resources.rar" file I can see an ndsb3_manual_labels file. Are those the actual labels for NDSB-2017 (https://www.kaggle.com/c/data-science-bowl-2017)?

juliandewit commented 5 years ago

I'm afraid the labels were provided by kaggle but have been removed.