Issue about the negative data and label

juliandewit / kaggle_ndsb2017

Kaggle datascience bowl 2017

MIT License


Issue about the negative data and label #8

Closed. ypflll closed this issue 7 years ago

ypflll commented 7 years ago

Hi Julian, I am trying to build a nodule detector based on your work. Thanks very much for sharing it. May I ask a few questions?

You use several kinds of training data: labels from LIDC, candidates v2 from LUNA16, LUNA16 false positives, NDSB, and non-lung tissue edges. So at training time, are all of them except the non-lung tissue edges positive samples? And is the label YES (the cube contains a nodule) for the positive samples and NO for the non-lung tissue edges?
Another question: when predicting, a 64x64x64 cube is fed to the net. Is the result whether the cube contains a nodule, plus the probability? Any information is welcome!

juliandewit commented 7 years ago

Hello. The candidatesv2 entries are also negative examples (there are around 400,000 negatives there). Basically, that is the most important source of negatives. The edge examples only let the network know that non-lung tissue is also not a lung nodule. Another (small) source of negatives is the false positives that were predicted after one round of training on LUNA16.

The network learns 2 things at once. 1: Lung nodule y/n (non-lung tissue should always be n). 2: Malignancy (0 if not a lung nodule, 0.1-25 if lung nodule).
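To make the two targets concrete, here is a minimal sketch of how a sample could be mapped to a (nodule y/n, malignancy) pair. This is an illustration only, not the code from the repository: the source names ("candidates_v2", "edge", "luna16_false_positive") and the make_label helper are hypothetical, chosen to mirror the negative sources described above.

```python
from typing import NamedTuple


class CubeLabel(NamedTuple):
    is_nodule: int      # 1 = lung nodule, 0 = not a nodule
    malignancy: float   # 0 for negatives, a positive score for nodules


# Hypothetical source names for the negative samples described in this thread.
NEGATIVE_SOURCES = {"candidates_v2", "edge", "luna16_false_positive"}


def make_label(source: str, malignancy: float = 0.0) -> CubeLabel:
    """Map a sample's source to the two training targets."""
    if source in NEGATIVE_SOURCES:
        # Negatives (including non-lung tissue) are always "n" with malignancy 0.
        return CubeLabel(is_nodule=0, malignancy=0.0)
    if malignancy <= 0:
        raise ValueError("nodule samples need a positive malignancy score")
    return CubeLabel(is_nodule=1, malignancy=malignancy)


print(make_label("edge"))        # a negative sample
print(make_label("lidc", 9.0))   # a nodule with an example malignancy score
```

Keeping malignancy at 0 for every negative means a single network output can serve both as a ranking score and as a sanity check against the y/n output.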

I train/predict 32x32x32 cubes. The prediction is nodule Y/N. If Yes, then I also look at the malignancy. Malignancy is the only thing I work with for the final prediction.
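A rough sketch of the cube handling described above, using NumPy. It is not taken from the repository: extract_cube, final_score, the zero padding at the volume borders, and the 0.5 threshold are all assumptions made for illustration.

```python
from typing import Tuple

import numpy as np

CUBE_SIZE = 32  # cube edge length mentioned in this thread


def extract_cube(volume: np.ndarray, center: Tuple[int, int, int],
                 size: int = CUBE_SIZE) -> np.ndarray:
    """Crop a size^3 cube around center (z, y, x), zero-padding at the borders."""
    half = size // 2
    # Pad every axis by `half` so cubes near the border keep the full shape.
    padded = np.pad(volume, half, mode="constant", constant_values=0)
    z, y, x = center
    # Original index c sits at c + half in the padded volume, so the cube
    # starting at padded index c has the requested voxel at position `half`.
    return padded[z:z + size, y:y + size, x:x + size]


def final_score(nodule_prob: float, malignancy: float,
                threshold: float = 0.5) -> float:
    """Use malignancy only when the cube is judged to be a nodule (assumed threshold)."""
    return malignancy if nodule_prob >= threshold else 0.0


volume = np.random.rand(40, 40, 40).astype(np.float32)
cube = extract_cube(volume, (0, 20, 39))
print(cube.shape)
```

Padding first keeps the slicing logic free of boundary checks, at the cost of one extra copy of the volume.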

I hope that makes things clearer. It's quite a complex solution with all the different label sources.

ypflll commented 7 years ago

Quite clear, and really complicated and refined work.

But where does candidates v2 come from? In step1_preprocess_luna16.py, it seems that you generate your negative samples from two files: lidc.xml and annotation_excluded.csv (is it candidate.csv?). Where are they from? If I don't have such files in my case, should I manually cut random lung-tissue cubes (ones that do not contain a nodule) as negative samples?

juliandewit commented 7 years ago

In the resources folder there is a link to "resources.rar" in the readme.md. This file contains all the data you need and even more.

In the resources.rar there is a folder "luna16_annotations". In that folder there is candidatesv2.csv. This file is directly taken from the LUNA16 competition. Look here for more: LUNA16 data
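For anyone who wants to pull the negatives out of that file, a minimal sketch with the standard csv module follows. The column names (seriesuid, coordX, coordY, coordZ, class) are assumed from the LUNA16 candidates format and should be checked against your copy of the file; load_negative_candidates is a hypothetical helper, and the two rows below are made-up sample data.

```python
import csv
import io


def load_negative_candidates(csv_file):
    """Return (seriesuid, x, y, z) for every candidate whose class is 0."""
    negatives = []
    for row in csv.DictReader(csv_file):
        if int(row["class"]) == 0:  # 0 = not a nodule (assumed encoding)
            negatives.append((
                row["seriesuid"],
                float(row["coordX"]),
                float(row["coordY"]),
                float(row["coordZ"]),
            ))
    return negatives


# Made-up rows standing in for candidatesv2.csv.
demo_rows = io.StringIO(
    "seriesuid,coordX,coordY,coordZ,class\n"
    "scan-a,-56.08,-67.85,-311.92,0\n"
    "scan-a,53.21,-244.41,-245.17,1\n"
)
print(load_negative_candidates(demo_rows))
```

With the real file, pass an open file handle instead of the StringIO object, for example open(path, newline="").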

ypflll commented 7 years ago

Got it. I am competing in Tianchi (a competition held by Alibaba, China). In my case, only nodule information was given. It seems that I first need to train a 3D U-Net to generate false positive samples.

juliandewit commented 7 years ago

Hi, I looked at the competition. My Chinese is not too good :S

I do think this approach can be translated to that competition since the #6 team of the datascience bowl is #1 now at your competition.

Good luck!

ypflll commented 7 years ago

Tianchi's English version lacks important information -_-||

4th place in Kaggle is now first place in Tianchi. So we need to do more. Thanks.