
Attention

DON'T STAR THIS REPO ANYMORE, IT'S A BAD IMPLEMENTATION

luna16_multi_size_3dcnn

An implementation of the paper "Multi-level Contextual 3D CNNs for False Positive Reduction in Pulmonary Nodule Detection"

Details about the paper can be found at luna16 3DCNN.

0 requirements

1 data

Or you can download the data from the official LUNA16 website (https://luna16.grand-challenge.org/).

1.1 data overview

The original LUNA16 data consists of ten subsets of CT scans (subset0–subset9, stored as MetaImage .mhd/.raw pairs) plus the annotation files annotations.csv and candidates_V2.csv described below.

As you can see, the positive samples (annotations.csv) and the false positive candidates (candidates_V2.csv) are already annotated. All we need to do is extract them from the medical image format (CT volumes) into cubes/images; there is no need to label positive/negative data ourselves.
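The CT scans are MetaImage volumes, which SimpleITK reads directly. A minimal sketch of loading one scan, assuming the data has been unpacked into subset directories (the path below is hypothetical):

```python
# Minimal sketch: load one LUNA16 scan with SimpleITK.
# The path is hypothetical; adjust it to your unpacked subset directories.
import SimpleITK as sitk
import numpy as np

mhd_path = "subset0/1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860.mhd"

itk_img = sitk.ReadImage(mhd_path)
img_array = sitk.GetArrayFromImage(itk_img)  # numpy array indexed (z, y, x), Hounsfield units
origin = np.array(itk_img.GetOrigin())       # world coordinates (mm) of voxel (0, 0, 0), ordered (x, y, z)
spacing = np.array(itk_img.GetSpacing())     # voxel size in mm, ordered (x, y, z)

print(img_array.shape, origin, spacing)
```

The origin and spacing are what connect the CSV world coordinates (in mm) to voxel indices in the array.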

annotations.csv

| seriesuid | coordX | coordY | coordZ | diameter_mm |
|---|---|---|---|---|
| 1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860 | -128.6994211 | -175.3192718 | -298.3875064 | 5.651470635 |
| 1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860 | 103.7836509 | -211.9251487 | -227.12125 | 4.224708481 |
| 1.3.6.1.4.1.14519.5.2.1.6279.6001.100398138793540579077826395208 | 69.63901724 | -140.9445859 | 876.3744957 | 5.786347814 |
| 1.3.6.1.4.1.14519.5.2.1.6279.6001.100621383016233746780170740405 | -24.0138242 | 192.1024053 | -391.0812764 | 8.143261683 |

The units of coordX, coordY, coordZ, and diameter_mm are millimeters, and there are 1187 lines in this CSV file.

candidates.csv

| seriesuid | coordX | coordY | coordZ | class |
|---|---|---|---|---|
| 1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860 | 68.42 | -74.48 | -288.7 | 0 |
| 1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860 | -95.20936148 | -91.80940617 | -377.4263503 | 0 |
| 1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860 | -24.76675476 | -120.3792939 | -273.3615387 | 0 |
| 1.3.6.1.4.1.14519.5.2.1.6279.6001.100225287222365663678666836860 | -63.08 | -65.74 | -344.24 | 0 |

The class column indicates positive (1) or negative (0). There are 754976 lines in this CSV file.

The positive/negative sample ratio is 1187 to 754976, nearly 1:636. Data augmentation is essential.
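A quick way to check these counts yourself with pandas (the CSVFILES path is an assumption about where you unpacked the archive):

```python
# Quick sanity check of the class balance; paths are assumptions.
import pandas as pd

annotations = pd.read_csv("CSVFILES/annotations.csv")
candidates = pd.read_csv("CSVFILES/candidates.csv")

print(len(annotations))                    # 1187 annotated nodules
print(candidates["class"].value_counts())  # negatives (0) vastly outnumber positives (1)
print(len(candidates) / len(annotations))  # roughly 636
```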

1.2 how to prepare data

We have the center coordinates and diameter of every true positive nodule, plus a huge number of false positive candidates (center coordinates without diameter). It's rather clear what we need to do: extract cubes around each candidate at multiple scales.

The paper suggests three scales: 20×20×6, 30×30×10, and 40×40×26 voxels.

Since positive samples are annotated with diameters while negative ones are not, we use a simple, brute-force method: extract fixed-size cubes centered on every candidate (both real and fake nodules), as sketched below.
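A minimal sketch of that extraction, reusing img_array, origin, and spacing from the SimpleITK snippet above; the (z, y, x) indexing convention and the 40×40×26 cube size are the assumptions here:

```python
# Minimal sketch: convert a world coordinate (mm) to a voxel index and crop
# a fixed-size cube around it. Reuses img_array/origin/spacing from above.
import numpy as np

def world_to_voxel(world_xyz, origin, spacing):
    """Convert world coordinates in mm (x, y, z) to integer voxel indices (x, y, z)."""
    return np.rint((np.asarray(world_xyz) - origin) / spacing).astype(int)

def extract_cube(img_array, center_voxel_xyz, size_xyz=(40, 40, 26)):
    """Crop a size_xyz cube centered on a voxel; img_array is indexed (z, y, x)."""
    cx, cy, cz = center_voxel_xyz
    sx, sy, sz = size_xyz
    return img_array[cz - sz // 2: cz + sz - sz // 2,
                     cy - sy // 2: cy + sy - sy // 2,
                     cx - sx // 2: cx + sx - sx // 2]

# First row of annotations.csv above: a nodule center in world coordinates.
v_center = world_to_voxel([-128.6994211, -175.3192718, -298.3875064], origin, spacing)
cube = extract_cube(img_array, v_center)  # shape (26, 40, 40) when fully inside the volume
```

Cropping the same center at 20×20×6 and 30×30×10 gives the other two scales.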

There is a better way to prepare positive samples. An idea borrowed from object detection and localization methods such as SSD or Faster R-CNN is anchor/bounding box generation: slide cubes across the whole 3D CT volume and keep those whose IoU with an annotated nodule exceeds a threshold (e.g. 0.7, as in Faster R-CNN) as positive samples. This idea comes from a teacher at Shanghai Jiao Tong University. I'll implement it soon.
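That idea is not implemented here yet; as a sketch of its key ingredient, here is a plain 3D IoU between two axis-aligned boxes given as center plus size (all names are illustrative):

```python
# Sketch of the anchor-style idea: 3D IoU between two axis-aligned boxes,
# each given as (cx, cy, cz, w, h, d) in voxels. Not part of the current repo.
def iou_3d(box_a, box_b):
    """Intersection-over-union of two axis-aligned 3D boxes (center + size)."""
    def bounds(box):
        cx, cy, cz, w, h, d = box
        return (cx - w / 2, cx + w / 2,
                cy - h / 2, cy + h / 2,
                cz - d / 2, cz + d / 2)

    ax0, ax1, ay0, ay1, az0, az1 = bounds(box_a)
    bx0, bx1, by0, by1, bz0, bz1 = bounds(box_b)

    # Overlap along each axis, clamped at zero when the boxes are disjoint.
    ix = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    iy = max(0.0, min(ay1, by1) - max(ay0, by0))
    iz = max(0.0, min(az1, bz1) - max(az0, bz0))
    inter = ix * iy * iz

    vol_a = (ax1 - ax0) * (ay1 - ay0) * (az1 - az0)
    vol_b = (bx1 - bx0) * (by1 - by0) * (bz1 - bz0)
    return inter / (vol_a + vol_b - inter)

# Sliding cubes whose IoU with an annotated nodule exceeds 0.7 would be kept as positives.
```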

1.3 Data augmentation
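The augmentation details aren't documented here; below is a minimal sketch of the usual cube augmentations (axis flips and 90° in-plane rotations), assuming cubes are NumPy arrays indexed (z, y, x). The repo's actual augmentation may differ.

```python
# Minimal sketch of typical 3D cube augmentations: random flips and
# 90-degree rotations in the axial plane. Assumes (z, y, x) indexing.
import numpy as np

def augment_cube(cube, rng=np.random):
    """Return a randomly flipped / rotated copy of a 3D cube."""
    if rng.rand() < 0.5:
        cube = cube[:, :, ::-1]               # flip left-right (x axis)
    if rng.rand() < 0.5:
        cube = cube[:, ::-1, :]               # flip front-back (y axis)
    k = rng.randint(4)
    cube = np.rot90(cube, k, axes=(1, 2))     # rotate in the axial (y, x) plane
    return np.ascontiguousarray(cube)
```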

2 processing steps

First, run data_prepare.py to extract cubes (both real nodules and fake ones) from the raw CT files. This may take hours; the output of this step is a directory of extracted cube files.

The total size of those files is around 100 GB, and extraction takes about one night on my PC (16 GB RAM, i5), so make sure you have enough free disk space. You will see some ValueErrors like:

```
Traceback (most recent call last):
  File "H:/workspace/luna16_multi_size_3dcnn/data_prepare.py", line 142, in extract_fake_cubic_from_mhd
    int(v_center[2] - 13):int(v_center[2] + 13)]
ValueError: could not broadcast input array from shape (40,40,25) into shape (40,40,26)
```

It's OK to skip these candidates, because not all false positives are needed; as the CSV counts above show, negative candidates vastly outnumber positives. The error occurs when a candidate lies so close to the scan border that the crop comes out smaller than the target shape.
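If you do want to keep border candidates, a boundary-safe crop avoids the error. A sketch that pads the volume first (padding with -1000 HU, i.e. air, is an assumption; the function name is mine):

```python
# Sketch of a boundary-safe version of extract_cube: pad first, then crop,
# so cubes near the scan border still come out with the full target shape.
import numpy as np

def extract_cube_padded(img_array, center_voxel_xyz, size_xyz=(40, 40, 26), pad_value=-1000):
    cx, cy, cz = center_voxel_xyz
    sx, sy, sz = size_xyz
    half = (sz // 2, sy // 2, sx // 2)   # (z, y, x) half-sizes
    sizes = (sz, sy, sx)
    padded = np.pad(img_array,
                    [(h, s - h) for h, s in zip(half, sizes)],
                    mode="constant", constant_values=pad_value)
    # After padding, original index i maps to i + half, so the crop starts at the center itself.
    return padded[cz: cz + sz, cy: cy + sy, cx: cx + sx]
```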

Then run main.py to train the model; inference runs afterwards. This step is rather slow because of the huge amount of data.
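For reference, a minimal tf.keras sketch of a single 3D CNN branch in the spirit of the paper (the 40×40×26 receptive field). This is an illustration only, not the network defined in main.py; the paper fuses three such branches, one per scale.

```python
# Illustrative sketch of one 3D CNN branch for nodule / non-nodule
# classification. Layer sizes are assumptions, not the repo's model.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_branch(input_shape=(26, 40, 40, 1)):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv3D(64, kernel_size=5, activation="relu"),
        layers.MaxPooling3D(pool_size=2),
        layers.Conv3D(64, kernel_size=5, activation="relu"),
        layers.Flatten(),
        layers.Dense(250, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(2, activation="softmax"),  # nodule vs non-nodule
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```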