jspunda / prostatex

This is the repository of the final project for the course Intelligent Systems in Medical Imaging 2017

New HDF5 dataset #19

jspunda closed this issue 7 years ago

jspunda commented 7 years ago

The current setup of DICOM files plus two .csv files is messy and cumbersome to work with. To extract lesions, we first have to load all DICOM series of interest from disk. Then we have to match them against the first .csv file (prostateX-images-train.csv) to get the lesion info (ijk, spacing, etc.). After that we have to load prostateX-findings-train.csv to obtain the zone and ClinSig information.

We have to put all this information together in order to extract the right lesion, with the right truth label and zone information from the right DICOM series, before we can start training. That's why I decided to restructure the data and combine the DICOM pixel data and the two .csv files into one hdf5 dataset.
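To make the bookkeeping concrete, here is a rough sketch of just the .csv part of that old workflow, using only the standard library. It is not the code from the repository. The key columns ProxID and fid and the DCMSerDescr column are assumptions about the .csv layout, the rows are made-up placeholders, and join_lesion_info is a hypothetical helper name.

```python
import csv
import io

# Made-up stand-ins for the two .csv files. Only ijk, zone and ClinSig are
# named in this thread; ProxID, fid and DCMSerDescr are assumed columns.
IMAGES_CSV = """ProxID,fid,DCMSerDescr,ijk
patient-A,1,ADC,36 72 9
patient-B,1,ADC,40 60 11
"""
FINDINGS_CSV = """ProxID,fid,zone,ClinSig
patient-A,1,PZ,TRUE
patient-B,1,AS,FALSE
"""


def join_lesion_info(images_file, findings_file):
    """Attach zone and ClinSig to every image row, keyed on (ProxID, fid)."""
    findings = {
        (row["ProxID"], row["fid"]): row
        for row in csv.DictReader(findings_file)
    }
    merged = []
    for row in csv.DictReader(images_file):
        finding = findings.get((row["ProxID"], row["fid"]))
        if finding is None:
            continue  # skip image rows that have no matching finding
        merged.append(
            {**row, "zone": finding["zone"], "ClinSig": finding["ClinSig"]}
        )
    return merged


rows = join_lesion_info(io.StringIO(IMAGES_CSV), io.StringIO(FINDINGS_CSV))
for row in rows:
    print(row["ProxID"], row["ijk"], row["zone"], row["ClinSig"])
```

Even with the join done, every row still has to be matched to the right DICOM series on disk, which is the part the combined dataset is meant to remove.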

The code for this can be found in the h5_converter branch. There are, as of now, three files: csv_fix.py, h5_converter.py and h5_query.py. csv_fix.py and h5_converter.py only have to be run once to build the HDF5 set (which I have already done). The way the set is structured can be found in h5_converter.py.

To actually retrieve something from the set we can use h5_query.py. It contains a class that lets us draw DICOM images and their lesion information almost instantly, which is much faster than our old way of reading DICOM files from disk and then loading their pixel data.

Note that there is no actual lesion pixel data in the HDF5 set, just the lesion attributes from the .csv files and the DICOM pixel data. Extracting the lesion pixel data from the DICOM pixel data should be much more straightforward with the query result from h5_query.py.
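As an illustration of what that extraction could look like, here is a minimal sketch that cuts a square patch around a lesion's ijk position. None of this exists in the repository: parse_ijk, patch_bounds and lesion_patch are hypothetical names, the patch size is arbitrary, and the code assumes that ijk is a space-separated "i j k" string and that the pixel data is indexed as volume[k][j][i] (slice, row, column), which would have to be checked against the actual set.

```python
def parse_ijk(ijk):
    """Turn an 'i j k' string into three ints."""
    i, j, k = (int(part) for part in ijk.split())
    return i, j, k


def patch_bounds(center, size, limit):
    """Return (start, stop) of a window of `size` pixels around `center`,
    shifted where needed so that it stays inside [0, limit)."""
    size = min(size, limit)
    start = center - size // 2
    start = max(0, min(start, limit - size))
    return start, start + size


def lesion_patch(volume, ijk, size=16):
    """Cut a size x size patch around a lesion out of its slice.

    Assumes volume[k][j][i] ordering: slice, then row, then column.
    """
    i, j, k = parse_ijk(ijk)
    image = volume[k]
    row_start, row_stop = patch_bounds(j, size, len(image))
    col_start, col_stop = patch_bounds(i, size, len(image[0]))
    return [list(row[col_start:col_stop]) for row in image[row_start:row_stop]]


# Small made-up volume: 3 slices of 10 x 10 pixels, pixel value = row * 10 + column.
volume = [[[r * 10 + c for c in range(10)] for r in range(10)] for _ in range(3)]
patch = lesion_patch(volume, "0 5 1", size=4)
```

Shifting the window at the border, instead of padding, keeps every patch the same size, which is convenient if the patches are later stacked into one training array.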

The new HDF5 dataset can be found at https://jspunda.stackstorage.com/s/0Zy95CMqQzwVaAq. The password for the file is: ismi2017

Whether or not we are actually going to be using this new set of course depends on what everyone thinks, but in my opinion it will simplify and speed things up a lot in the future.

jspunda commented 7 years ago

I'm not sure what you mean. Both original .csv files are opened in 'read' mode. Then the new file is written with a different name: ProstateX-Images-Train-NEW.csv

Maybe I'm missing something...

schelv commented 7 years ago

That is what the description in csv_fix.py says. Then I read the code...

schelv commented 7 years ago

Can you give an example of how the data loading works? For example I want: X,y with X being the 2d tumor slices, and y the label.

jspunda commented 7 years ago

If you mean a cutout of the lesion from a particular slice, there is no functionality for that yet. You could however create a query object, as in the example code in h5_query.py, for instance for all the ADC series. That will give you a subset of the data containing just the full DICOM images and the lesion information.

The print result function in h5_query shows how to traverse this subset. At the very end it extracts one lesion attribute named 'ijk'. To get the label for that lesion, change the attribute name to 'ClinSig'. If you want to access the raw DICOM pixel data, it would look something like result[patient_id][dcm_series_name]['pixel_array'][:]
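Roughly, pairing the pixel data with a label could then look like the sketch below. It only relies on the patient, series, 'pixel_array' nesting described above. The 'lesions' level, the lesion names and the way the attribute is read are assumptions, and collect_labels is a hypothetical helper, so check h5_converter.py for the real layout. Plain dicts stand in for the query result here so the snippet runs without the dataset.

```python
def collect_labels(result, attribute="ClinSig"):
    """Walk result[patient_id][dcm_series_name] and pair each series'
    pixel data with one attribute of each of its lesions."""
    samples = []
    for patient_id, series_group in result.items():
        for dcm_series_name, series in series_group.items():
            # On an h5py dataset, [:] reads the pixel data into memory.
            pixels = series["pixel_array"][:]
            # Assumed layout: one entry per lesion under 'lesions'.
            for lesion_name, lesion in series["lesions"].items():
                samples.append(
                    (patient_id, dcm_series_name, lesion_name, pixels, lesion[attribute])
                )
    return samples


# Made-up stand-in for a query result; the real one comes from h5_query.py.
result = {
    "patient-A": {
        "ADC": {
            "pixel_array": [[0, 1], [2, 3]],
            "lesions": {"lesion-1": {"ijk": "1 0 0", "ClinSig": True}},
        }
    }
}

samples = collect_labels(result)
X = [pixels for _, _, _, pixels, _ in samples]
y = [label for _, _, _, _, label in samples]
```

Here X still holds full images rather than lesion cutouts, so a cropping step would have to be added before training.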

schelv commented 7 years ago

works great!