mahmoodlab / CLAM

Data-efficient and weakly supervised computational pathology on whole slide images - Nature Biomedical Engineering
http://clam.mahmoodlab.org
GNU General Public License v3.0

Skipping files? #7

Status: Closed (closed by tmabraham 4 years ago)

tmabraham commented 4 years ago

I tried to run the following command:

python create_patches.py --source $SOURCE_DIR --save_dir $SAVE_DIR --patch_size 256 --seg --patch --stitch 

It processes the first image fine, but then I get the following error when processing the second image:

progress: 0.00, 0/10616
processing 0005f7aaab2800f6170c399693a96917.tiff
Creating patches for:  0005f7aaab2800f6170c399693a96917 ...
Bounding Box: 3424 3232 5409 21457
Contour Area: 27171456.0
patches extracted: 493
original size: 27648 x 29440
downscaled size for stiching: 432 x 460
number of patches: 493
patch shape: (256, 256, 3)
start stitching 0005f7aaab2800f6170c399693a96917
progress: 0/493 stitched
progress: 50/493 stitched
progress: 100/493 stitched
progress: 150/493 stitched
progress: 200/493 stitched
progress: 250/493 stitched
progress: 300/493 stitched
progress: 350/493 stitched
progress: 400/493 stitched
progress: 450/493 stitched
segmentation took 0.3680613040924072 seconds
patching took 4.518922328948975 seconds
stitching took 0.18634939193725586 seconds

progress: 0.00, 1/10616
processing 000920ad0b612851f8e01bcc880d9b3d.tiff
Creating patches for:  000920ad0b612851f8e01bcc880d9b3d ...

Traceback (most recent call last):
  File "create_patches.py", line 294, in <module>
    process_list = process_list, auto_skip=args.no_auto_skip)
  File "create_patches.py", line 188, in seg_and_patch
    heatmap, stitch_time_elapsed = stitching(file_path, downscale=64)
  File "create_patches.py", line 14, in stitching
    heatmap = StitchPatches(file_path, downscale=downscale, bg_color=(0,0,0), alpha=-1, draw_grid=False)
  File "/home/tmabraham/CLAM/wsi_core/WholeSlideImage.py", line 46, in StitchPatches
    file = h5py.File(hdf5_file_path, 'r')
  File "/home/tmabraham/anaconda3/envs/clam/lib/python3.7/site-packages/h5py/_hl/files.py", line 408, in __init__
    swmr=swmr)
  File "/home/tmabraham/anaconda3/envs/clam/lib/python3.7/site-packages/h5py/_hl/files.py", line 173, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 88, in h5py.h5f.open
OSError: Unable to open file (unable to open file: name = 'PANDA/patches/000920ad0b612851f8e01bcc880d9b3d.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

This seems to indicate that the program is skipping the file without creating patches for it. Indeed, when I remove the stitching option, I can see the program skipping many images. I get terminal output like this:

progress: 0.00, 13/10616
processing 004f6b3a66189b4e88b6a01ba19d7d31.tiff
Creating patches for:  004f6b3a66189b4e88b6a01ba19d7d31 ...
segmentation took 0.22474193572998047 seconds
patching took 3.695487976074219e-05 seconds
stitching took -1 seconds

Whereas for a properly processed image (without stitching option) I get:

progress: 0.00, 6/10616
processing 003046e27c8ead3e3db155780dc5498e.tiff
Creating patches for:  003046e27c8ead3e3db155780dc5498e ...
Bounding Box: 16176 3488 5441 28721
Contour Area: 35853696.0
patches extracted: 638
segmentation took 0.3057291507720947 seconds
patching took 5.923604488372803 seconds
stitching took -1 seconds

Here is an example of one of the image files in question (displayed with plt.imshow): [image]

Why is the program skipping images? My understanding is that right now, you use simple binary thresholding, correct? Is it possible that the threshold is not correct for my dataset?

fedshyvana commented 4 years ago

Hi Tanishq, yes, you're right: a file is usually skipped because no tissue content was detected, so no patches are extracted following segmentation (which is consistent with the terminal output not producing an error). This is most frequently caused by one of the following:

  1. the binary threshold is set too high to detect the tissue
  2. the area filter is set too high for the size of your tissue, so the tissue regions get filtered out as artifacts
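To make the two failure modes concrete, here is a minimal, self-contained sketch. It is not the CLAM segmentation code: the function name `tissue_regions`, the toy pixel grid, and the parameter names `threshold` and `min_area` are all assumptions made for illustration. It only shows how either setting can leave zero regions, and therefore zero patches, without raising any error.

```python
from collections import deque

def tissue_regions(signal, threshold, min_area):
    """Return the pixel areas of connected regions that survive both filters.

    signal:    2D list of per-pixel values; higher is assumed to mean "more tissue-like".
    threshold: pixels strictly above this value count as foreground.
    min_area:  regions with fewer pixels than this are discarded as artifacts.
    """
    rows, cols = len(signal), len(signal[0])
    mask = [[value > threshold for value in row] for row in signal]
    seen = [[False] * cols for _ in range(rows)]
    areas = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill over 4-connected neighbours.
            queue = deque([(r, c)])
            seen[r][c] = True
            area = 0
            while queue:
                y, x = queue.popleft()
                area += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < rows and 0 <= nx < cols and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if area >= min_area:
                areas.append(area)
    return areas

# Toy "slide": a faint 3x3 tissue blob (value 12) on a blank background (value 0).
slide = [[0] * 6 for _ in range(6)]
for y in range(1, 4):
    for x in range(1, 4):
        slide[y][x] = 12

print(tissue_regions(slide, threshold=8, min_area=4))   # [9]  blob detected
print(tissue_regions(slide, threshold=20, min_area=4))  # []   threshold too high
print(tissue_regions(slide, threshold=8, min_area=16))  # []   area filter too high
```

In both empty cases the function returns normally, which mirrors the silent skip seen in the terminal output above.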

You might want to look at the segmentation outputs, which should be saved in the same folder as the patches and stitches, to see what the segmentation looks like and whether this happens only for rare cases in your dataset. If so, you can follow the guide to tune the segmentation/patching parameters for individual slides without having to reprocess the entire batch. Otherwise, if you notice that the segmentation is unsatisfactory for most of your data, you might consider setting a different set of parameters globally.
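One way to gauge whether skipped slides are rare after a patching run is to list the slides that have no patch file. The sketch below assumes the layout implied by the traceback earlier in this thread, `<save_dir>/patches/<slide_id>.h5`, and `.tiff` slides in the source folder. The helper name `slides_without_patches` is hypothetical and not part of CLAM.

```python
from pathlib import Path

def slides_without_patches(source_dir, save_dir, slide_ext=".tiff"):
    """Return the IDs of slides in source_dir that have no matching .h5 patch file.

    Assumes patch files are written to <save_dir>/patches/<slide_id>.h5,
    as suggested by the path in the error message above.
    """
    patch_dir = Path(save_dir) / "patches"
    missing = []
    for slide in sorted(Path(source_dir).glob(f"*{slide_ext}")):
        if not (patch_dir / f"{slide.stem}.h5").is_file():
            missing.append(slide.stem)
    return missing
```

Calling it with the same directories passed as `--source` and `--save_dir` would return the slide IDs to inspect; a short list suggests per-slide tuning, while a long one suggests changing parameters globally.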

P.S. This is why I recommend in the guide to first go through the dataset once with only --seg enabled. This will go through your dataset and save just the segmentation masks, which takes minimal time, and gives you a chance to see whether anything major needs to be tweaked and to edit parameters if needed before running through the whole dataset with patching and stitching.

tmabraham commented 4 years ago

@fedshyvana Thanks for the suggestions. I fine-tuned the parameters a little bit on a small subset of the dataset and then started creating the segmentation masks, but then it reached a slide that was empty and raised an error.

Is it possible to indicate slides to skip?

Even better, could the program automatically skip through slides if there is an error and return the list of skipped slides?

tmabraham commented 4 years ago

Never mind, it seems the process_list CSV will let me skip certain files by setting them to 0?

fedshyvana commented 4 years ago

@tmabraham Yes, that is correct: you can skip files by setting them to 0 in your edited csv file. Just make sure you save the csv file as a copy (or rename it) and pass it as an argument when running the script. As for skipping files automatically when certain criteria are met and generating logs of their status, I think these are reasonable features that should not be difficult to implement. When I have time, I will try to add them in a future update.
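A minimal sketch of the manual approach follows: it writes an edited copy of the CSV with selected slides set to 0. The column names `slide_id` and `process` are assumptions here, so check the header of your own process list and adjust `id_col` and `flag_col` if it differs. The helper name `mark_slides_to_skip` is hypothetical.

```python
import csv

def mark_slides_to_skip(src_csv, dst_csv, skip_ids, id_col="slide_id", flag_col="process"):
    """Write a copy of src_csv to dst_csv with flag_col set to 0 for chosen slides.

    skip_ids may contain slide IDs with or without the file extension.
    Column names are assumptions; the original file is left untouched.
    Returns the number of rows that were changed.
    """
    with open(src_csv, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)

    changed = 0
    for row in rows:
        slide_id = row[id_col]
        stem = slide_id.rsplit(".", 1)[0]  # ID without its extension, if any
        if slide_id in skip_ids or stem in skip_ids:
            row[flag_col] = "0"
            changed += 1

    # Save under a different name so the original list is preserved.
    with open(dst_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return changed
```

The edited copy is then the file to pass to the script in place of the original list.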

tmabraham commented 4 years ago

@fedshyvana If you haven't been able to add the automatic file-skipping, I have a basic implementation for this and I might be able to open a pull request sometime next week.
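For readers following along, here is a rough sketch of what automatic skipping with a report of skipped slides could look like. It is not the implementation mentioned in the comment above and not CLAM code: `process_all` and `fake_process_slide` are hypothetical names, and the per-slide work is passed in as a callable so the pattern stays independent of any particular pipeline.

```python
def process_all(slide_paths, process_slide):
    """Run process_slide on each path, skipping any slide that raises.

    Returns (processed, skipped), where skipped is a list of
    (path, reason) tuples that can be logged or reviewed afterwards.
    """
    processed, skipped = [], []
    for path in slide_paths:
        try:
            process_slide(path)
        except Exception as exc:  # broad on purpose: one bad slide should not stop the batch
            skipped.append((path, f"{type(exc).__name__}: {exc}"))
        else:
            processed.append(path)
    return processed, skipped


def fake_process_slide(path):
    """Stand-in for real per-slide work; fails on a slide named 'empty.tiff'."""
    if path == "empty.tiff":
        raise ValueError("no tissue found")


processed, skipped = process_all(["a.tiff", "empty.tiff", "c.tiff"], fake_process_slide)
print("processed:", processed)
print("skipped:", skipped)
```

Catching every exception keeps a long batch running, at the cost of hiding unexpected bugs, so the skipped list should be reviewed rather than ignored.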