jlevy44 / PathFlowAI

A High-Throughput Workflow for Preprocessing, Deep Learning Analytics and Interpretation in Digital Pathology
https://jlevy44.github.io/PathFlowAI/
MIT License
38 stars 8 forks

Preprocessing status and outputs #24

Closed asmagen closed 4 years ago

asmagen commented 4 years ago

Hi @jlevy44

How do I know the status of the preprocessing procedure, during and after execution, and what outputs should I see in terms of files being written?

The following ran for a couple of seconds and just printed '512' at the end; I don't see any output files in the directory specified here.

command = 'pathflowai-preprocess preprocess_pipeline \
          -odb patch_information.db \
          --preprocess \
          --patches \
          --basename ' + stainID +'/ \
          --input_dir ' + PFAI_dir + ' \
          --patch_size 512 \
          --intensity_threshold 45. \
          -tc 7 \
          -t 0.05'
print(command)
os.system(command)

Output:

pathflowai-preprocess preprocess_pipeline           -odb patch_information.db           --preprocess           --patches           --basename Li63NDCLAMP/           --input_dir PFAI_inputs           --patch_size 512           --intensity_threshold 45.           -tc 7           -t 0.05
512
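A side note on that lone 512: on POSIX, `os.system` returns the raw wait status rather than printing anything, and 512 == 2 << 8 decodes to an exit code of 2, the usage-error code that Click-based CLIs return. So the 512 here is likely Jupyter echoing a failure status, not output from the tool. A minimal sketch with a stand-in command (the `sys.exit(2)` one-liner is just an illustration, not a pathflowai call):

```python
import os
import sys

# Stand-in for the pathflowai call: a command that exits with status 2
# (the usage-error code Click-based CLIs use). On POSIX, os.system
# returns the raw wait status, with the exit code in the high byte.
status = os.system(f"{sys.executable} -c 'import sys; sys.exit(2)'")
print(status)                              # 512 on POSIX, i.e. 2 << 8
print(os.waitstatus_to_exitcode(status))   # decodes back to 2 (Python 3.9+)
```

Using `subprocess.check_output` or `subprocess.run(..., capture_output=True)` instead would surface the CLI's actual stderr, which is what later commands in this thread do.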

And these are the input files I have in that folder (just one sample for now to test if it works):

(screenshot: directory listing of the input folder, 2020-05-26)

And to make sure my plan is compatible with this function: I plan to run it as the last step in a loop that processes each slide from NDPI separately and prepares the input for PFAI. That assumes the PFAI preprocess command will concatenate the data when it's called on new slides. I'll just remove the '--preprocess' flag after the first iteration. In that context, every time I run the command with --preprocess, am I basically instructing it to redefine the database? Does it delete the old one? And where are they stored?

jlevy44 commented 4 years ago

So the two key flags are --preprocess and --patches. --preprocess creates a Zarr file in place of the PNG, and --patches will construct or append to a SQL db. Can you change your .npy file to end in _mask.npy instead of just .npy?
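To illustrate the rename (the filenames mirror the ones used in this thread; the layout assumed here is mask and slide image side by side with the same basename, per the comment above):

```shell
# Work in a scratch directory with stand-in files
cd "$(mktemp -d)"
touch Li63NDCLAMP.png Li63NDCLAMP.npy   # slide image + its annotation mask

# Rename so --patches can associate the mask with the slide
mv Li63NDCLAMP.npy Li63NDCLAMP_mask.npy
ls
```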

jlevy44 commented 4 years ago

https://github.com/jlevy44/PathFlowAI/issues/26

asmagen commented 4 years ago

I get the same output after changing the file name:

command = 'pathflowai-preprocess preprocess_pipeline \
          -odb patch_information.db \
          --preprocess \
          --patches \
          --basename ' + stainID +'/ \
          --input_dir ' + PFAI_dir + ' \
          --patch_size 512 \
          --intensity_threshold 45. \
          -tc 7 \
          -t 0.05'
print(command)
os.system(command)
pathflowai-preprocess preprocess_pipeline           -odb patch_information.db           --preprocess           --patches           --basename Li63NDCLAMP/           --input_dir PFAI_inputs           --patch_size 512           --intensity_threshold 45.           -tc 7           -t 0.05
512
(screenshot: directory listing after renaming the mask file, 2020-05-26)

I don't see a Zarr file or anything new in the directory, unless it's hidden somehow. I think this function would benefit from some text output telling the user what has been done and what was saved (it currently prints just '512').

asmagen commented 4 years ago

And what's #26 you mentioned here? Is it a problem with the mask orientation that I'm using here?

jlevy44 commented 4 years ago

It’s a reference to a potential bug we may need to work out from a previous patch.

asmagen commented 4 years ago

@jlevy44 See the pending question above about the missing outputs

jlevy44 commented 4 years ago

Fair enough, we will add more progress updates. There should be some display, though, that indicates progress. You also have a forward slash after your stainID that should not be there.

asmagen commented 4 years ago

Great. But again, what outputs should I see right now to evaluate whether it ran and completed appropriately or not? For example, what files should I see being created? I don't see any files but I don't know what to look for. There isn't any Zarr file in the directory. @jlevy44

jlevy44 commented 4 years ago

You should at least see outputs: https://github.com/jlevy44/PathFlowAI/blob/master/pathflowai/cli_preprocessing.py#L92

You should see outputs such as:

Data dump took XXX
Adjust took XXX
Patches took XXX

Have you adjusted your command as previously discussed?

asmagen commented 4 years ago

Yes, I removed the forward slash and it’s still printing only 512. Was the package updated in the last week or two to do these things I don’t see? Maybe I don’t have the latest version.

jlevy44 commented 4 years ago

It is possible that you do not have the latest software. As far as I am aware, this has been a long time feature of the package.

Your command syntax and print appear incorrect:

print(command)
os.system(command)
pathflowai-preprocess preprocess_pipeline           -odb patch_information.db

asmagen commented 4 years ago

Now I'm getting the no such command issue with preprocess:

(saturn) jovyan@jupyter-assafmagen-2dpathflowai:~/project$ ls
computational_imaging  computational_imaging_old  ndpi_images  PFAI_inputs  scikit-image  tissue_masks  training_segmentation_images
(saturn) jovyan@jupyter-assafmagen-2dpathflowai:~/project$ pathflowai-preprocess preprocess_pipeline -odb patch_information.db --preprocess --patches --basename Li63NDCLAMP --input_dir PFAI_inputs --patch_size 256 --intensity_threshold 45. -tc 7 -t 0.05
nonechucks may not work properly with this version of PyTorch (1.5.0). It has only been tested on PyTorch versions 1.0, 1.1, and 1.2
Usage: pathflowai-preprocess [OPTIONS] COMMAND [ARGS]...
Try 'pathflowai-preprocess -h' for help.

Error: No such command 'preprocess_pipeline'.

(saturn) jovyan@jupyter-assafmagen-2dpathflowai:~/project$ pathflowai-preprocess --version
nonechucks may not work properly with this version of PyTorch (1.5.0). It has only been tested on PyTorch versions 1.0, 1.1, and 1.2
pathflowai-preprocess, version 0.1

The package is clearly installed and loaded on a GPU instance, so what could the issue be?

asmagen commented 4 years ago

Here's the image Saturn Cloud has created for me to run the PFAI environment. Would it help on your end to load it and see what the issue is? I don't see any other way I can resolve this.

jlevy44 commented 4 years ago

Just by looking at the YAML files, you have an old version of pathflowai specified. For the latest version of pathflowai, we recommend running:

pip install git+https://github.com/jlevy44/PathFlowAI.git

Add this within the Docker container. Of course, if there are any bugs, I would highly encourage having the flexibility to rebuild the Docker image within the HPC environment, or at least replacing that Docker image with another one housing the latest patch.

asmagen commented 4 years ago

Great, thanks. It's working now. See the outputs below. Can you let me know if it looks okay?

I'm basically running a loop obtaining slides and both region masks, which I'd like to segment, as well as a background mask from HistoQC, which I use to mask the image. I calculate the hematoxylin channel because I don't want the stain to drive the segmentation here. I mask the hematoxylin and segmentation mask matrix to remove the tissue background, save as PNG and NPY, and run the preprocess per stain with flags --preprocess --patches --patch_size 512 --intensity_threshold 45. -tc 7 -t 0.05, among others.

I just wanted to confirm that these are all required both when I preprocess the first stain and when I preprocess the other ones, because for each one separately I want to generate the patches and add them to the same database. In that context, would the database be initiated only once, so that by now it contains all the patches I added in previous test runs and is therefore not set up correctly? If so, how do I clear it before running this process? To clarify, this process generates all the patches I need, and I don't want patches from previous analyses or test runs to be there. Do I just delete the db file, or is there anything else to do?

Also, the process crashed after about 4 slides although I allocated 32 GB mem, 40 GB HD, and 1 GPU. I'm pushing it to 64 GB mem. Does that sound reasonable just for preprocessing, or am I doing something wrong?

output:

Li63N2DCLAMP-labels.tif
(1848, 3248, 7)
ndpi_images/Li63N2DCLAMP.ndpi
ASMA01/data/imaging/liver/raw_ndpi/DCLAMP/Li63N2DCLAMP.ndpi
Created LRU Cache for 'tilesource' with 82 maximum size
Using python for large_image caching
{'levels': 9, 'sizeX': 51968, 'sizeY': 29568, 'tileWidth': 256, 'tileHeight': 256, 'magnification': 20.0, 'mm_x': 0.00044142314822989324, 'mm_y': 0.00044142314822989324}
(7392, 12992, 3)
tissue_masks/Li63N2DCLAMP.ndpi_mask_use.png
ASMA01/data/imaging/liver/tissue_masks/DCLAMP/Li63N2DCLAMP.ndpi/Li63N2DCLAMP.ndpi_mask_use.png
(1848, 3248)
(7392, 12992)
(5336, 10500, 3)
(5336, 10500, 7)
PFAI_inputs/Li63N2DCLAMP.png
pathflowai-preprocess preprocess_pipeline           -odb patch_information.db           --preprocess           --patches           --basename Li63N2DCLAMP           --input_dir /home/jovyan/project/PFAI_inputs           --patch_size 512           --intensity_threshold 45.           -tc 7           -t 0.05
b'Data dump took 2.3092713356018066\nAdjust took 5.245208740234375e-05\nValid Patches Complete\nArea Info Complete\n               ID     x     y  patch_size  ...    3         4    5    6\n0    Li63N2DCLAMP     0     0         512  ...  0.0  0.000000  0.0  0.0\n1    Li63N2DCLAMP     0   512         512  ...  0.0  0.000000  0.0  0.0\n2    Li63N2DCLAMP     0  1024         512  ...  0.0  0.000057  0.0  0.0\n3    Li63N2DCLAMP     0  1536         512  ...  0.0  0.000141  0.0  0.0\n4    Li63N2DCLAMP     0  2048         512  ...  0.0  0.000000  0.0  0.0\n..            ...   ...   ...         ...  ...  ...       ...  ...  ...\n166  Li63N2DCLAMP  9216  2048         512  ...  0.0  0.000000  0.0  0.0\n167  Li63N2DCLAMP  9216  2560         512  ...  0.0  0.000000  0.0  0.0\n168  Li63N2DCLAMP  9216  3072         512  ...  0.0  0.000000  0.0  0.0\n169  Li63N2DCLAMP  9216  3584         512  ...  0.0  0.000000  0.0  0.0\n170  Li63N2DCLAMP  9216  4096         512  ...  0.0  0.000000  0.0  0.0\n\n[171 rows x 12 columns]\nPatches took 2.634455919265747\n'
Li59TDCLAMP-labels.tif
(1344, 1680, 7)
ndpi_images/Li59TDCLAMP.ndpi
ASMA01/data/imaging/liver/raw_ndpi/DCLAMP/Li59TDCLAMP.ndpi
{'levels': 8, 'sizeX': 26880, 'sizeY': 21504, 'tileWidth': 256, 'tileHeight': 256, 'magnification': 20.0, 'mm_x': 0.00044142314822989324, 'mm_y': 0.00044142314822989324}
(5376, 6720, 3)
tissue_masks/Li59TDCLAMP.ndpi_mask_use.png
ASMA01/data/imaging/liver/tissue_masks/DCLAMP/Li59TDCLAMP.ndpi/Li59TDCLAMP.ndpi_mask_use.png
(1344, 1680)
(5376, 6720)
(3736, 4412, 3)
(3736, 4412, 7)
PFAI_inputs/Li59TDCLAMP.png
pathflowai-preprocess preprocess_pipeline           -odb patch_information.db           --preprocess           --patches           --basename Li59TDCLAMP           --input_dir /home/jovyan/project/PFAI_inputs           --patch_size 512           --intensity_threshold 45.           -tc 7           -t 0.05
b'Data dump took 0.6201791763305664\nAdjust took 1.9073486328125e-05\nValid Patches Complete\nArea Info Complete\n             ID     x     y  patch_size  ...    3         4    5    6\n0   Li59TDCLAMP     0     0         512  ...  0.0  0.000000  0.0  0.0\n1   Li59TDCLAMP     0   512         512  ...  0.0  0.000000  0.0  0.0\n2   Li59TDCLAMP     0  1024         512  ...  0.0  0.000000  0.0  0.0\n3   Li59TDCLAMP     0  1536         512  ...  0.0  0.000000  0.0  0.0\n4   Li59TDCLAMP     0  2048         512  ...  0.0  0.000000  0.0  0.0\n5   Li59TDCLAMP   512     0         512  ...  0.0  0.000000  0.0  0.0\n6   Li59TDCLAMP   512   512         512  ...  0.0  0.000000  0.0  0.0\n7   Li59TDCLAMP   512  1024         512  ...  0.0  0.000000  0.0  0.0\n8   Li59TDCLAMP   512  1536         512  ...  0.0  0.000000  0.0  0.0\n9   Li59TDCLAMP   512  2048         512  ...  0.0  0.000183  0.0  0.0\n10  Li59TDCLAMP  1024     0         512  ...  0.0  0.000000  0.0  0.0\n11  Li59TDCLAMP  1024   512         512  ...  0.0  0.000000  0.0  0.0\n12  Li59TDCLAMP  1024  1024         512  ...  0.0  0.000000  0.0  0.0\n13  Li59TDCLAMP  1024  1536         512  ...  0.0  0.000000  0.0  0.0\n14  Li59TDCLAMP  1024  2048         512  ...  0.0  0.000275  0.0  0.0\n15  Li59TDCLAMP  1536     0         512  ...  0.0  0.000069  0.0  0.0\n16  Li59TDCLAMP  1536   512         512  ...  0.0  0.000134  0.0  0.0\n17  Li59TDCLAMP  1536  1024         512  ...  0.0  0.000000  0.0  0.0\n18  Li59TDCLAMP  1536  1536         512  ...  0.0  0.000160  0.0  0.0\n19  Li59TDCLAMP  1536  2048         512  ...  0.0  0.000504  0.0  0.0\n20  Li59TDCLAMP  2048     0         512  ...  0.0  0.000172  0.0  0.0\n21  Li59TDCLAMP  2048   512         512  ...  0.0  0.000141  0.0  0.0\n22  Li59TDCLAMP  2048  1024         512  ...  0.0  0.000286  0.0  0.0\n23  Li59TDCLAMP  2048  1536         512  ...  0.0  0.000694  0.0  0.0\n24  Li59TDCLAMP  2048  2048         512  ...  
0.0  0.000305  0.0  0.0\n25  Li59TDCLAMP  2560     0         512  ...  0.0  0.000042  0.0  0.0\n26  Li59TDCLAMP  2560   512         512  ...  0.0  0.000191  0.0  0.0\n27  Li59TDCLAMP  2560  1024         512  ...  0.0  0.000603  0.0  0.0\n28  Li59TDCLAMP  2560  1536         512  ...  0.0  0.000336  0.0  0.0\n29  Li59TDCLAMP  2560  2048         512  ...  0.0  0.000252  0.0  0.0\n30  Li59TDCLAMP  3072     0         512  ...  0.0  0.000164  0.0  0.0\n31  Li59TDCLAMP  3072   512         512  ...  0.0  0.000191  0.0  0.0\n32  Li59TDCLAMP  3072  1024         512  ...  0.0  0.000263  0.0  0.0\n33  Li59TDCLAMP  3072  1536         512  ...  0.0  0.000015  0.0  0.0\n34  Li59TDCLAMP  3072  2048         512  ...  0.0  0.000000  0.0  0.0\n\n[35 rows x 12 columns]\nPatches took 0.8281879425048828\n'
Li3NDCLAMP-labels.tif
jlevy44 commented 4 years ago

Yeah, the memory utilization is likely due to running the processes in a for loop through Jupyter, which is prone to memory leaks. Typically, I would deploy each of these processes across the HPC. I would also check the resulting SQL database and make sure this is the patch size that you want. You can also add other patch sizes to capture info at a different resolution. You also need the masks in the same directory as the WSI, with the same basename, just replacing the extension with _mask.npy. You don’t want to preprocess the masks as if they were WSIs. Everything else looks ok.
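For checking the resulting database, a minimal sketch with the stdlib sqlite3 module (the db path is assumed to be relative to the directory the preprocessing ran in, and how patch sizes map to tables is something listing the tables will reveal rather than an established fact):

```python
import sqlite3

# Open the patch database produced by --patches
con = sqlite3.connect("patch_information.db")

# List every table and its row (patch) count to see what was ingested
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
for t in tables:
    count = con.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0]
    print(t, count, "patches")
con.close()
```

If a table shows unexpected rows left over from earlier test runs, that would confirm the append behavior asked about above.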

asmagen commented 4 years ago

What do you mean by 'You don’t want to preprocess the masks as if they were WSI'? Also, can you clarify how I initiate a new database once I change my image input strategy?

asmagen commented 4 years ago

In addition to the above, I'm going to follow your advice regarding multiple patch sizes per image so I would do this:

    command = 'pathflowai-preprocess preprocess_pipeline \
          -odb patch_information.db \
          --preprocess \
          --patches \
          --basename ' + stainID +' \
          --input_dir ' + os.path.join(base_path,PFAI_dir) + ' \
          --patch_size 512 \
          --intensity_threshold 45. \
          -tc 7 \
          -t 0.05'
    print(command)
    result = subprocess.check_output(command, shell=True)
    print(result)

    command = 'pathflowai-preprocess preprocess_pipeline \
          -odb patch_information.db \
          --patches \
          --basename ' + stainID +' \
          --input_dir ' + os.path.join(base_path,PFAI_dir) + ' \
          --patch_size 1024 \
          --intensity_threshold 45. \
          -tc 7 \
          -t 0.05'
    print(command)
    result = subprocess.check_output(command, shell=True)
    print(result)

I'm omitting the --preprocess flag from the second run per slide, using size 512 and then 1024. Other than that, how do I determine the intensity_threshold/tc/t params?

@jlevy44

jlevy44 commented 4 years ago

That looks right to me. Do you have 7 output classes? -tc 7

Also, are you using a black background for the slides? Please convert the background to white if so.

asmagen commented 4 years ago

I do have 7 segmentation (not classification, to be clear) classes, including the background as the first channel: Background, Bile Ducts, Normal, Tumor, Stroma, Tissue Fold, Lymphoid Aggregate. So tc is supposed to match the number of mask channels, right? Isn't it better not to require that parameter in that case?

Why do you need the background white? I'm using the deconvolved grayscale hematoxylin image, so most of the image is black. Actually, that might be the problem: maybe all the patches are being filtered out because they don't exceed the intensity threshold, which you may have optimized for RGB images. How do I determine that?

jlevy44 commented 4 years ago

That info should be in the db file and can be visualized using any of our visualization functions (going to update). We remove background based on whether it is white, which will be especially pertinent when we implement Otsu thresholding.

You can change the threshold intensity, but if you use a black background it will grab the entire background, which you will have to filter out manually.

jlevy44 commented 4 years ago

One way to set the intensity is to Otsu-threshold one of your images, then take 255 - otsu_threshold as the intensity.
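A sketch of that calculation, using a hand-rolled Otsu over a toy bimodal "image" so it stays self-contained (on a real slide, skimage.filters.threshold_otsu does the same job; the toy pixel values here are made up for illustration):

```python
import numpy as np

def otsu_threshold(gray):
    """Classic Otsu: pick the threshold maximizing between-class variance."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist / hist.sum()
    omega = np.cumsum(p)                    # class-0 probability
    mu = np.cumsum(p * np.arange(256))      # cumulative mean
    mu_t = mu[-1]                           # global mean
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1 - omega))
    # Any index on the plateau of maximal variance separates the classes;
    # nanargmax returns the first one.
    return int(np.nanargmax(sigma_b))

# Toy bimodal grayscale data: dark tissue (40) on a bright background (220)
img = np.concatenate([np.full(500, 40), np.full(500, 220)]).astype(np.uint8)
t = otsu_threshold(img)
intensity_threshold = 255 - t   # value to pass as --intensity_threshold
print(t, intensity_threshold)
```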