ncoudray / DeepPATH

Classification of Lung cancer slide images using deep-learning

Advice needed on mutation classification #37

Closed bcli4d closed 5 years ago

bcli4d commented 5 years ago

Hi Nicolas,

I've done mutation classification training (from scratch) for about 500K batches at 30 tiles/batch. Running classification on test data shows the model seems to be recognizing EGFR pretty well (AUC ~ 0.82), but AUC for all the other mutations kind of dances around 0.5. However, the AUCs for the validation data set look, to my naive eye, more like what I would expect. Here is a chart:

[Image: Screen Shot 2019-06-10 at 1 47 19 PM]

Is this seeming lack of correlation between validation and testing results to be expected? Am I just in the early stages of training? How many batches did you need to run when you trained for mutation classification? Is your final checkpoint available somewhere? It would be useful in determining whether my testing/validation/viz steps are functioning as expected.

Regards, Bill

ncoudray commented 5 years ago

Hi Bill,

Did you pre-select the LUAD tiles from the LUAD slides beforehand, and did you check that this part is correct?

Best, Nicolas

dgutman commented 5 years ago

So this may not be the best place to post, but I built and maintain http://cancer.digitalslidearchive.net/ and there's an entire API/backend that can be used to get tile/region data. I could also potentially host results from anyone who is building or wants to disseminate models, code, or anything else related to the TCGA data sets.


-- David A Gutman, M.D. Ph.D. Assistant Professor of Neurology, Psychiatry, and Biomedical Informatics Emory University School of Medicine

bcli4d commented 5 years ago

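Thanks David. Given the enormous monetary cost (and apparently the environmental cost, see https://www.technologyreview.com/s/613630/training-a-single-ai-model-can-emit-as-much-carbon-as-five-cars-in-their-lifetimes/) of training these very large models, sharing results is extremely important.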

Bill

bcli4d commented 5 years ago

Nicolas,

Once I've fine-tuned the model for tumor classification, I then need to identify all LUAD tiles that were scored correctly. Can you suggest how to do this? It doesn't seem that training preserves the classification of individual tiles in the training set.

Regards, Bill

ncoudray commented 5 years ago

When you use "0d_SortTiles.py" to sort the images based on their mutations, you can use the last options mentioned in the README on the GitHub page:

This should create a LUAD folder containing only the tiles labeled as LUAD by the classifier. Once done, you can move on to the next steps. Make sure you use the "sigmoid" training.

Also, the mutations are assigned to each tile/slide when you convert to TFRecord. In the test and validation sets, the TFRecords are created per slide and the name is changed: the end of the name corresponds to the labels, in binary. For example, ....1000000001.TFRecord would have EGFR and TP53 mutations (again, mutations in alphabetical order). This naming is just a way to easily double-check that everything has been done properly up to that point.
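For anyone who wants to check the binary suffix programmatically, here is a minimal sketch. It assumes a 10-bit suffix directly before ".TFRecord" and a ten-gene alphabetical list; only "EGFR first, TP53 last" is confirmed by the example above, so the other gene names and the file name used below are assumptions to verify against your own label files.

```python
import re

# Assumption: the ten genes of the 10-way classifier, in alphabetical order.
# Only "EGFR first, TP53 last" is confirmed by the example above; the other
# eight names are illustrative and should be checked against your own labels.
GENES = ["EGFR", "FAT1", "FAT4", "KEAP1", "KRAS",
         "LRP1B", "NF1", "SETBP1", "STK11", "TP53"]


def decode_label_suffix(filename, genes=GENES):
    """Return the genes flagged '1' in the binary suffix of a TFRecord name."""
    match = re.search(r"([01]{%d})\.TFRecord$" % len(genes), filename)
    if match is None:
        raise ValueError("no %d-bit label suffix in %r" % (len(genes), filename))
    bits = match.group(1)
    return [gene for gene, bit in zip(genes, bits) if bit == "1"]


# Hypothetical file name; only the 10-bit suffix matters here.
print(decode_label_suffix("test_slide_1000000001.TFRecord"))  # ['EGFR', 'TP53']
```

Running this over the test and validation folders and comparing against the mutation table is one quick way to spot a labeling mix-up before training.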

HTH N.

dgutman commented 5 years ago

Yeah, I just saw the same post. I'll start a separate issue on the GitHub page to solicit ideas on how to make it easier for people to grab images and tiles, as well as post models or results.


bcli4d commented 5 years ago

Nicolas,

I appreciate your help, and hope I'm not being too dense...

To generate the out_filename_Stats.txt, do I run all tiles from LUAD slides through the (tumor) classifier? Specifically, do I include tiles that were in the training and validation sets that were used to train the classifier?

If that is the case, then it seems like I need to re-sort the LUAD tiles using sort option 10, setting --PercentTest=100, and then use build_TF_test.py to generate TFRecord files that I then run through the classifier?

Bill

ncoudray commented 5 years ago

Hi Bill -

Correct. You can do it this way. That way you can segment all of the images.

Best N.

bcli4d commented 5 years ago

Nicolas,

How many batches were required to train the model for mutation classification? I need to understand if I have the budget for fully training in the Google cloud.

Bill

ncoudray commented 5 years ago

Bill, I think I did batches of 30 over 500k iterations at that time. N.

bcli4d commented 5 years ago

Amazing timing...I was working on a new response when you closed this issue!

I recently spent more time working on this but with similar results. I filtered the LUAD tiles as you described above. This reduced the number of tiles from 923K to 734K. I've now trained on the reduced tile set for 560K batches (of 30 tiles per batch). Unfortunately, the results are not at all consistent with yours. The ROC curves for the validation set are attached (AvPb followed by PcSel).
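As a rough sanity check on how much training this represents, the numbers above can be turned into an approximate number of passes over the filtered tile set. This is only arithmetic on the rounded figures quoted in this comment, not a measurement.

```python
tiles_before = 923_000    # tiles before LUAD filtering (rounded, from above)
tiles_after = 734_000     # tiles kept after filtering (rounded, from above)
batches = 560_000
tiles_per_batch = 30

fraction_removed = 1 - tiles_after / tiles_before
tiles_seen = batches * tiles_per_batch
passes = tiles_seen / tiles_after

print(f"removed by filtering: {fraction_removed:.1%}")     # 20.5%
print(f"tiles seen in training: {tiles_seen:,}")           # 16,800,000
print(f"approximate passes over the data: {passes:.1f}")   # 22.9
```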

Is it possible to get your final mutation classification checkpoint? If I have that, then I can possibly determine whether there is a problem in my process.

Regards, Bill

[Images: out2_roc_data_AvPb, out2_roc_data_PcSel]

ncoudray commented 5 years ago

Hi - The number of tiles you have is surprisingly high! In the paper, we had ~350k tiles from LUAD slides (Supp. table 6; before selection of LUAD tiles, only frozen slides used). Do you have more slides than just the ones from the TCGA? Are you sure that you properly selected only the LUAD tiles from the LUAD slides? Please have a close look first at the tiles that you used as input. Also, you can share all your submission scripts so I can have a look at them (though it may take me more time).

Thanks, Nicolas

bcli4d commented 5 years ago

Hi Nicolas,

These are 20x magnification tiles, 299 pixels, 0 overlap, 25% background threshold. I'm only using the frozen tissue slides. There are currently 1067 TCGA-LUAD frozen tissue slides (823 tumors, 244 normal) and 1100 TCGA-LUSC frozen tissue slides (753 tumors, 347 normal). For the mutation classification, I only used tiles from TCGA-LUAD slides marked as tumors.

Are the numbers in Supp. table 6 perhaps reduced due to masking? I don't have access to pathologist annotated masks.

My work is in the form of the Jupyter notebook in this repo. It's a big notebook, and if you really want to look at it (which I'd appreciate), I think that you will need to open it as a notebook. That is, GitHub can sometimes render a notebook, but this one seems to be beyond its abilities, probably because I saved it with all the cell output intact, which is substantial (ROC curves, a heatmap, etc.). I think that keeping that output will help with understanding, provided you can get the notebook opened. Of course, just to review the notebook, you don't need to go through the installation steps described in the Readme; you should be able to open the notebook on a local machine.

In my installation, I've got a bunch of nb extensions enabled, particularly the table of contents extension, which is very helpful in navigating through the notebook. However, I haven't had time to figure out how to propagate such extension configurations.

Regards, Bill

ncoudray commented 5 years ago

Hi Bill,

I'll have a look once I have time, but probably not in the next few days. We tiled at 512 pixels, but I doubt this would make a difference (it would explain why you have more tiles, though). Meanwhile, you can try a binary classification (with softmax) of just STK11 vs WT and see what it gives (I feel it's easier to start this way to debug and make sure things are on track). Also, when you sort for mutations, be careful to point SourceFolder to the right folder (i.e., only the LUAD tiles) or, easier, make sure that the out_filename_Stats.txt file was obtained by running the 3-way classifier exclusively on LUAD slides.
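The remark about tile size can be checked with a quick back-of-the-envelope calculation. The sketch below assumes the same slides, the same magnification, zero overlap and the same background filter in both cases, which may not hold exactly, so the result is an order-of-magnitude estimate only.

```python
paper_tile_px = 512     # tile edge used for the paper (from the comment above)
my_tile_px = 299        # tile edge used in Bill's experiments
paper_tiles = 350_000   # approximate count quoted from Supp. table 6

# Each 512-pixel tile covers the area of (512 / 299)^2 tiles of 299 pixels.
area_ratio = (paper_tile_px / my_tile_px) ** 2
expected_tiles = paper_tiles * area_ratio

print(f"area ratio: {area_ratio:.2f}")
print(f"expected number of 299-pixel tiles: {expected_tiles:,.0f}")
```

Under these assumptions the ratio is a little under 3, so roughly a million 299-pixel tiles would be expected, which is the same order as the 923K reported earlier in the thread.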

Best, Nicolas

bcli4d commented 5 years ago

Thanks Nicolas.

I tiled at 299 since it's my understanding that this is the image size expected by Inception V3.

I'll try your suggestion.

Bill


tsirigos commented 5 years ago

Just a quick note for Bill: did you choose the LUAD tiles or the tumor tiles? Your note suggests the latter, I think. Aris


bcli4d commented 5 years ago

Aris,

I'm not sure what you mean by "the tumor tiles", nor what I wrote that concerns you.

Anyway, for mutation classification, I used TCGA-LUAD files whose barcode indicates that they are frozen tissue samples and not "normal tissue". I ran all the tiles from these files through the previously trained classifier, and used the resulting out_filename_stats.txt in the subsequent sort for mutation classification.

Bill

bcli4d commented 5 years ago

Hi Nicolas,

Finally back to this...

You recommended that I "do a binary classification (with softmax) of just STK11 vs WT". What is the definition of WT in this case? Samples that don't have an STK11 mutation, but may have others of the ten mutations? Or samples that have none of the ten mutations?

Regards, Bill

ncoudray commented 4 years ago

Hi Bill - Sorry, I've been very busy writing a paper. We are going to send you a comprehensive set of instructions very soon for the binary classification. Thanks for your patience, Best, Nicolas

bcli4d commented 4 years ago

Thanks Nicolas, I understand. As soon as I said I was back on this, I needed to work on something else.


ncoudray commented 4 years ago

Hi Bill -

So for binary classification with softmax, by "WT", I indeed meant everything that is non-STK11. Below is an example of results we obtained:

[Image: Screenshot 2019-09-18 10 40 43]

We repeated the training 4 times to see variability of different runs. The number of mutants being small, the CIs are pretty wide, as in the paper.

The graph above was obtained using the following script:

bazel-bin/inception/imagenet_train --num_gpus=4 --batch_size=400 --train_dir="full_path_to/results_T090f_STK11_d10b" --data_dir="T090f_TFRecord_train" --ClassNumber=2 --mode='0_softmax' --NbrOfImages=154758 --save_step_for_chekcpoint=389 --max_steps=38901 --num_epochs_per_decay=10
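To relate these flags to epochs, here is a small sketch of the arithmetic. It assumes that --batch_size is the total batch across all GPUs rather than a per-GPU value, which is worth confirming in the training code.

```python
import math

nbr_of_images = 154_758   # --NbrOfImages
batch_size = 400          # --batch_size (assumed to be the total across GPUs)
checkpoint_step = 389     # --save_step_for_chekcpoint
max_steps = 38_901        # --max_steps

steps_per_epoch = math.ceil(nbr_of_images / batch_size)
epochs_total = max_steps * batch_size / nbr_of_images
checkpoints = max_steps // checkpoint_step

print(f"steps per epoch: {steps_per_epoch}")      # 387
print(f"total epochs: {epochs_total:.1f}")        # 100.5
print(f"checkpoint intervals: {checkpoints}")     # 100
```

Under that assumption, a checkpoint is saved about once per epoch and the run covers about 100 epochs.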

To help you reproduce the results, I attach the list of slides used in the test, validation and training sets:

STK11_train.txt STK11_valid.txt STK11_test.txt

The file used to select the LUAD tiles (out_filename_Stats_8av3.txt) when creating the datasets with "0d_SortTiles.py" is too big to post on github (122MB - but if you send me an email with a link to your preferred file sharing system, I will send it to you).

Let us know if you can get the same results with the same split. If this works, feel free to try other data splits or options.

Note that it still looks like mutants with a higher allele frequency are better identified: if we re-compute the above AUCs but label as "STK11" only those with an allele frequency above 0.3, the average AUC is ~3% higher:
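The relabeling described here can be sketched as follows. This is not the script used for the figures; the per-slide values are made up purely to show the mechanics and say nothing about the real data. The AUC is computed directly as the probability that a mutant slide scores higher than a non-mutant one.

```python
def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def relabel_by_allele_frequency(allele_freqs, threshold=0.3):
    """Label a slide as mutant only if its allele frequency exceeds the threshold.

    None means the slide carries no mutation.
    """
    return [1 if af is not None and af > threshold else 0 for af in allele_freqs]


# Made-up per-slide values, for illustration only.
allele_freqs = [0.45, 0.10, None, None, 0.60, None]
scores = [0.80, 0.30, 0.40, 0.20, 0.90, 0.35]

any_mutation = [0 if af is None else 1 for af in allele_freqs]
high_af_only = relabel_by_allele_frequency(allele_freqs)

print("AUC, any mutation:", auc(any_mutation, scores))
print("AUC, allele frequency > 0.3:", auc(high_af_only, scores))
```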

[Image: Screenshot 2019-09-18 10 40 49]

Also, on the main GitHub page, I added an "example_TCGA_lung" folder with information that should help reproduce the results from the paper (we currently favor binary classifiers, as we feel we get better and easier control over the data than with a 10-way classifier). Checkpoints are also available.

HTH, Best, Nicolas

gabrieldernbach commented 2 years ago

The example_TCGA_lung section is very helpful.

On EGFR I have one split where I consistently get the model above 0.82 (CI 0.71 - 0.94) AUC, on another split it didn't go beyond 0.61 (CI 0.43 - 0.79) AUC.

What were your experiences with different splits?

ncoudray commented 2 years ago

Hi Gabriel - Beyond EGFR, from my general experience, results can indeed vary quite a lot between different splits. It seems to depend on several factors, the main one being the size of the cohort. Sometimes, optimizing the architecture's parameters, or applying data augmentation or normalization, can help. Also, the balance and consistency of the input data can affect results (whether only tumor regions are used or not, cancer staging, which data are included, etc.). N.

monajalal commented 2 years ago

@ncoudray, @bcli4d and @gabrieldernbach, could you please tell me what your data was? Was it the entire LUAD data set, with 566 cases of which 81 are EGFR? Also, did you use the Top Slide or Bottom Slide from the tissue slides, or did you use the Diagnostic Slide?

For example, if you used the Top Slide (TS), we have 74 non-EGFR and 14 EGFR cases in the test set (assuming a 60/20/20 train/val/test split), which means we would get an accuracy of 84.1% if our model predicted everything as non-EGFR, the majority class.
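The majority-class baseline in this example can be written out explicitly. The sketch below uses only the 74/14 counts quoted above and shows why accuracy alone is misleading here: the same trivial model has zero sensitivity.

```python
negatives = 74   # non-EGFR cases in the example test set
positives = 14   # EGFR cases in the example test set

# Confusion matrix of a model that always predicts "non-EGFR".
tp, fn = 0, positives
tn, fp = negatives, 0

accuracy = (tp + tn) / (positives + negatives)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

print(f"accuracy={accuracy:.1%} sensitivity={sensitivity:.0%} "
      f"specificity={specificity:.0%}")
# accuracy=84.1% sensitivity=0% specificity=100%
```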

So could you please confirm the proportion of EGFR vs non-EGFR in your training set, and share your confusion matrix as well as your specificity and sensitivity? Also, which type of SVS images did you use? Thank you