Potential Issues with cancer_type parameter and associated files

iozkaraca commented 1 month ago

https://github.com/judithabk6/clonesig/blob/ec9d0fca0a81a0795517c95e540b3f17b20ede95/clonesig/run_clonesig.py#L282

Thank you for developing such an amazing tool! CloneSig is incredibly handy and easy to use.

I’ve encountered a couple of potential minor issues with the cancer_type parameter:

1.  Inconsistent cancer_type Indexing:

In line 282 (and the following few lines) of run_clonesig.py, the code loads the cancer-type signature data. There, OV sits at column index 18 (counting from 0). However, the example notebook sets cancer_type = 25. Based on my reading of run_clonesig.py, the cancer_type reference values in the example notebook may not be accurate. I’d greatly appreciate it if you could double-check this.

2.  Mismatch in Cancer-Specific Signatures:

The cancer-specific signatures in match_cancer_type_sig_v3.csv and curated_match_signature_cancertype_tcgawes_literature.csv do not seem to match, particularly for OV. Although this doesn’t affect the CloneSig code, it could be misleading for users.
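To make the mismatch concrete, one could compare the set of signatures each table marks as active for a given cancer type. The sketch below is only illustrative: the two tables are small made-up stand-ins, and the layout (tab-separated, one row per signature, one 0/1 column per cancer type, a first column named signature) is an assumption, not the verified format of the CloneSig data files.

```python
import csv
import io

# Toy stand-ins for the two match tables. Layout and values are assumed for
# illustration and are NOT copied from the CloneSig data files.
TABLE_A = "signature\tBRCA\tOV\nSBS1\t1\t1\nSBS3\t1\t1\nSBS5\t1\t1\nSBS13\t1\t0\n"
TABLE_B = "signature\tBRCA\tOV\nSBS1\t1\t1\nSBS3\t1\t0\nSBS5\t1\t1\nSBS13\t1\t1\n"


def active_signatures(table_text, cancer_type, delimiter="\t"):
    """Return the set of signatures flagged with "1" for one cancer type."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter=delimiter)
    return {row["signature"] for row in reader if row[cancer_type] == "1"}


def compare_tables(text_a, text_b, cancer_type):
    """Summarise which signatures the two tables agree or disagree on."""
    a = active_signatures(text_a, cancer_type)
    b = active_signatures(text_b, cancer_type)
    return {
        "only_in_a": sorted(a - b),
        "only_in_b": sorted(b - a),
        "shared": sorted(a & b),
    }


diff = compare_tables(TABLE_A, TABLE_B, "OV")
print(diff)
```

With the real files, the strings would be replaced by the file contents, and the delimiter and column names adjusted to whatever the files actually use.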

Thank you once again for your hard work on this fantastic software!

Best wishes,

Ismail

judithabk6 commented 3 weeks ago

Hi Ismail,

Thank you for your interest in CloneSig. Indeed, there is no good consensus on the cancer types, nor across the different files. Cancer type and cohort names are not consistent between TCGA and PCAWG (I can't find a good webpage to show you, as the ICGC portal has closed, but you can see it on cBioPortal). I tried to sit down with a medical oncologist to map the two cancer type lists onto each other; some types were easy, like ovarian, but others were more complicated, and I dropped the idea. So I used those two files for different analyses. Are you analyzing samples from a single cancer type, or are you considering a pan-cancer study?

There can be some inconsistencies, as there is no general consensus on which signatures are present in each cancer type, or on which signatures are artefacts, and so on.

If you are considering a single cancer type, you can check the most recent literature and compile your own list of signatures to include in the analysis. If you are considering a pan-cancer analysis, the most recent table would be match_cancer_type_sig_v3.csv, which comes from Alexandrov's 2020 paper (https://www.nature.com/articles/s41586-020-1943-3). You may also want to check the latest version of the signatures (v3.4 on the COSMIC website, https://cancer.sanger.ac.uk/signatures/sbs/), but I don't see an easy-to-download match table between cancer types and signatures. The v3 that is in CloneSig is probably good enough; as far as I checked, there are not many differences between v3 and v3.4.
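As a rough illustration of the "compile your own list" route, the sketch below keeps only the rows of a signature matrix that belong to a hand-curated list. Everything in it is hypothetical: the signature names, the toy matrix (real SBS signatures have 96 mutation-type columns), and the helper select_signatures are placeholders, and how the resulting matrix is handed to CloneSig is deliberately not shown.

```python
# Hypothetical signature names and a toy signature matrix
# (one row per signature, one column per mutation type).
ALL_SIGNATURES = ["SBS1", "SBS2", "SBS3", "SBS5", "SBS13"]
TOY_MU = [
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.70, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
    [0.40, 0.20, 0.20, 0.20],
    [0.10, 0.10, 0.10, 0.70],
]


def select_signatures(mu, names, wanted):
    """Return the rows of mu for the wanted signatures, in the order requested."""
    missing = [s for s in wanted if s not in names]
    if missing:
        # Fail loudly instead of silently dropping a misspelled signature.
        raise ValueError(f"unknown signatures: {missing}")
    return [mu[names.index(s)] for s in wanted]


# Placeholder list; a real one would come from the literature for the cancer type.
curated = ["SBS1", "SBS3", "SBS5"]
sub_mu = select_signatures(TOY_MU, ALL_SIGNATURES, curated)
print(len(sub_mu), "signatures kept")
```

Selecting by name rather than by position keeps the curated list readable and makes a typo in a signature name an immediate error.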

Sorry for the not so clear answer, but I am happy to be more specific to help with your particular analysis.

iozkaraca commented 3 weeks ago

Hi Judith,

I completely understand; preparing that kind of pan-cancer type dataset is indeed challenging.

Thank you for your offer to help. I’m currently working exclusively with OV, and the signatures in 'data/curated_match_signature_cancertype_tcgawes_literature.csv' are exactly what I need. Since CloneSig defaults to using this data, there’s no issue on that front for me.

However, in my analysis, I had to use default_MU = get_MU(cancer_type=18) instead of default_MU = get_MU(cancer_type=25) as shown in the example run (Section: Alternative parameters to adjust) to obtain the OV-specific signatures.

This seems like a minor discrepancy in the example instructions, but I thought it was worth mentioning in case you want to update the example run documentation regarding the cancer_type parameter.
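One way to avoid depending on a hard-coded integer such as 18 or 25 is to look the index up by cancer type name from the header of the match table. This is a minimal sketch under stated assumptions: the header line is made up, the file is assumed to be tab-separated with one leading label column, and cancer_type is assumed to be a 0-based index into the remaining columns. None of this has been checked against the actual CloneSig file, and cancer_type_index is a hypothetical helper.

```python
import csv
import io

# Made-up header; the real table has many more cancer types and may use a
# different delimiter or a different number of leading label columns.
HEADER_LINE = "signature\tBLCA\tBRCA\tOV\tSKCM\n"


def cancer_type_index(table_text, cancer_type, n_label_columns=1, delimiter="\t"):
    """Return the 0-based position of cancer_type among the cancer-type columns."""
    header = next(csv.reader(io.StringIO(table_text), delimiter=delimiter))
    cancer_types = header[n_label_columns:]
    try:
        return cancer_types.index(cancer_type)
    except ValueError:
        raise KeyError(
            f"{cancer_type!r} not found; available: {cancer_types}"
        ) from None


ov_index = cancer_type_index(HEADER_LINE, "OV")
print(ov_index)
# The result could then be passed on, e.g. get_MU(cancer_type=ov_index)
```

If the index convention in run_clonesig.py differs (for example, if the label column is counted), adjusting n_label_columns would be the place to account for it.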

judithabk6 / clonesig, issue #8