itmoon7 / onconpc

Clinical sequencing-based primary site classifier
GNU General Public License v2.0

About SNV and CNV data preparation #2

Closed carolineTr closed 11 months ago

carolineTr commented 1 year ago

Dear Intae Moon,

I would like to use the very interesting tool you developed in prediction mode, and I have some questions about how the SNV and CNV information should be prepared.

Could you please let me know :

Thank you in advance for your help.

itmoon7 commented 1 year ago

Thanks for your questions!

  1. We've included every gene from the mutation files, since we're uncertain about the impact of these genes on predicting various cancer types, beyond those that are already well-established. Please note that the frequency of gene mutations has been encoded as positive integers (refer to line 203 in codes/process_features.py). In case it's helpful, we did run the ablation analysis where we found that the model maintained decent performance even when limited to the top 50% of genes, as determined by the auxiliary explanation model (refer to Extended Data Fig. 3 of the paper https://www.nature.com/articles/s41591-023-02482-6).

  2. Regarding the CNA events, the five levels of CNA are defined as follows:

    -2: deep loss - a homozygous deletion: both copies of a gene in a diploid cell are absent.
    -1: single-copy loss - a heterozygous deletion: one copy of the gene has been deleted in a diploid cell, while the other remains intact.
     0: diploid - the standard copy number, signifying that no deletion or amplification has occurred.
     1: low-level gain - a minor increase in the gene's copy number.
     2: high-level amplification - a substantial increase in the gene's copy number.

The raw CNA data we utilized is categorized into those 5 distinct levels for each gene. For detailed threshold information, one would need to consult the AACR_GENIE_DATA guidelines (https://www.aacr.org/professionals/research/aacr-project-genie/) and panel-specific documentation (OncoPanel: https://pubmed.ncbi.nlm.nih.gov/28557599/, MSK-IMPACT: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5808190/). Please let me know if you have any additional questions.
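For readers preparing their own inputs, the two encodings described above (a per-gene mutation count and a five-level CNA value) can be sketched roughly as follows. This is only an illustration, not the code in codes/process_features.py: the function names, the `(sample_id, gene)` input shape, and the toy data are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# The five CNA levels described above.
CNA_LEVELS = {-2, -1, 0, 1, 2}


def count_mutations(calls):
    """Count mutations per gene for each sample.

    `calls` is an iterable of (sample_id, gene) pairs, one per called variant
    (a hypothetical input shape chosen for this sketch).
    """
    per_sample = defaultdict(Counter)
    for sample_id, gene in calls:
        per_sample[sample_id][gene] += 1
    return per_sample


def feature_row(gene_counts, gene_order):
    """Lay out one sample's counts in a fixed gene order (0 if unmutated)."""
    return [gene_counts.get(gene, 0) for gene in gene_order]


def check_cna_levels(cna_by_gene):
    """Raise if any CNA value falls outside the five expected levels."""
    bad = {g: v for g, v in cna_by_gene.items() if v not in CNA_LEVELS}
    if bad:
        raise ValueError(f"unexpected CNA levels: {bad}")


# Toy data, invented for illustration only.
calls = [("S1", "TP53"), ("S1", "TP53"), ("S1", "KRAS"), ("S2", "EGFR")]
genes = ["EGFR", "KRAS", "TP53"]
counts = count_mutations(calls)
rows = {sample: feature_row(counts[sample], genes) for sample in sorted(counts)}
check_cna_levels({"MYC": 2, "CDKN2A": -2, "TP53": 0})  # passes silently
```

Here `rows["S1"]` holds the counts for EGFR, KRAS, and TP53 in that order, so a gene mutated twice in a sample contributes the value 2 rather than a binary flag.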

carolineTr commented 1 year ago

Thank you very much for your detailed explanations! I will read this information carefully and let you know if I have additional questions.

carolineTr commented 1 year ago

Dear Intae,

I read the panel-specific documentation as you advised, but I could not find the threshold used to distinguish low-level from high-level gains. Could you please let me know which threshold you use when you apply your model to new patients?

Thank you for your help.

Best regards, Caroline

Note on point 1 above: for each gene, the number of variants is counted.

itmoon7 commented 1 year ago

Hi Caroline,

As previously noted, we used CNA data from DFCI in which the copy number levels were already pre-classified. Unfortunately, we do not currently have access to the specific thresholds they applied. That said, I recommend reviewing Table 1 in the following article (https://elifesciences.org/articles/50267), which details each CNA level alongside its absolute copy number. Additionally, details about the CNA threshold for the MSK-IMPACT panel can be found in the "Copy Number Variant and Structural Variant Calling" section of this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5808190/. It's very likely that similar thresholds were used for the CNA data we worked with.
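If you need to discretize your own calls, one possible shape for the mapping is sketched below. It is not the DFCI or MSK-IMPACT procedure: it assumes you already have an integer absolute copy number per gene and a diploid baseline, and the `cna_level` function and its `amp_threshold` parameter are hypothetical names for this example. The cutoff between low-level gain and high-level amplification is deliberately left as a required argument, because the actual value has to come from the references above or from your own pipeline.

```python
def cna_level(copy_number: int, amp_threshold: int) -> int:
    """Map an absolute copy number to one of the five CNA levels.

    Assumes a diploid baseline (2 copies = level 0). `amp_threshold` is the
    smallest copy number treated as high-level amplification; it is a
    placeholder the caller must supply, not a value taken from any panel.
    """
    if copy_number < 0:
        raise ValueError("copy number cannot be negative")
    if amp_threshold <= 2:
        raise ValueError("amp_threshold must be above the diploid baseline")
    if copy_number == 0:
        return -2  # deep loss
    if copy_number == 1:
        return -1  # single-copy loss
    if copy_number == 2:
        return 0   # diploid
    if copy_number < amp_threshold:
        return 1   # low-level gain
    return 2       # high-level amplification


# The threshold of 6 below is arbitrary and used only to exercise the function.
levels = [cna_level(cn, amp_threshold=6) for cn in [0, 1, 2, 3, 6]]
```

In practice, panel pipelines often call CNAs from log ratios and account for tumor purity and ploidy, so a mapping like this is only a starting point for checking that your levels line up with the five categories the model expects.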

All the best, Intae

carolineTr commented 1 year ago

Thank you again for your answer, it helps.

Best regards, Caroline
