remontoire-pac / ice-cancer-cell-lines

interoperable, integrated CTD^2 cancer cell-line computational environment exemplar
MIT License

Figure out how ICE will handle and expand mutations as binary variables #20

Open remontoire-pac opened 4 years ago

remontoire-pac commented 3 years ago

@sandrine-m @bpetros95

For our first pass at mutations, we will focus on non-synonymous coding SNPs that result in a single amino-acid change (not truncations). Here is a workflow description that is close to pseudocode for the process.

(1) Verify this is the correct 20Q2 file: https://ndownloader.figshare.com/files/22629110 (claims CSV, but actually TSV).
(2) Extract a TABLE from the 20Q2 mutations file with the needed columns from all rows: Hugo_Symbol, Variant_Classification, Variant_Type, Tumor_Sample_Barcode, Protein_Change, Variant_annotation, DepMap_ID.
(3) Use Variant_Classification, Variant_Type, Variant_annotation to keep only rows for non-synonymous coding SNPs (no STOP codons).
(4) Use Tumor_Sample_Barcode, DepMap_ID to keep only rows with a valid Achilles name.
(5) With the filtered table, further restrict the columns to Hugo_Symbol, Protein_Change, DepMap_ID.
(6) Now the tricky part: we want to make "any", "any at position", and "specific SNP" versions of these.
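As a rough illustration of steps (2) through (5), here is a minimal Python sketch that uses only the standard library. The sample rows are invented, and the value labels it filters on ("SNP", "Missense_Mutation") and the set of valid cell-line IDs are assumptions for illustration; they should be checked against the actual 20Q2 file and the Achilles name list before relying on this.

```python
import csv
import io

# Tiny tab-separated stand-in for the 20Q2 mutations file (invented rows).
SAMPLE_TSV = (
    "Hugo_Symbol\tVariant_Classification\tVariant_Type\tTumor_Sample_Barcode\t"
    "Protein_Change\tVariant_annotation\tDepMap_ID\n"
    "KRAS\tMissense_Mutation\tSNP\tLINE_A\tp.G12D\tother non-conserving\tACH-000001\n"
    "KRAS\tSilent\tSNP\tLINE_A\tp.G12G\tsilent\tACH-000001\n"
    "TP53\tNonsense_Mutation\tSNP\tLINE_B\tp.R213*\tdamaging\tACH-000002\n"
    "TP53\tMissense_Mutation\tSNP\tLINE_C\tp.R175H\tother non-conserving\tACH-999999\n"
    "BRAF\tMissense_Mutation\tSNP\tLINE_B\tp.V600E\tother non-conserving\tACH-000002\n"
)

# Step (5): columns kept in the final table.
KEEP_COLUMNS = ("Hugo_Symbol", "Protein_Change", "DepMap_ID")


def filter_missense_snps(handle, valid_ids):
    """Keep single amino-acid-change SNP rows for cell lines in valid_ids."""
    reader = csv.DictReader(handle, delimiter="\t")  # file is TSV despite its name
    kept = []
    for row in reader:
        # Step (3): assumed labels for non-synonymous coding SNPs without STOPs.
        if row["Variant_Type"] != "SNP":
            continue
        if row["Variant_Classification"] != "Missense_Mutation":
            continue
        # Step (4): valid_ids is a hypothetical set of acceptable cell-line IDs.
        if row["DepMap_ID"] not in valid_ids:
            continue
        kept.append({column: row[column] for column in KEEP_COLUMNS})
    return kept


rows = filter_missense_snps(io.StringIO(SAMPLE_TSV), {"ACH-000001", "ACH-000002"})
for row in rows:
    print(row)
```

With the invented rows above, only the KRAS p.G12D and BRAF p.V600E rows survive: the silent and nonsense rows fail step (3), and the row with an unknown cell-line ID fails step (4).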

The attachment shows a simple example of what this should look like for each Hugo_Symbol: https://user-images.githubusercontent.com/2120526/109847956-c0794180-7c1d-11eb-8999-5aef60aac74a.png
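One possible reading of step (6), sketched in Python: each filtered row contributes to three binary features, one per level of specificity. The feature naming (gene, gene.pG12, gene.pG12D) and the regular expression for Protein_Change are assumptions made for this sketch, not the format shown in the attachment.

```python
import re
from collections import defaultdict

# Assumed Protein_Change shape for a single amino-acid change, e.g. "p.G12D".
PROTEIN_CHANGE = re.compile(r"^p\.([A-Z])(\d+)([A-Z])$")


def expand_features(hugo_symbol, protein_change):
    """Return the 'any', 'any at position', and 'specific SNP' feature names."""
    match = PROTEIN_CHANGE.match(protein_change)
    if match is None:
        return []  # not a simple substitution; handled elsewhere
    ref, position, alt = match.groups()
    return [
        hugo_symbol,                               # any mutation in the gene
        f"{hugo_symbol}.p{ref}{position}",         # any change at this position
        f"{hugo_symbol}.p{ref}{position}{alt}",    # this specific change
    ]


def build_calls(rows):
    """Map each feature name to the set of cell lines carrying it."""
    calls = defaultdict(set)
    for hugo_symbol, protein_change, depmap_id in rows:
        for feature in expand_features(hugo_symbol, protein_change):
            calls[feature].add(depmap_id)
    return dict(calls)


# Invented example rows: (Hugo_Symbol, Protein_Change, DepMap_ID).
example_rows = [
    ("KRAS", "p.G12D", "ACH-000001"),
    ("KRAS", "p.G12V", "ACH-000002"),
    ("KRAS", "p.Q61H", "ACH-000003"),
]
calls = build_calls(example_rows)
for feature in sorted(calls):
    print(feature, sorted(calls[feature]))
```

In this toy input, the "any" feature KRAS is set for all three cell lines, KRAS.pG12 for the first two, and each specific change for exactly one.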

We will come back to other rows in the original file separately as they have different interpretations (splice sites, fusions, truncations). We will come back to other columns in the original file separately as we will want to organize them in a less-redundant way.

bpetros95 commented 3 years ago

wait Paul, we may disagree on this pseudocode (relative to the version I sent Sandrine)

Stop codons are also relevant for my analysis (stop codon —> modified amino acid definitely doesn’t exist).

Here’s what I sent her:

The DepMap file that contains these mutations is called “CCLE_mutations.csv”; it is attached and can also be found here: https://depmap.org/portal/download/.

I believe that the problem boils down to the following:
(1) Download and import CCLE_mutations.csv into Matlab as a table.
(2) Remove all columns except Hugo_Symbol (+/-Entrez_Gene_Id), Variant_Class, Variant_Type, DepMap_ID, Protein_Change, and Variant_annotation.
(3) Filter mutations for Variant_Type == SNP.
(4) Use DepMap_ID to append ArxSpan_IDs or other cell-line ID of choice, for consistency with CTRP/Achilles files.
(5) Make a SNP_ID based on cases in which [Hugo_Symbol (+/-Entrez_Gene_Id), Variant_Class, Variant_Type, Protein_Change, Variant_annotation] are identical but DepMap_ID differs (e.g., same SNP for different cell lines).
(6) Create a binary matrix of cell lines by SNPs (e.g., 1 represents cell line with ArxSpan_ID x having mutation characterized by SNP_ID y).
(7) Remove DepMap_IDs and ArxSpan_IDs from the table.
(8) Save the table as a descriptive table of SNP identities, and the matrix as the corresponding table of SNP presence.
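The SNP_ID and binary-matrix steps of this list could look roughly like the Python sketch below (Python rather than Matlab, purely for illustration). It assumes that "Variant_Class" refers to the Variant_Classification column named earlier in the thread, leaves out the ArxSpan_ID lookup, and uses invented records.

```python
# Fields that together define one SNP; DepMap_ID is deliberately excluded.
SNP_FIELDS = (
    "Hugo_Symbol",
    "Variant_Classification",  # assumed to be what "Variant_Class" refers to
    "Variant_Type",
    "Protein_Change",
    "Variant_annotation",
)


def build_snp_matrix(records):
    """Return (cell_lines, snp_table, matrix) with matrix[i][j] in {0, 1}."""
    snp_index = {}    # descriptor tuple -> column index
    snp_table = []    # descriptive table of SNP identities
    line_index = {}   # DepMap_ID -> row index
    cell_lines = []
    present = set()   # (row index, column index) pairs that are 1
    for rec in records:
        key = tuple(rec[field] for field in SNP_FIELDS)
        if key not in snp_index:
            snp_index[key] = len(snp_table)
            snp_table.append({"SNP_ID": len(snp_table) + 1, **dict(zip(SNP_FIELDS, key))})
        line = rec["DepMap_ID"]
        if line not in line_index:
            line_index[line] = len(cell_lines)
            cell_lines.append(line)
        present.add((line_index[line], snp_index[key]))
    matrix = [
        [1 if (i, j) in present else 0 for j in range(len(snp_table))]
        for i in range(len(cell_lines))
    ]
    return cell_lines, snp_table, matrix


def record(gene, change, depmap_id):
    """Build one invented missense SNP record for the example."""
    return {
        "Hugo_Symbol": gene,
        "Variant_Classification": "Missense_Mutation",
        "Variant_Type": "SNP",
        "Protein_Change": change,
        "Variant_annotation": "other non-conserving",
        "DepMap_ID": depmap_id,
    }


records = [
    record("KRAS", "p.G12D", "ACH-000001"),
    record("KRAS", "p.G12D", "ACH-000002"),
    record("BRAF", "p.V600E", "ACH-000002"),
]
cell_lines, snp_table, matrix = build_snp_matrix(records)
print(cell_lines)
print(matrix)
```

The two KRAS records share one SNP_ID because they differ only in DepMap_ID, so the example yields two cell lines, two SNPs, and a 2 x 2 presence matrix.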

(Email reply to the comment PAC posted on Mar 3, 2021, at 12:42 PM, quoted above.)

remontoire-pac commented 3 years ago

@bpetros95 - our first priority is to deliver a SNP file that is consistent with the rest of ICE. The final format will therefore be "ICE-like" (most similar to the current annotations, since these are just binary calls and don't have a value). It will exist as a sparse matrix (row_idx, col_idx) with separate tables to indicate the meanings of row_idx and col_idx respectively. It will be available in both MAT and CSV formats.
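To make the sparse layout concrete, here is a small Python sketch that turns binary calls into (row_idx, col_idx) pairs plus two lookup tables and renders the pairs as CSV text. The orientation (features as rows, cell lines as columns), the 1-based indices, and the header names are assumptions for this sketch rather than the settled ICE format.

```python
import csv
import io


def to_sparse_tables(calls):
    """calls maps a feature name to the set of cell lines where it is present."""
    row_names = sorted(calls)
    col_names = sorted({line for lines in calls.values() for line in lines})
    # 1-based indices are an assumption, chosen to match MATLAB indexing.
    col_index = {name: j for j, name in enumerate(col_names, start=1)}
    pairs = []
    for i, feature in enumerate(row_names, start=1):
        for line in sorted(calls[feature]):
            pairs.append((i, col_index[line]))
    return row_names, col_names, pairs


def write_csv(header, rows):
    """Render rows as CSV text; a real run would write to a file instead."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Invented binary calls for two features in two cell lines.
example_calls = {
    "KRAS.pG12D": {"ACH-000001"},
    "BRAF.pV600E": {"ACH-000001", "ACH-000002"},
}
row_names, col_names, pairs = to_sparse_tables(example_calls)
sparse_csv = write_csv(["row_idx", "col_idx"], pairs)
row_csv = write_csv(["row_idx", "feature"], enumerate(row_names, start=1))
col_csv = write_csv(["col_idx", "cell_line"], enumerate(col_names, start=1))
print(sparse_csv)
```

Only the present calls are stored, so the pair list grows with the number of mutations rather than with features times cell lines.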

I think that with the one exception you mention, what you need is a subset of what I am proposing. The exception is STOP codons, which we could record in this run without changing my procedure much (e.g., simply allow things like KRAS.pG12* instead of excluding them).

I'm willing to agree to this if you agree this is the only difference where you would lose information.

remontoire-pac commented 3 years ago

[image] Following our discussion, we will start with the prescription shown in this image for item (6) in the pseudocode above.

sandrine-m commented 3 years ago

Just wanted to add a note here to be consistent with the ongoing Slack discussion. At the moment we have agreed on a simple version containing a binary matrix of "variant IDs" x CCLE cell lines, plus a matrix of metadata for the "variant IDs" containing the associated names and a characterization of each variant as binary classes (to ease filtering).

We were unclear on the type of variants we would include. I propose we include all variants that lead to a protein change (which is more inclusive than non-synonymous SNPs). Indeed, when studying function, it makes sense to focus on the changes that could affect protein function.

While implementing the idea, I realized that using (gene & protein change) as a unique key ID is not accurate: in some instances, different nucleotide changes lead to the same (gene, protein change) pair because of the degeneracy of the genetic code. The key I used is the genome change, which is almost unique (it will be unique once I fix the 106 inconsistencies in location information, spanning genome changes in 3 genes, in this file). For the moment those inconsistencies are filtered out (no associated protein change).
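The key problem described here can be checked mechanically. The Python sketch below groups records by (gene, protein change) and reports pairs reached by more than one genome change. The column name Genome_Change, the gene name, and the coordinates are assumptions invented for the example; the codon logic is real (from a TTT phenylalanine codon, both T>C at the first base and T>A at the third base give leucine).

```python
from collections import defaultdict


def find_key_collisions(records):
    """Return (gene, protein change) pairs produced by several genome changes."""
    by_pair = defaultdict(set)
    for rec in records:
        pair = (rec["Hugo_Symbol"], rec["Protein_Change"])
        by_pair[pair].add(rec["Genome_Change"])
    return {
        pair: sorted(changes)
        for pair, changes in by_pair.items()
        if len(changes) > 1
    }


# Invented records for a hypothetical gene, GENE1.
records = [
    {"Hugo_Symbol": "GENE1", "Protein_Change": "p.F10L", "Genome_Change": "g.chr1:1000T>C"},
    {"Hugo_Symbol": "GENE1", "Protein_Change": "p.F10L", "Genome_Change": "g.chr1:1002T>A"},
    {"Hugo_Symbol": "GENE1", "Protein_Change": "p.A20V", "Genome_Change": "g.chr1:1031C>T"},
]
collisions = find_key_collisions(records)
for pair, changes in collisions.items():
    print(pair, changes)
```

Any pair reported here would be collapsed into one row if (gene, protein change) were used as the key, which is the argument for keying on the genome change instead.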

Finally, I was thinking of implementing an additional layer of synthetic info (the inferred rows) in a second file. I think separating into layers makes more sense, as it is less redundant and therefore more ICE-like.

Looking forward to hearing your feedback! I am of course open to adapting what we have now based on your feedback and needs.