taylor-lab / hotspots

Identifying recurrent mutations in cancer
http://www.ncbi.nlm.nih.gov/pubmed/26619011
GNU Affero General Public License v3.0

Code makes assumption not valid for official TCGA MAF #8

Closed by ctokheim 7 years ago

ctokheim commented 7 years ago

Hi,

I'm getting an error originating from the amino acid length being NA.

From reading the code, it looks like you assume the "Protein_position" column has the form "position/length", where "position" is the amino acid position of the mutation and "length" is the total length of the protein. The MAF file from TCGA does contain a "Protein_position" column, but it holds only the "position" part and nothing about the protein length.

Collin
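To illustrate the mismatch described above, here is a minimal Python sketch of a parser that expects "position/length". It is not the hotspots code; the function name `parse_protein_position` and the example values are assumptions made for illustration only. It shows how a position-only value leaves the length undefined.

```python
import math


def parse_protein_position(value):
    """Split a 'position/length' string into (position, length).

    Any part that is missing or non-numeric becomes NaN, which mimics
    the NA a parser expecting 'position/length' would end up with.
    """
    pos_text, _, len_text = str(value).partition("/")

    def to_number(text):
        # Ranges such as "12-13" are reduced to their first coordinate.
        first = text.split("-")[0].strip()
        return float(first) if first.isdigit() else math.nan

    return to_number(pos_text), to_number(len_text)


# Toy values, not taken from any real MAF.
with_length = parse_protein_position("132/414")
position_only = parse_protein_position("132")
```

With the position-only input, the second element of the result is NaN, which matches the symptom reported here.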

ctokheim commented 7 years ago

I left this out, but the MAF I'm referring to is here: https://synapse.org/MC3

ctokheim commented 7 years ago

I fetched the Ensembl protein lengths using the biomaRt R package and added them to the "Protein_position" column in the "position/length" format. With that fix applied to the MAF, the algorithm starts running.
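A workaround of this kind could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the `lengths` dictionary stands in for whatever lookup was retrieved from Ensembl, the `Transcript_ID` key column is assumed, and the helper name `add_protein_length` and all sample rows are hypothetical.

```python
import csv
import io


def add_protein_length(rows, lengths, id_column="Transcript_ID"):
    """Yield MAF rows with Protein_position rewritten as 'position/length'.

    `lengths` is a hypothetical dict mapping an identifier to the protein
    length in amino acids. Rows that already contain a '/', have an empty
    position, or have no known length are passed through unchanged.
    """
    for row in rows:
        position = row["Protein_position"]
        length = lengths.get(row[id_column])
        if position and "/" not in position and length is not None:
            row = dict(row, Protein_position=f"{position}/{length}")
        yield row


# Toy tab-delimited input; gene names, IDs, and numbers are placeholders.
maf_text = (
    "Hugo_Symbol\tTranscript_ID\tProtein_position\n"
    "GENE_A\tTX_A\t132\n"
    "GENE_B\tTX_B\t61/189\n"
    "GENE_C\tTX_C\t45\n"
)
lengths = {"TX_A": 414, "TX_B": 189}

reader = csv.DictReader(io.StringIO(maf_text), delimiter="\t")
patched = list(add_protein_length(reader, lengths))
```

Rows without a known length are left as they are, so they would still need separate handling before running the algorithm.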

ckandoth commented 7 years ago

Hi Collin. "Protein_position" in the format you described is the output of the VEP annotator. You can use maf2maf from the vcf2maf repo, which runs VEP to standardize MAF files into a format usable by most MAF parsers.

ctokheim commented 7 years ago

Ok, good to know. Thanks Cyriac!