mskcc / vcf2maf

Convert a VCF into a MAF, where each variant is annotated to only one of all possible gene isoforms

The best solution for one-to-one correspondence between genes and transcripts #343

Open user-tq opened 1 year ago

user-tq commented 1 year ago

Thank you for developing this awesome tool. I would like to know the best practice for selecting a unique transcript per gene with vcf2maf. I am working in a clinical analysis scenario, focusing on dozens of genes. I plan to build a gene-to-transcript table for GRCh37 based on the MANE project, then have vcf2maf accept this table and filter on it. I noticed that --custom-enst seems able to solve this problem. But I am a bit confused: different transcript versions should produce different tables. How can I specify my transcript version? And if my data contains genes outside these dozens, will they perhaps not be annotated?

user-tq commented 1 year ago

To obtain transcripts for as many genes as possible, I did the following:

```
zcat /mnt/tool/software_tq/myscript/MANE.GRCh38.v1.0.summary.txt.gz | awk -F'\t' '{print $4,$6,$8,$10}' | grep 'MANE Select' | awk '{print $3}' | awk -F. '{print $1}' > MANE.list

vcf2maf.pl --input-vcf ../vcfs/patient101.vcf \
    --output-maf test_patient101.maf \
    --tumor-id patient101.tumor \
    --ref-fasta /mnt/tool/ref_source/iGenomes/references/Homo_sapiens/GATK/GRCh37/Sequence/WholeGenomeFasta/human_g1k_v37_decoy.fasta \
    --vep-data /mnt/script/tanq/snakemake/ngs-pipeline/vep_cache \
    --ncbi-build GRCh37 \
    --vep-path $vep_path \
    --maf-center mane_test \
    --normal-id patient101.normal \
    --vep-overwrite \
    --verbose \
    --custom-enst MANE.list
```
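The four-stage pipeline above can probably be collapsed into a single awk call. The following is a minimal sketch, assuming (as the command above implies) that column 8 of the MANE summary file holds the versioned Ensembl transcript ID and column 10 holds the MANE status. The function names `mane_select_enst` and `sample_rows` are hypothetical, and every field in the sample rows other than the KMT2B IDs quoted in this thread is a placeholder.

```shell
# Print version-stripped Ensembl transcript IDs for rows whose status is
# exactly "MANE Select". Column positions (8 and 10) are taken from the
# command in the post above; check them against your file's header line.
mane_select_enst() {
  awk -F'\t' '$10 == "MANE Select" { sub(/\.[0-9]+$/, "", $8); print $8 }'
}

# Illustrative two-row input. Only the KMT2B RefSeq/Ensembl IDs come from
# this thread; all other fields are placeholders.
sample_rows() {
  printf 'id1\tENSG_A\thgnc1\tKMT2B\tname\tNM_014727.3\tNP_A\tENST00000420124.4\tENSP_A\tMANE Select\n'
  printf 'id2\tENSG_B\thgnc2\tGENE2\tname\tNM_B.1\tNP_B\tENST_B.2\tENSP_B\tMANE Plus Clinical\n'
}

sample_rows | mane_select_enst
```

On the real file this would be run as `zcat MANE.GRCh38.v1.0.summary.txt.gz | mane_select_enst > MANE.list`. Matching the status column exactly, rather than grepping the whole line, skips the header row and any row with a different status.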

In the end, I realized that this was a bad idea. For example, when I annotate KMT2B with VEP on GRCh37, the transcript paired with a RefSeq ID is ENST00000222270 (NM_014727.1):

KMT2B,frameshift_variant,p.Asp375GlufsTer11,ENST00000222270,NM_014727.1;
KMT2B,frameshift_variant,p.Asp375GlufsTer11,ENST00000420124,;
KMT2B,frameshift_variant,p.Asp375GlufsTer11,ENST00000341701,;
ZBTB32,downstream_gene_variant,,ENST00000262630,NM_014383.1;
ZBTB32,downstream_gene_variant,,ENST00000392197,;
ZBTB32,downstream_gene_variant,,ENST00000426659,;
KMT2B,non_coding_transcript_exon_variant,,ENST00000607650,;
KMT2B,non_coding_transcript_exon_variant,,ENST00000606995,;
ZBTB32,downstream_gene_variant,,ENST00000481182,;

But in MANE Select (based on GRCh38), KMT2B maps to NM_014727.3 / ENST00000420124.4.
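A mismatch like this could be flagged before running vcf2maf. The sketch below assumes effect lines in the `gene,consequence,HGVSp,ENST,RefSeq;` layout shown above and a list file with one version-stripped ENST ID per line. It prints every transcript that carries a RefSeq ID in the GRCh37 annotation but is absent from the list. The function name `flag_refseq_mismatch` is hypothetical, and this is only an illustration, not part of vcf2maf.

```shell
# Usage: flag_refseq_mismatch LIST_FILE < effects.txt
# LIST_FILE: one ENST ID per line (e.g. a MANE-derived list).
# stdin:     lines of the form gene,consequence,HGVSp,ENST,RefSeq;
flag_refseq_mismatch() {
  awk -F',' -v list="$1" '
    BEGIN { while ((getline id < list) > 0) mane[id] = 1 }  # load the list
    { sub(/;$/, "", $5) }                                   # drop trailing ";"
    $5 != "" && !($4 in mane) { print $1, $4, $5 }          # RefSeq-paired but not listed
  '
}

# Demo with two of the KMT2B lines quoted above and a one-entry list.
list_file="$(mktemp)"
printf 'ENST00000420124\n' > "$list_file"
printf '%s\n' \
  'KMT2B,frameshift_variant,p.Asp375GlufsTer11,ENST00000222270,NM_014727.1;' \
  'KMT2B,frameshift_variant,p.Asp375GlufsTer11,ENST00000420124,;' |
  flag_refseq_mismatch "$list_file"
rm -f "$list_file"
```

The demo reports ENST00000222270, the transcript that the GRCh37 annotation pairs with a RefSeq ID but that the GRCh38-derived list does not contain.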

nuttynutmore commented 5 months ago

I have a similar question.

I have a list of genes, some of which I'd like to annotate using manually curated ENST IDs, and for the rest I am using the canonical ENST ID.

My question is: Can I just provide the custom ENST IDs that I need, or do I need to provide a full list of ENST IDs if I am using the --custom-enst flag? (i.e. by providing a partial list, will only the genes with ENST IDs in that list get annotated?)

Thanks, Kind regards, nuttynutmore

ckandoth commented 5 months ago

@nuttynutmore - you are correct. If you're happy with VEP's selection of canonical isoform, then you don't need to include it in your --custom-enst list.

@user-tq - the logic for selecting a single reportable effect on a single transcript per gene is implemented here. Typically, your --custom-enst file would list only one transcript per gene that we will prioritize for reporting. But if you list two of them, then the transcript with the higher-priority consequence is used. E.g. if the variant is intronic on one of your preferred isoforms and missense on the other, the missense effect is the one reported.
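The tie-break described above can be illustrated with a small sketch. This is not vcf2maf's actual implementation: the function name `pick_preferred_effect`, the four-entry rank table (a tiny subset ordered by severity), and the transcript IDs `ENST_A`, `ENST_B`, and `ENST_C` are all assumptions made for illustration. The sketch also ignores the fallback to VEP's canonical isoform for genes that have no listed transcript.

```shell
# Usage: pick_preferred_effect LIST_FILE < effects.txt
# LIST_FILE: preferred ENST IDs, one per line.
# stdin:     lines of the form gene,consequence,HGVSp,ENST
# Prints, per gene, the effect with the highest-priority (lowest-rank)
# consequence among the preferred transcripts.
pick_preferred_effect() {
  awk -F',' -v list="$1" '
    BEGIN {
      while ((getline id < list) > 0) preferred[id] = 1
      # Hypothetical, abbreviated priority table (lower = more severe).
      rank["frameshift_variant"] = 1
      rank["missense_variant"] = 2
      rank["intron_variant"] = 3
      rank["downstream_gene_variant"] = 4
    }
    ($4 in preferred) && ($2 in rank) {
      if (!($1 in best) || rank[$2] < best[$1]) { best[$1] = rank[$2]; line[$1] = $0 }
    }
    END { for (g in line) print line[g] }
  '
}

# Demo: two preferred isoforms of a placeholder gene; ENST_C is not listed.
list_file="$(mktemp)"
printf 'ENST_A\nENST_B\n' > "$list_file"
printf '%s\n' \
  'GENE1,intron_variant,,ENST_A' \
  'GENE1,missense_variant,p.X,ENST_B' \
  'GENE1,frameshift_variant,p.Y,ENST_C' |
  pick_preferred_effect "$list_file"
rm -f "$list_file"
```

In the demo the missense effect on `ENST_B` wins over the intronic effect on `ENST_A`, and the frameshift on `ENST_C` is ignored because that transcript is not in the preferred list.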