mskcc / vcf2maf

Convert a VCF into a MAF, where each variant is annotated to only one of all possible gene isoforms
Other
372 stars 216 forks source link

Can maf2vcf handle hg38 maf as input ? #162

Closed ShixiangWang closed 6 years ago

ShixiangWang commented 6 years ago

Hi,

I want to transform hg38 maf to vcf,

$ ../biotools/vcf2maf/maf2vcf.pl --input-maf ../data/TCGA.LUAD.mutect.hg38.somatic.maf --output-dir ./test_vcfs
ERROR: Provided Reference FASTA is missing or empty! Path: ~/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz

I know how to download the file, but the info remind me using GRCh37 while maf file I use based on hg38. Can I use this command, or I should use hg19 maf file?

Best wishes, Shixiang

ShixiangWang commented 6 years ago

Same errors no matter what genome build I use:

# use hg38 maf while ensembl v75 fa.gz
$ ../biotools/vcf2maf/maf2vcf.pl --input-maf ../data/TCGA.LUAD.mutect.hg38.somatic.maf --output-dir ./test_vcfs
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
ERROR: Make sure that ref-fasta is the same genome build as your MAF: /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz

# use hg38 maf while ensembl v91 GRCh38 fa.gz
$../biotools/vcf2maf/maf2vcf.pl --input-maf ../data/TCGA.LUAD.mutect.hg38.somatic.maf --output-dir ./test_vcfs --ref-fasta ~/.vep/homo_sapiens/91_GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz 
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
ERROR: Make sure that ref-fasta is the same genome build as your MAF: /home/diviner-wsx/.vep/homo_sapiens/91_GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

# use hg19 maf while ensembl v75 fa.gz
$../biotools/vcf2maf/maf2vcf.pl --input-maf ~/Downloads/gsc_LUAD_pairs.aggregated.capture.tcga.uuid.automated.somatic.maf --output-dir ./test_vcfs --ref-fasta ~/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz 
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
ERROR: Make sure that ref-fasta is the same genome build as your MAF: /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz

hg19 maf download from https://portal.gdc.cancer.gov/legacy-archive/files/b2e25bdf-f2b5-4a37-b330-05251ea09f2c

hg38 maf download from https://portal.gdc.cancer.gov/files/0458c57f-316c-4a7c-9294-ccd11c97c2f9

Please help me.

ShixiangWang commented 6 years ago

I try using UCSC hg19 and hg38, and the outcome is

$ ../biotools/vcf2maf/maf2vcf.pl --input-maf ~/Downloads/gsc_LUAD_pairs.aggregated.capture.tcga.uuid.automated.somatic.maf --output-dir ./test_vcfs --ref-fasta ~/Annotation/reference/hg19.fa
[W::fai_fetch] Reference 19:58420782-58420784 not found in FASTA file, returning empty sequence
Failed to fetch sequence in 19:58420782-58420784
[W::fai_fetch] Reference 10:26581394-26581396 not found in FASTA file, returning empty sequence
Failed to fetch sequence in 10:26581394-26581396
[W::fai_fetch] Reference 14:20528636-20528638 not found in FASTA file, returning empty sequence
Failed to fetch sequence in 14:20528636-20528638
[W::fai_fetch] Reference 11:55657434-55657436 not found in FASTA file, returning empty sequence
Failed to fetch sequence in 11:55657434-55657436
[W::fai_fetch] Reference 1:78269058-78269060 not found in FASTA file, returning empty sequence
Failed to fetch sequence in 1:78269058-78269060
ERROR: Make sure that ref-fasta is the same genome build as your MAF: /home/diviner-wsx/Annotation/reference/hg19.fa

$ ../biotools/vcf2maf/maf2vcf.pl --input-maf ../data/TCGA.LUAD.mutect.hg38.somatic.maf --output-dir ./test_vcfs --ref-fasta ~/Annotation/reference/hg38.fa

$ cd test_vcfs/

$ ls -lh
total 1.8G
-rw-rw-r-- 1 diviner-wsx diviner-wsx  33K 3月  28 18:14 TCGA.LUAD.mutect.hg38.somatic.pairs.tsv
-rw-rw-r-- 1 diviner-wsx diviner-wsx 1.8G 3月  28 18:17 TCGA.LUAD.mutect.hg38.somatic.vcf

Thank your tool, the hg38 works.

So the hg19 geome build TCGA used is neither equals to ensembl v75 or UCSC hg19?

ShixiangWang commented 6 years ago

But the conent of vcf is very weird. :cry: How to explain this???

image

Left is outcome and the right is normal vcf file.

I open it using notepad++,

image

It seems normal, I am not sure. What the right part means?

ckandoth commented 6 years ago

@ShixiangWang there are many different files and commands you're using here. There is only one rule for maf2vcf - the --ref-fasta chromosome names and lengths must exactly match the chromosomes used in the --input-maf.

ShixiangWang commented 6 years ago

Thanks, I will write a script to check the output.

发自我的 iPhone

在 2018年3月28日,下午10:13,Cyriac Kandoth notifications@github.com<mailto:notifications@github.com> 写道:

@ShixiangWanghttps://github.com/ShixiangWang there are many different files and commands you're using here. There is only one rule for maf2vcf - the --ref-fasta chromosome names and lengths must exactly match the chromosomes used in the --input-maf.

― You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/mskcc/vcf2maf/issues/162#issuecomment-376900796, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AX5Y5DOg3Xov5kkNApywqdrwm3in9rUoks5ti5nngaJpZM4S-GNY.

ckandoth commented 6 years ago

It is most likely that your hg19.fa uses chr-prefixes for the chromosome names, while the MAF file does not. For example, your hg19.fa uses chr7 for chromosome 7, while the MAF uses just 7. Read more about this issue here - https://www.biostars.org/p/119295/#119308

ShixiangWang commented 6 years ago

@ckandoth Yes, the hg19.fa is different from GRCh37.fa

$ head Annotation/reference/hg19.fa
>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

$ zcat ~/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz | head
>1 dna:chromosome chromosome:GRCh37:1:1:249250621:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

While Neither Ensembl v75 or hg19.fa from UCSC can work with this hg19 maf file download from TCGA https://portal.gdc.cancer.gov/legacy-archive/files/b2e25bdf-f2b5-4a37-b330-05251ea09f2c.

$.maf2vcf.pl --input-maf ~/Downloads/gsc_LUAD_pairs.aggregated.capture.tcga.uuid.automated.somatic.maf --output-dir ./test_vcfs --ref-fasta ~/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz 
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
[E::fai_build3] Cannot index files compressed with gzip, please use bgzip
Could not load fai index of /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
ERROR: Make sure that ref-fasta is the same genome build as your MAF: /home/diviner-wsx/.vep/homo_sapiens/91_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz

$ maf2vcf.pl --input-maf ~/Downloads/gsc_LUAD_pairs.aggregated.capture.tcga.uuid.automated.somatic.maf --output-dir ./test_vcfs --ref-fasta ~/Annotation/reference/hg19.fa
[W::fai_fetch] Reference 19:58420782-58420784 not found in FASTA file, returning empty sequence
Failed to fetch sequence in 19:58420782-58420784
[W::fai_fetch] Reference 10:26581394-26581396 not found in FASTA file, returning empty sequence
Failed to fetch sequence in 10:26581394-26581396
[W::fai_fetch] Reference 14:20528636-20528638 not found in FASTA file, returning empty sequence
Failed to fetch sequence in 14:20528636-20528638
[W::fai_fetch] Reference 11:55657434-55657436 not found in FASTA file, returning empty sequence
Failed to fetch sequence in 11:55657434-55657436
[W::fai_fetch] Reference 1:78269058-78269060 not found in FASTA file, returning empty sequence
Failed to fetch sequence in 1:78269058-78269060
ERROR: Make sure that ref-fasta is the same genome build as your MAF: /home/diviner-wsx/Annotation/reference/hg19.fa

Should I use GRCh 37.p13 from ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh37.p13/ instead of Ensembl v75? I am haunted by the three different source and I don't know how to pick a correct assembly.fa to work with TCGA hg19 maf file. I don't want to bother you more but can you give me a suggestion?

Thank you very much, Shixiang

ckandoth commented 6 years ago

The error [E::fai_build3] Cannot index files compressed with gzip, please use bgzip is easy to fix. It is usually due to having an older version of Perl. If it is not easy for you to upgrade your Perl, then unzip the GRCh37 fasta using gzip -d. VEP should work fine with unzipped FASTAs.

ShixiangWang commented 6 years ago

@ckandoth Thanks for your help.