griffithlab / pVACtools

http://www.pvactools.org
BSD 3-Clause Clear License
137 stars 59 forks source link

vep.vcf and readcount file must be sorted? #158

Closed Leehyeonjin93 closed 6 years ago

Leehyeonjin93 commented 6 years ago

Hi, I have a problem about inputfile.

I have vep.vcf file that include chromosome position with "chr". but my readcount file have not "chr" letter, DNA(normal,tumor),RNA all of them.

so, I altered readcount file and that included "chr" letter.

but, coverage file was empty. vep.vcf, cufflinks results and readcount file must be sorted? can I use altered readcount file with vep.vcf file?

Leehyeonjin93 commented 6 years ago

sorry, I sorted all file but I didn't get final file.

I think that isn't sorting problem.

perhaps, Can't vep.vcf,readcount and fpkm_tracking files get letter "chr" in position column?

susannasiebert commented 6 years ago

For the bam-readcount parsing to work, the chromosome names need to match between the VCF and the bam-readcount files. However, sorting is not required.

When you say that you "didn't get a final file", did the pVACseq run error out or does the final file not contain any predicted epitopes? If it errors out, please post the error. If the file is empty (except for the header line) then all your neoeptiopes got filtered out. You can see the unfiltered list of neoepitopes in the <sample_name>.combined.parsed.tsv file.

Leehyeonjin93 commented 6 years ago

thanks, but I didn't get error messages. so, I checked .combined.parsed.tsv. like this, sample_name combined parsed tsv this is a .combined.parsed.tsv file.

so, I checked input files. readcount (readcount file) vep vcf (vep.VCF file) cufflinks (cufflinks result file)

but, I don't know cause of problem.

susannasiebert commented 6 years ago

What does your pvacseq command look like? Are you providing a --additional-input-file-list?

Leehyeonjin93 commented 6 years ago

yes, this is my running command input

and this is my input log. inputcommand

thank you.

susannasiebert commented 6 years ago

Hm, I don't see anything obviously wrong. Can you attach the actual data files or email them to our help alias (help@pvactools.org) so that I can run this data through the debugger? It would also be sufficient to send me shortened data files that only contain the entries for variant chr19 58864361, since that is a case we expect to have readcounts attached.

susannasiebert commented 6 years ago

I received your files and ran them locally and am unable to replicate the problem. My additional_input_file list looks like this:

$ cat additional_input_file_list.yaml 
gene_expn_file: ch19.genes.tracking
transcript_expn_file: ch19.isoforms.tracking
tdna_snvs_coverage_file: BC16158_T_DNA_bam.snp.Somatic.chr19.sorted.readcount
normal_snvs_coverage_file: BC16158_N_DNA_bam.snp.Somatic.chr19.sorted.readcount
trna_snvs_coverage_file: BC16158_T_RNA_bam.snp.Somatic.chr19.sorted.readcount

The command I ran was: pvacseq run -e 9 -i additional_input_file_list.yaml BC16158_WES_alignmac.varscan2.snp.Somatic.ch19.vep.vcf BC16158 HLA-A*30:01 NetMHC BC16158 My results look like this. It shows that the vaf, coverage, and expression was parsed correctly for at least some of the samples. The rest were positions that were not in the additional input files:

$ cut -f1-3,15-22 BC16158/MHC_Class_I/BC16158.tsv
chromosome_name start   stop    transcript_expression   gene_expression normal_depth    normal_vaf  tdna_depth  tdna_vaf    trna_depth  trna_vaf
chr19   13879198    13879199    6.22641 16.7291 219 0.0 294 31.292515942431432  161 54.65838169823716
chr19   13879198    13879199    2.60674 16.7291 219 0.0 294 31.292515942431432  161 54.65838169823716
chr19   14877815    14877816    0.410364    2.59447 NA  NA  NA  NA  NA  NA
chr19   14877815    14877816    0.154975    2.59447 NA  NA  NA  NA  NA  NA
chr19   14877815    14877816    0.199177    2.59447 NA  NA  NA  NA  NA  NA
chr19   14877815    14877816    4.55976e-20 2.59447 NA  NA  NA  NA  NA  NA
chr19   14877815    14877816    0.0689099   2.59447 NA  NA  NA  NA  NA  NA
chr19   14877815    14877816    1.01028e-19 2.59447 NA  NA  NA  NA  NA  NA
chr19   14877815    14877816    0.179855    2.59447 NA  NA  NA  NA  NA  NA
chr19   14877815    14877816    0.0819899   2.59447 NA  NA  NA  NA  NA  NA
chr19   14877815    14877816    0.111585    2.59447 NA  NA  NA  NA  NA  NA
chr19   14877819    14877820    0.410364    2.59447 NA  NA  NA  NA  NA  NA
chr19   14877819    14877820    0.154975    2.59447 NA  NA  NA  NA  NA  NA
chr19   14877819    14877820    0.199177    2.59447 NA  NA  NA  NA  NA  NA
chr19   14877819    14877820    4.55976e-20 2.59447 NA  NA  NA  NA  NA  NA
chr19   14877819    14877820    0.0689099   2.59447 NA  NA  NA  NA  NA  NA
chr19   14877819    14877820    1.01028e-19 2.59447 NA  NA  NA  NA  NA  NA
chr19   14877819    14877820    0.179855    2.59447 NA  NA  NA  NA  NA  NA
chr19   14877819    14877820    0.0819899   2.59447 NA  NA  NA  NA  NA  NA
chr19   14877819    14877820    0.111585    2.59447 NA  NA  NA  NA  NA  NA
chr19   20727618    20727619    0.0941472   0.0941614   NA  NA  NA  NA  NA  NA
chr19   20727619    20727620    0.0941472   0.0941614   NA  NA  NA  NA  NA  NA
chr19   43579556    43579557    0   0.0 98  0.0 275 33.818180588429804  NA  NA
chr19   43579556    43579557    0   0.0 98  0.0 275 33.818180588429804  NA  NA
chr19   58864361    58864362    9.01375e-14 7.47729 89  0.0 181 38.12154485516327   8   62.49992187509766
chr19   58864361    58864362    9.72373e-52 7.47729 89  0.0 181 38.12154485516327   8   62.49992187509766

In the combined.parsed.file when I look for the last variant:

$ grep -h -E 'start|58864361' BC16158/MHC_Class_I/BC16158.combined.parsed.tsv | cut -f1-3,22-29 | head
chr19   58864361    58864362    181 38.12154485516327   8   62.49992187509766   89  0.0 7.47729 9.01375e-14
chr19   58864361    58864362    181 38.12154485516327   8   62.49992187509766   89  0.0 7.47729 9.72373e-52
chr19   58864361    58864362    181 38.12154485516327   8   62.49992187509766   89  0.0 7.47729 9.72373e-52
chr19   58864361    58864362    181 38.12154485516327   8   62.49992187509766   89  0.0 7.47729 9.01375e-14
chr19   58864361    58864362    181 38.12154485516327   8   62.49992187509766   89  0.0 7.47729 9.72373e-52
chr19   58864361    58864362    181 38.12154485516327   8   62.49992187509766   89  0.0 7.47729 9.01375e-14
chr19   58864361    58864362    181 38.12154485516327   8   62.49992187509766   89  0.0 7.47729 9.72373e-52
chr19   58864361    58864362    181 38.12154485516327   8   62.49992187509766   89  0.0 7.47729 9.01375e-14
chr19   58864361    58864362    181 38.12154485516327   8   62.49992187509766   89  0.0 7.47729 9.72373e-52
chr19   58864361    58864362    181 38.12154485516327   8   62.49992187509766   89  0.0 7.47729 9.01375e-14
susannasiebert commented 6 years ago

When you ran your pvacseq command the last time, was the output directory empty?

Leehyeonjin93 commented 6 years ago

oh... thank you.!! I knew what was problem. in my case, on additional_input_file_list.yaml, two lists were changed each other. like this, gene_expn_file: ch19.genes.tracking transcript_expn_file: ch19.isoforms.tracking normal_snvs_coverage_file: BC16158_N_DNA_bam.snp.Somatic.chr19.sorted.readcount tdna_snvs_coverage_file: BC16158_T_DNA_bam.snp.Somatic.chr19.sorted.readcount trna_snvs_coverage_file: BC16158_T_RNA_bam.snp.Somatic.chr19.sorted.readcount

so, I followed your command. then, removed all previous tsv file and tmp file.

finally, I got final files. thank you!