twolinin / longphase

GNU General Public License v3.0
99 stars 9 forks source link

output vcf files from phasing contain many repetitive header lines #30

Open ilivyatan opened 10 months ago

ilivyatan commented 10 months ago

Hi, This is a really great tool and works very fast, thanks! We've been trying it out in different ways and many times the resulting phased VCF files contain multiple lines of headers in the middle of the files. For example, these lines appear 913 times in one file!

FORMAT=

longphaseVersion=1.5.1

commandline="phase -s data/27909_pass.wf_snp.vcf.gz --sv-file data/27909_pass.wf_sv.vcf.gz --mod-file mod_27909.vcf -b data/27909_pass_merged.bam -r data/GRCh38.no_alt_analysis_set.fasta -t 16 -o phased_27909_mod --indels --ont "

This also occurs with joint phasing of just SNV and SV vcf files. Why do comments appear in the middle of the output VCF files?

twolinin commented 10 months ago

Hi @ilivyatan,

Can you provide the header for the 27909_pass.wf_snp.vcf.gz?

Thanks

ilivyatan commented 10 months ago

head_27909.vcf.txt

ilivyatan commented 10 months ago

Any updates? Is the header readable?

twolinin commented 10 months ago

Hi @ilivyatan

The header file you provided checks out fine. In the next version, additional conditions will be added to try to prevent issues with repetitive headers.

ilivyatan commented 10 months ago

The attached file is just the excerpt of the header up until the first column header line:

chrom start end.... line

AFTER that line and after the data starts there are 913 additional header-like lines INTERLEAVED within, and running all throughout the file. We just parsed them out and carried on with the VCF, but they shouldn't be there.

twolinin commented 10 months ago
  1. Based on your command, which VCF file has repetitive headers? phased_27909_mod.vcf or phased_27909_mod_mod.vcf
  2. Can you use grep -n "#" phased_27909_mod.vcf to provide the problematic headers?
ilivyatan commented 10 months ago

It's in phased_27909_mod.vcf The (grep) output you requested is attached : phased_27909_mod_with_interleaved_comments.vcf.txt

twolinin commented 10 months ago

I noticed that the following line is missing in the phased header file you provided. Could you please confirm this for me?

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  27909_pass

By the way, could you test with the latest version, v1.5.2, to see if the same issue?

wget https://github.com/twolinin/longphase/releases/download/v1.5.2/longphase_linux-x64.tar.xz
tar -xJf longphase_linux-x64.tar.xz
SamStudio8 commented 6 months ago

We have also encountered this problem. I've taken a look at SnpParser::writeLine and the conditions required to reach the line that writes out the longphase headers appear to be:

So it looks as though longphase will write out these header lines to resultVcf any time it finds # ANYWHERE in the record. Indeed we've directly observed one longphaseVersion header for each input line that contains a # in the INFO field:

$ zgrep -c 'longphaseVersion' tmp_chr2.vcf.gz
78
$ zgrep -v '^#' tmp_chr2.vcf.gz | grep -c '#'
76

I'd suggest if looking for header lines that these checks specifically look for #/## at the beginning of the input string.

ythuang0522 commented 6 months ago

Hi @SamStudio8

Thank you very much for digging into this issue. We should limit it to the beginning of each line as you suggested. It will be fixed in the next version together with other features in two weeks.

Thanks, Yao-Ting

ythuang0522 commented 5 months ago

@SamStudio8 @ilivyatan The v1.7 release has resolved this repeated header issue. We forgot to mention this issue in the release notes, but the SnpParser::writeLine has been rewritten as suggested by Sam.