Open ilivyatan opened 1 year ago
Hi @ilivyatan,
Can you provide the header for the 27909_pass.wf_snp.vcf.gz?
Thanks
Any updates? Is the header readable?
Hi @ilivyatan
The header file you provided checks out fine. In the next version, additional conditions will be added to try to prevent issues with repetitive headers.
The attached file is just the excerpt of the header up until the first column header line:
AFTER that line and after the data starts there are 913 additional header-like lines INTERLEAVED within, and running all throughout the file. We just parsed them out and carried on with the VCF, but they shouldn't be there.
phased_27909_mod.vcf
or phased_27909_mod_mod.vcf
grep -n "#" phased_27909_mod.vcf
to provide the problematic headers?It's in phased_27909_mod.vcf The (grep) output you requested is attached : phased_27909_mod_with_interleaved_comments.vcf.txt
I noticed that the following line is missing in the phased header file you provided. Could you please confirm this for me?
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 27909_pass
By the way, could you test with the latest version, v1.5.2, to see if the same issue?
wget https://github.com/twolinin/longphase/releases/download/v1.5.2/longphase_linux-x64.tar.xz
tar -xJf longphase_linux-x64.tar.xz
We have also encountered this problem. I've taken a look at SnpParser::writeLine
and the conditions required to reach the line that writes out the longphase headers appear to be:
#
is present on the input line##
is not present on the input line and commandLine
is set to somethingSo it looks as though longphase
will write out these header lines to resultVcf
any time it finds #
ANYWHERE in the record. Indeed we've directly observed one longphaseVersion
header for each input line that contains a #
in the INFO field:
$ zgrep -c 'longphaseVersion' tmp_chr2.vcf.gz
78
$ zgrep -v '^#' tmp_chr2.vcf.gz | grep -c '#'
76
I'd suggest if looking for header lines that these checks specifically look for #
/##
at the beginning of the input string.
Hi @SamStudio8
Thank you very much for digging into this issue. We should limit it to the beginning of each line as you suggested. It will be fixed in the next version together with other features in two weeks.
Thanks, Yao-Ting
@SamStudio8 @ilivyatan The v1.7 release has resolved this repeated header issue. We forgot to mention this issue in the release notes, but the SnpParser::writeLine has been rewritten as suggested by Sam.
Hi, This is a really great tool and works very fast, thanks! We've been trying it out in different ways and many times the resulting phased VCF files contain multiple lines of headers in the middle of the files. For example, these lines appear 913 times in one file!
FORMAT=
longphaseVersion=1.5.1
commandline="phase -s data/27909_pass.wf_snp.vcf.gz --sv-file data/27909_pass.wf_sv.vcf.gz --mod-file mod_27909.vcf -b data/27909_pass_merged.bam -r data/GRCh38.no_alt_analysis_set.fasta -t 16 -o phased_27909_mod --indels --ont "
This also occurs with joint phasing of just SNV and SV vcf files. Why do comments appear in the middle of the output VCF files?