Closed goeckeritz closed 3 months ago
This warning message looks important
[W::bcf_sr_add_reader] No BGZF EOF marker; file <file> may be truncated
I suspect one of the commands in your pipeline failed to produce an output, or crashed, leaving the rest with corrupted input. I will not try to analyze your pipeline, but suggest to do what I'd do with a problem like this: split it into small consecutive steps. For example, split
bcftools cmd1 | bcftools cmd2 | bcftools cmd3
into
bcftools cmd1 -o out1.bcf
bcftools cmd2 -o out2.bcf out1.bcf
bcftools cmd3 -o out3.bcf out2.bcf
and look for errors in each step separately.
Hi Petr,
I'll give this a shot and get back to you! Thanks so much for the response.
Kindly, Charity
Hi Petr,
As you suggested, I ran each step of my pipeline separately. And as I suspected, I didn't receive any errors until I got to the concat command. The [W::bcf_sr_add_reader] No BGZF EOF marker; file <file> may be truncated
is also always preceded with the contiguous error: The chromosome block chr1A is not contiguous, consider running with -a.
When I add the -a flag to the concat command, both of these errors disappear, and the file sizes of MERGED are more on par with what I'd expect for the combined REF + ALT files. So... do you think there is any cause for concern?
As a reminder, the last two commands of my pipeline are:
bcftools concat --threads 12 -Oz -a ${GENO}_ALT.vcf.gz ${GENO}_REF.vcf.gz > ${GENO}_MERGED.vcf.gz bcftools index -f --threads 12 ${GENO}_MERGED.vcf.gz
Thanks again for all of your help! Kindly, Charity
I checked the commands again. You seem to be running the same calling twice, then filtering the result in two different ways to obtain two files. Then you want to merge these files "vertically". That's not the best way of doing it.
First, the mpileup
step is computationally expensive, unless your data is small I'd run it only once and then work with the resulting file
Second, you hard-filter the file in two ways, then you are merging them. This way you'll get many duplicate positions which will lead to problems. Better is the soft-filtering functionality of bcftools filter -s FilterName
. This was you can chain two filter
commands, possibly followed with view -i 'FILTER="PASS"'
, and obtain the desired output immediately.
I hope this helps!
Thanks Petr, this makes a lot of sense.
Hi Petr,
I decided to open a new issue since I'm having several problems, actually.
I am also having the contiguous issue for bcftools concat (issue #1105), despite piping the files I'm trying to combine through bcftools sort. These are the commands I'm using:
FILE 1:
bcftools mpileup --threads 12 -Oz -a FORMAT/AD -f ${REFERENCE} ${GENO}_sorted.bam | bcftools call --threads 12 -Oz -m -a GQ | bcftools +fill-tags -Oz --threads 12 -- -t FORMAT/VAF | bcftools view --threads 12 --with-header -Oz -M2 | bcftools filter --threads 12 -i "QUAL>50 & (DP4[0]+DP4[1]+DP4[2]+DP4[3])>12 & (DP4[0]+DP4[1]+DP4[2]+DP4[3])<30 & GQ>30 & ((FORMAT/VAF>0.2 & FORMAT/VAF<0.8) | FORMAT/VAF=0)" | bcftools view --threads 12 --with-header -i 'TYPE="snp"' | bcftools sort -m 190G -Oz > /cluster/home/cgoeckeritz/bwa_full/embryos/vcfs/${GENO}_ALT.vcf.gz
FILE 2:
bcftools mpileup --threads 12 -Oz -a FORMAT/AD -f ${REFERENCE} ${GENO}_sorted.bam | bcftools call --threads 12 -Oz -m -a GQ | bcftools +fill-tags -Oz --threads 12 -- -t FORMAT/VAF | bcftools view --threads 12 --with-header -Oz -M2 | bcftools filter --threads 12 -i "QUAL>50 & (DP4[0]+DP4[1]+DP4[2]+DP4[3])>12 & (DP4[0]+DP4[1]+DP4[2]+DP4[3])<30" | bcftools view --threads 12 --with-header -i 'GT="RR"' | bcftools sort -m 190G -Oz > /cluster/home/cgoeckeritz/bwa_full/embryos/vcfs/${GENO}_REF.vcf.gz
I also made an index for each.Then, to merge the files, I ran:
bcftools concat --threads 12 -Oz /cluster/home/cgoeckeritz/bwa_full/embryos/vcfs/${GENO}_ALT.vcf.gz /cluster/home/cgoeckeritz/bwa_full/embryos/vcfs/${GENO}_REF.vcf.gz > /cluster/home/cgoeckeritz/bwa_full/embryos/vcfs/${GENO}_MERGED.vcf.gz
Doing the command
bcftools query -f '%CHROM\t%POS\n'
on the MERGED outputs every single variant in the file. Moreover, every line is unique, so I know there is at least no repeated locations, which is good... because that wouldn't make a ton of sense.Nonetheless, I am getting this contiguous error for the first chromosome. I can add the -a option, but I'm worried that will cause some issues if I were to then use gtcheck, which is later in my pipeline. Also, when I run the suggested
bcftools query
command, I get the error (or is it a warning?)[W::bcf_sr_add_reader] No BGZF EOF marker; file <file> may be truncated
, and the index fails to be made. It suggests I use the -f option to write it anyway, but should I be concerned about this?The last issue I'm having -- and perhaps the most concerning -- is that the ALT file and the MERGED file are almost identical sizes or MERGED is a bit smaller, when the MERGED should be much larger since it should be ~sum of the ALT and REF files.
Any thoughts / comments on why I am getting these errors would be much appreciated - and thank you so much for your time!
Kindly, Charity
Originally posted by @goeckeritz in https://github.com/samtools/bcftools/issues/1105#issuecomment-2236872354