bcftools concat issues...

goeckeritz commented 3 months ago

Hi Petr,

I decided to open a new issue since I'm having several problems, actually.

I am also having the contiguous issue for bcftools concat (issue #1105), despite piping the files I'm trying to combine through bcftools sort. These are the commands I'm using:

FILE 1: bcftools mpileup --threads 12 -Oz -a FORMAT/AD -f ${REFERENCE} ${GENO}_sorted.bam | bcftools call --threads 12 -Oz -m -a GQ | bcftools +fill-tags -Oz --threads 12 -- -t FORMAT/VAF | bcftools view --threads 12 --with-header -Oz -M2 | bcftools filter --threads 12 -i "QUAL>50 & (DP4[0]+DP4[1]+DP4[2]+DP4[3])>12 & (DP4[0]+DP4[1]+DP4[2]+DP4[3])<30 & GQ>30 & ((FORMAT/VAF>0.2 & FORMAT/VAF<0.8) | FORMAT/VAF=0)" | bcftools view --threads 12 --with-header -i 'TYPE="snp"' | bcftools sort -m 190G -Oz > /cluster/home/cgoeckeritz/bwa_full/embryos/vcfs/${GENO}_ALT.vcf.gz

FILE 2: bcftools mpileup --threads 12 -Oz -a FORMAT/AD -f ${REFERENCE} ${GENO}_sorted.bam | bcftools call --threads 12 -Oz -m -a GQ | bcftools +fill-tags -Oz --threads 12 -- -t FORMAT/VAF | bcftools view --threads 12 --with-header -Oz -M2 | bcftools filter --threads 12 -i "QUAL>50 & (DP4[0]+DP4[1]+DP4[2]+DP4[3])>12 & (DP4[0]+DP4[1]+DP4[2]+DP4[3])<30" | bcftools view --threads 12 --with-header -i 'GT="RR"' | bcftools sort -m 190G -Oz > /cluster/home/cgoeckeritz/bwa_full/embryos/vcfs/${GENO}_REF.vcf.gz I also made an index for each.

Then, to merge the files, I ran: bcftools concat --threads 12 -Oz /cluster/home/cgoeckeritz/bwa_full/embryos/vcfs/${GENO}_ALT.vcf.gz /cluster/home/cgoeckeritz/bwa_full/embryos/vcfs/${GENO}_REF.vcf.gz > /cluster/home/cgoeckeritz/bwa_full/embryos/vcfs/${GENO}_MERGED.vcf.gz

Doing the command bcftools query -f '%CHROM\t%POS\n' on the MERGED outputs every single variant in the file. Moreover, every line is unique, so I know there is at least no repeated locations, which is good... because that wouldn't make a ton of sense.

Nonetheless, I am getting this contiguous error for the first chromosome. I can add the -a option, but I'm worried that will cause some issues if I were to then use gtcheck, which is later in my pipeline. Also, when I run the suggested bcftools query command, I get the error (or is it a warning?) [W::bcf_sr_add_reader] No BGZF EOF marker; file <file> may be truncated, and the index fails to be made. It suggests I use the -f option to write it anyway, but should I be concerned about this?

The last issue I'm having -- and perhaps the most concerning -- is that the ALT file and the MERGED file are almost identical sizes or MERGED is a bit smaller, when the MERGED should be much larger since it should be ~sum of the ALT and REF files.

Any thoughts / comments on why I am getting these errors would be much appreciated - and thank you so much for your time!

Kindly, Charity

Originally posted by @goeckeritz in https://github.com/samtools/bcftools/issues/1105#issuecomment-2236872354

pd3 commented 3 months ago

This warning message looks important

[W::bcf_sr_add_reader] No BGZF EOF marker; file <file> may be truncated

I suspect one of the commands in your pipeline failed to produce an output, or crashed, leaving the rest with corrupted input. I will not try to analyze your pipeline, but suggest to do what I'd do with a problem like this: split it into small consecutive steps. For example, split

bcftools cmd1 | bcftools cmd2 | bcftools cmd3

into

bcftools cmd1 -o out1.bcf
bcftools cmd2 -o out2.bcf out1.bcf
bcftools cmd3 -o out3.bcf out2.bcf

and look for errors in each step separately.

goeckeritz commented 3 months ago

Hi Petr,

I'll give this a shot and get back to you! Thanks so much for the response.

Kindly, Charity

goeckeritz commented 3 months ago

Hi Petr,

As you suggested, I ran each step of my pipeline separately. And as I suspected, I didn't receive any errors until I got to the concat command. The [W::bcf_sr_add_reader] No BGZF EOF marker; file <file> may be truncated is also always preceded with the contiguous error: The chromosome block chr1A is not contiguous, consider running with -a.

When I add the -a flag to the concat command, both of these errors disappear, and the file sizes of MERGED are more on par with what I'd expect for the combined REF + ALT files. So... do you think there is any cause for concern?

As a reminder, the last two commands of my pipeline are: bcftools concat --threads 12 -Oz -a ${GENO}_ALT.vcf.gz ${GENO}_REF.vcf.gz > ${GENO}_MERGED.vcf.gz bcftools index -f --threads 12 ${GENO}_MERGED.vcf.gz

Thanks again for all of your help! Kindly, Charity

pd3 commented 3 months ago

I checked the commands again. You seem to be running the same calling twice, then filtering the result in two different ways to obtain two files. Then you want to merge these files "vertically". That's not the best way of doing it.

First, the mpileup step is computationally expensive, unless your data is small I'd run it only once and then work with the resulting file

Second, you hard-filter the file in two ways, then you are merging them. This way you'll get many duplicate positions which will lead to problems. Better is the soft-filtering functionality of bcftools filter -s FilterName. This was you can chain two filter commands, possibly followed with view -i 'FILTER="PASS"', and obtain the desired output immediately.

I hope this helps!

goeckeritz commented 2 months ago

Thanks Petr, this makes a lot of sense.

samtools / bcftools

bcftools concat issues... #2228