samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
679 stars 239 forks source link

bcftools annotate using large annotation VCF is slow regardless of input VCF size #1199

Open oleraj opened 4 years ago

oleraj commented 4 years ago

Hi,

I'm trying to annotate a VCF with only 248 variants using the gnomad genomes VCF as annotation and it takes about 2 hours to run. Both the input VCF and the annotation VCF are indexed with tabix so I'm not sure why it would need to take such a long time. Oddly enough, when I annotate a VCF with 24K variants using the same gnomad VCF as annotation it also takes about 2 hours to run. And the same if I start with a VCF with only 8 variants. If each variant is looked up in the annotation VCF using the index, it seems like the speed would be proportional to the size of the input VCF. Is the entire annotation VCF stored in memory first before the annotation is done? Is there any way to speed up the annotation?

Here's my command:

bcftools annotate -a /data/gnomad.genomes.r2.0.1.sites.vt.vcf.gz --collapse 'none' -c "aaf_gnomad_gen:=AF,gnomad_gen_popmax_aaf:=AF_POPMAX,gnomad_gen_popmax:=POPMAX" -O z -o test.gnomad.vcf.gz  test.vcf.gz

I'm using bcftools 1.10. FYI, the gnomad genomes VCF is about 84GB and the tabix index is 2.7MB.

Thanks,

Andrew

garrettjstevens commented 4 years ago

One idea to try might be to use a CSI index instead of a TBI index and use a smaller minimum interval size. You could generate a new index with something like bcftools index --min-shift 9 /data/gnomad.genomes.r2.0.1.sites.vt.vcf.gz, then rename/remove the old tabix index before running annotate again so it doesn't accidentally get picked up.

This idea is based mostly on this: https://github.com/brentp/vcfanno/issues/44#issuecomment-351510778, which helped me a while ago when I was trying to annotate with CADD (another large annotation source).

pd3 commented 4 years ago

I don't think csi index will make any difference, the problem is most likely related to these issues https://github.com/samtools/bcftools/issues/943, https://github.com/samtools/bcftools/issues/1047. Can you try with the --single-overlaps option?

oleraj commented 4 years ago

Thanks for the idea @garrettjstevens. I tried this and it still took ~2 hours to run.
Thanks @pd3, I'll try that next.

oleraj commented 4 years ago

@pd3 I tested this option and it still took ~2 hours.

To see if it might be related to other features that have been added in the past I tested bcftools versions 1.2, 1.3, 1.4.1, 1.6 and 1.9. Versions 1.2 and 1.3 didn't work due to the --collapse option missing but the other versions also took ~2 hours (using the VCF with 8 variants), so it seems to be a long-standing issue.

pd3 commented 4 years ago

Sorry for the delay in responding. I looked at this again and realized what the problem is: when annotate transfers annotations from one file into another, it uses htslib's synchronized reading API which streams through both files simultaneously and throws away sites that are not found in both files. This is of course not great for your application.

A quick and easy way around this is to create a region file and restrict to these regions:

bcftools query -f'%CHROM\t%POS\n' small.vcf.gz > sites.txt
bcftools annotate -R sites.txt ....

In addition, the gnomad VCF is big and therefore very slow to parse. The performance can be further improved by converting to a BCF first. In this case it will also reduce the size of the file by about half.

bcftools view gnomad.vcf.gz -Ob -o gnomad.bcf
oleraj commented 4 years ago

Thanks @pd3, this is very helpful. I'll try out your suggestions.

lacek commented 4 years ago

Sorry for the delay in responding. I looked at this again and realized what the problem is: when annotate transfers annotations from one file into another, it uses htslib's synchronized reading API which streams through both files simultaneously and throws away sites that are not found in both files. This is of course not great for your application.

A quick and easy way around this is to create a region file and restrict to these regions:

bcftools query -f'%CHROM\t%POS\n' small.vcf.gz > sites.txt
bcftools annotate -R sites.txt ....

In addition, the gnomad VCF is big and therefore very slow to parse. The performance can be further improved by converting to a BCF first. In this case it will also reduce the size of the file by about half.

bcftools view gnomad.vcf.gz -Ob -o gnomad.bcf

It appears that this method doesn't always speed up and depends on the position of variants.

For example I have:

Without region filter, time taken is around 32 seconds:

bcftools annotate -a 1kg_af.txt.gz -h header.txt -c CHROM,FROM,TO,ALT,AF -Oz -o input.ann.vcf.gz input.vcf.gz

When filtered by all sites, time taken is still around 32 seconds:

bcftools query -f'%CHROM\t%POS\n' input.vcf.gz > sites.txt
bcftools annotate -R sites.txt ....

If filtered by first half of sites, time taken is reduced to around 21s:

head -n 1600 sites.txt > sites.half.txt
bcftools annotate -R sites.half.txt ....

Similarly, with first quarter of sites, time taken is further reduced to around 11s:

head -n 800 sites.txt > sites.quarter.txt
bcftools annotate -R sites.quarter.txt ....

However, if random quarter of sites are used as filter, time taken is still around 32s:

shuf sites.txt | head -n 800 | sort -k1,1 -k2,2n > sites.shuf.quarter.sort.txt
bcftools annotate -R sites.shuf.quarter.sort.txt ....

When only one site is considered per chromosome, time taken is just around 4s:

for i in {1..22}; do
    grep -P "^$i\t" sites.txt | awk 'NR==1' >> sites.chrom1.txt;
done
bcftools annotate -R sites.chrom1.txt ....

However, if both first and last sites are considered per chromosome, then time taken remains at around 32s:

for i in {1..22}; do
    grep -P "^$i\t" sites.txt | awk 'NR==1; END{print}' >> sites.chrom2.txt;
done
bcftools annotate -R sites.chrom2.txt ....
freeseek commented 2 years ago

I am also have this same exact issue and it seems to me that there are some unexplained issues. I am using a ~9.2GB annotation file available here and using the following to generate a test VCF to be annotated:

(echo "##fileformat=VCFv4.2";
echo "##contig=<ID=chrX,length=156040895>";
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO";
echo -e "chrX\t77361852\t.\tAATAT\tA\t.\t.\t.";
echo -e "chrX\t79395749\t.\tC\tT\t.\t.\t.";
echo -e "chrX\t80726216\t.\tTAACATATAACAC\tT\t.\t.\t.";
echo -e "chrX\t85220714\t.\tC\tA\t.\t.\t.";
echo -e "chrX\t86895704\t.\tC\tT\t.\t.\t.";
echo -e "chrX\t96540352\t.\tT\tG\t.\t.\t.";
echo -e "chrX\t103549770\t.\tT\tC\t.\t.\t.";
echo -e "chrX\t105539532\t.\tA\tG\t.\t.\t.";
echo -e "chrX\t109217387\t.\tA\tG\t.\t.\t.";
echo -e "chrX\t114616228\t.\tG\tGT\t.\t.\t.";
echo -e "chrX\t115690491\t.\tA\tG\t.\t.\t.";
echo -e "chrX\t117568750\t.\tG\tA\t.\t.\t.";
echo -e "chrX\t119754099\t.\tA\tAC\t.\t.\t.";
echo -e "chrX\t121549098\t.\tA\tG\t.\t.\t.";
echo -e "chrX\t124339104\t.\tT\tA\t.\t.\t.";
echo -e "chrX\t129769496\t.\tG\tT\t.\t.\t.";
echo -e "chrX\t132161909\t.\tC\tG\t.\t.\t.";
echo -e "chrX\t133786911\t.\tA\tT\t.\t.\t.";
echo -e "chrX\t135877733\t.\tG\tA\t.\t.\t.";
echo -e "chrX\t150353604\t.\tC\tG\t.\t.\t.";
echo -e "chrX\t154603140\t.\tC\tCTTAT\t.\t.\t.") | \
  bcftools view --no-version -Ob -o sites.bcf && \
  bcftools index --force sites.bcf
bcftools query -f'%CHROM\t%POS\n' sites.bcf > sites.txt

Then using the sites.txt file makes things very quick:

$ time bcftools annotate --no-version -a GCF_000001405.39.GRCh38.bcf -c RS sites.bcf -R sites.txt -o /dev/null

real    0m0.063s
user    0m0.043s
sys 0m0.020s

Restricting to chrX as a region is much slower, but still acceptable:

$ time bcftools annotate --no-version -a GCF_000001405.39.GRCh38.bcf -c RS sites.bcf -r chrX -o /dev/null

real    0m9.039s
user    0m8.957s
sys 0m0.081s

However, if I try without selecting a region, I get:

$ time bcftools annotate --no-version -a GCF_000001405.39.GRCh38.bcf -c RS sites.bcf -o /dev/null

real    3m44.256s
user    3m40.175s
sys 0m3.945s

Most notably, when I run the command:

$ bcftools annotate --no-version -a GCF_000001405.39.GRCh38.bcf -c RS sites.bcf

After approximately ten seconds I get the (almost) full output:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chrX,length=156040895>
##INFO=<ID=RS,Number=1,Type=Integer,Description="dbSNP ID (i.e. rs number)">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
chrX    77361852    .   AATAT   A   .   .   RS=202118283
chrX    79395749    .   C   T   .   .   RS=12559773
chrX    80726216    .   TAACATATAACAC   T   .   .   RS=754378013
chrX    85220714    .   C   A   .   .   RS=2188732
chrX    86895704    .   C   T   .   .   RS=756526961
chrX    96540352    .   T   G   .   .   RS=57343194
chrX    103549770   .   T   C   .   .   RS=1997648
chrX    105539532   .   A   G   .   .   RS=5962543
chrX    109217387   .   A   G   .   .   RS=200106209
chrX    114616228   .   G   GT  .   .   RS=781786026
chrX    115690491   .   A   G   .   .   RS=12836051
chrX    117568750   .   G   A   .   .   RS=149942029
chrX    119754099   .   A   AC  .   .   RS=75284556
chrX    121549098   .   A   G   .   .   RS=191376079
chrX    124339104   .   T   A   .   .   RS=200787584
chrX    129769496   .   G   T   .   .   RS=2281133
chrX    132161909   .   C   G   .   .   RS=149590924
chrX    133786911   .   A   T   .   .   RS=144670363
chrX    135877733   .   G   A   .   .   RS=1218635101
chrX    150353604   .   C   G   .   .   RS=5925454
chrX    154603140   .   C   CTTAT   .   .   RS=20165898

(one character is still missing at the end as the output is buffered). And yet, the process will take several minutes to complete and bcftools is using 100% of the CPU meanwhile. What is HTSlib/BCFtools doing? It seems like streaming chromosome X through both files simultaneously completes in ten seconds but after that something else that is not necessary and time consuming is happening. My guess is that HTSlib/BCFtools did not realize the first file reached the end and it is streaming through the whole of the second file. Is this the expected behavior? And when -r/-R is not used why does bcftools annotate require the VCFs to be indexed if it is going to parse through all of them anyway?

The problem was reproducible on separate Linux machines using bcftools 1.14 and bcftools 1.15.1

dvg-p4 commented 1 year ago

I'm having this issue exactly as freeseek describes in the last comment, with bcftools 1.17 on CentOS Linux 7. Is there any progress on debugging this, or working around it?

Edit: apologies, skimmed a bit too fast and didn't see that some workarounds had already been suggested.

pd3 commented 1 year ago

I believe a work around was given in https://github.com/samtools/bcftools/issues/1199#issuecomment-624713745, did you try it?

The program does not use index but streams both files which is a problem when one is huge. Forcing the use of the index with -R should prevent that.

I understand it is a nuisance, but is not easy to address, it would require a heuristics to automatically choose between streaming (-T) vs index-hopping (-R). Will mark this as enhancement.

dvg-p4 commented 1 year ago

Since my destination files were per-chromosome, -r [chromosome name] ended up working for me. The "getting stuck at the end" phenomenon still seems like a bug to me though, since it's clearly able to find the start of the chromosome immediately, but it doesn't "realize" that it's finished. Can't complain too much though since the simple workaround did entirely solve the problem for me.

IIRC, a few months ago I was annotating a single file with all chromosomes and -R with the sites file failed to speed it up (as https://github.com/samtools/bcftools/issues/1199#issuecomment-656032075 describes). However that may have been an older version of bcftools so don't necessarily take my word on it.