Open oleraj opened 4 years ago
One idea to try might be to use a CSI index instead of a TBI index and use a smaller minimum interval size. You could generate a new index with something like bcftools index --min-shift 9 /data/gnomad.genomes.r2.0.1.sites.vt.vcf.gz
, then rename/remove the old tabix index before running annotate
again so it doesn't accidentally get picked up.
This idea is based mostly on this: https://github.com/brentp/vcfanno/issues/44#issuecomment-351510778, which helped me a while ago when I was trying to annotate with CADD (another large annotation source).
I don't think csi index will make any difference, the problem is most likely related to these issues https://github.com/samtools/bcftools/issues/943, https://github.com/samtools/bcftools/issues/1047. Can you try with the --single-overlaps
option?
Thanks for the idea @garrettjstevens. I tried this and it still took ~2 hours to run.
Thanks @pd3, I'll try that next.
@pd3 I tested this option and it still took ~2 hours.
To see if it might be related to other features that have been added in the past I tested bcftools versions 1.2, 1.3, 1.4.1, 1.6 and 1.9. Versions 1.2 and 1.3 didn't work due to the --collapse
option missing but the other versions also took ~2 hours (using the VCF with 8 variants), so it seems to be a long-standing issue.
Sorry for the delay in responding. I looked at this again and realized what the problem is: when annotate
transfers annotations from one file into another, it uses htslib's synchronized reading API which streams through both files simultaneously and throws away sites that are not found in both files. This is of course not great for your application.
A quick and easy way around this is to create a region file and restrict to these regions:
bcftools query -f'%CHROM\t%POS\n' small.vcf.gz > sites.txt
bcftools annotate -R sites.txt ....
In addition, the gnomad VCF is big and therefore very slow to parse. The performance can be further improved by converting to a BCF first. In this case it will also reduce the size of the file by about half.
bcftools view gnomad.vcf.gz -Ob -o gnomad.bcf
Thanks @pd3, this is very helpful. I'll try out your suggestions.
Sorry for the delay in responding. I looked at this again and realized what the problem is: when
annotate
transfers annotations from one file into another, it uses htslib's synchronized reading API which streams through both files simultaneously and throws away sites that are not found in both files. This is of course not great for your application.A quick and easy way around this is to create a region file and restrict to these regions:
bcftools query -f'%CHROM\t%POS\n' small.vcf.gz > sites.txt bcftools annotate -R sites.txt ....
In addition, the gnomad VCF is big and therefore very slow to parse. The performance can be further improved by converting to a BCF first. In this case it will also reduce the size of the file by about half.
bcftools view gnomad.vcf.gz -Ob -o gnomad.bcf
It appears that this method doesn't always speed up and depends on the position of variants.
For example I have:
1kg_af.txt.gz
: bgzipped txt with columns (chromosome, start, end, alt, af) which contains all variants and AF from 1000G, ~0.5GB in sizeinput.vcf.gz
: a VCF with around 3200 variants on chromosome 1-22Without region filter, time taken is around 32 seconds:
bcftools annotate -a 1kg_af.txt.gz -h header.txt -c CHROM,FROM,TO,ALT,AF -Oz -o input.ann.vcf.gz input.vcf.gz
When filtered by all sites, time taken is still around 32 seconds:
bcftools query -f'%CHROM\t%POS\n' input.vcf.gz > sites.txt
bcftools annotate -R sites.txt ....
If filtered by first half of sites, time taken is reduced to around 21s:
head -n 1600 sites.txt > sites.half.txt
bcftools annotate -R sites.half.txt ....
Similarly, with first quarter of sites, time taken is further reduced to around 11s:
head -n 800 sites.txt > sites.quarter.txt
bcftools annotate -R sites.quarter.txt ....
However, if random quarter of sites are used as filter, time taken is still around 32s:
shuf sites.txt | head -n 800 | sort -k1,1 -k2,2n > sites.shuf.quarter.sort.txt
bcftools annotate -R sites.shuf.quarter.sort.txt ....
When only one site is considered per chromosome, time taken is just around 4s:
for i in {1..22}; do
grep -P "^$i\t" sites.txt | awk 'NR==1' >> sites.chrom1.txt;
done
bcftools annotate -R sites.chrom1.txt ....
However, if both first and last sites are considered per chromosome, then time taken remains at around 32s:
for i in {1..22}; do
grep -P "^$i\t" sites.txt | awk 'NR==1; END{print}' >> sites.chrom2.txt;
done
bcftools annotate -R sites.chrom2.txt ....
I am also have this same exact issue and it seems to me that there are some unexplained issues. I am using a ~9.2GB annotation file available here and using the following to generate a test VCF to be annotated:
(echo "##fileformat=VCFv4.2";
echo "##contig=<ID=chrX,length=156040895>";
echo -e "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO";
echo -e "chrX\t77361852\t.\tAATAT\tA\t.\t.\t.";
echo -e "chrX\t79395749\t.\tC\tT\t.\t.\t.";
echo -e "chrX\t80726216\t.\tTAACATATAACAC\tT\t.\t.\t.";
echo -e "chrX\t85220714\t.\tC\tA\t.\t.\t.";
echo -e "chrX\t86895704\t.\tC\tT\t.\t.\t.";
echo -e "chrX\t96540352\t.\tT\tG\t.\t.\t.";
echo -e "chrX\t103549770\t.\tT\tC\t.\t.\t.";
echo -e "chrX\t105539532\t.\tA\tG\t.\t.\t.";
echo -e "chrX\t109217387\t.\tA\tG\t.\t.\t.";
echo -e "chrX\t114616228\t.\tG\tGT\t.\t.\t.";
echo -e "chrX\t115690491\t.\tA\tG\t.\t.\t.";
echo -e "chrX\t117568750\t.\tG\tA\t.\t.\t.";
echo -e "chrX\t119754099\t.\tA\tAC\t.\t.\t.";
echo -e "chrX\t121549098\t.\tA\tG\t.\t.\t.";
echo -e "chrX\t124339104\t.\tT\tA\t.\t.\t.";
echo -e "chrX\t129769496\t.\tG\tT\t.\t.\t.";
echo -e "chrX\t132161909\t.\tC\tG\t.\t.\t.";
echo -e "chrX\t133786911\t.\tA\tT\t.\t.\t.";
echo -e "chrX\t135877733\t.\tG\tA\t.\t.\t.";
echo -e "chrX\t150353604\t.\tC\tG\t.\t.\t.";
echo -e "chrX\t154603140\t.\tC\tCTTAT\t.\t.\t.") | \
bcftools view --no-version -Ob -o sites.bcf && \
bcftools index --force sites.bcf
bcftools query -f'%CHROM\t%POS\n' sites.bcf > sites.txt
Then using the sites.txt
file makes things very quick:
$ time bcftools annotate --no-version -a GCF_000001405.39.GRCh38.bcf -c RS sites.bcf -R sites.txt -o /dev/null
real 0m0.063s
user 0m0.043s
sys 0m0.020s
Restricting to chrX
as a region is much slower, but still acceptable:
$ time bcftools annotate --no-version -a GCF_000001405.39.GRCh38.bcf -c RS sites.bcf -r chrX -o /dev/null
real 0m9.039s
user 0m8.957s
sys 0m0.081s
However, if I try without selecting a region, I get:
$ time bcftools annotate --no-version -a GCF_000001405.39.GRCh38.bcf -c RS sites.bcf -o /dev/null
real 3m44.256s
user 3m40.175s
sys 0m3.945s
Most notably, when I run the command:
$ bcftools annotate --no-version -a GCF_000001405.39.GRCh38.bcf -c RS sites.bcf
After approximately ten seconds I get the (almost) full output:
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chrX,length=156040895>
##INFO=<ID=RS,Number=1,Type=Integer,Description="dbSNP ID (i.e. rs number)">
#CHROM POS ID REF ALT QUAL FILTER INFO
chrX 77361852 . AATAT A . . RS=202118283
chrX 79395749 . C T . . RS=12559773
chrX 80726216 . TAACATATAACAC T . . RS=754378013
chrX 85220714 . C A . . RS=2188732
chrX 86895704 . C T . . RS=756526961
chrX 96540352 . T G . . RS=57343194
chrX 103549770 . T C . . RS=1997648
chrX 105539532 . A G . . RS=5962543
chrX 109217387 . A G . . RS=200106209
chrX 114616228 . G GT . . RS=781786026
chrX 115690491 . A G . . RS=12836051
chrX 117568750 . G A . . RS=149942029
chrX 119754099 . A AC . . RS=75284556
chrX 121549098 . A G . . RS=191376079
chrX 124339104 . T A . . RS=200787584
chrX 129769496 . G T . . RS=2281133
chrX 132161909 . C G . . RS=149590924
chrX 133786911 . A T . . RS=144670363
chrX 135877733 . G A . . RS=1218635101
chrX 150353604 . C G . . RS=5925454
chrX 154603140 . C CTTAT . . RS=20165898
(one character is still missing at the end as the output is buffered). And yet, the process will take several minutes to complete and bcftools is using 100% of the CPU meanwhile. What is HTSlib/BCFtools doing? It seems like streaming chromosome X through both files simultaneously completes in ten seconds but after that something else that is not necessary and time consuming is happening. My guess is that HTSlib/BCFtools did not realize the first file reached the end and it is streaming through the whole of the second file. Is this the expected behavior? And when -r/-R
is not used why does bcftools annotate require the VCFs to be indexed if it is going to parse through all of them anyway?
The problem was reproducible on separate Linux machines using bcftools 1.14 and bcftools 1.15.1
I'm having this issue exactly as freeseek describes in the last comment, with bcftools 1.17 on CentOS Linux 7. Is there any progress on debugging this, or working around it?
Edit: apologies, skimmed a bit too fast and didn't see that some workarounds had already been suggested.
I believe a work around was given in https://github.com/samtools/bcftools/issues/1199#issuecomment-624713745, did you try it?
The program does not use index but streams both files which is a problem when one is huge. Forcing the use of the index with -R
should prevent that.
I understand it is a nuisance, but is not easy to address, it would require a heuristics to automatically choose between streaming (-T
) vs index-hopping (-R
). Will mark this as enhancement.
Since my destination files were per-chromosome, -r [chromosome name]
ended up working for me. The "getting stuck at the end" phenomenon still seems like a bug to me though, since it's clearly able to find the start of the chromosome immediately, but it doesn't "realize" that it's finished. Can't complain too much though since the simple workaround did entirely solve the problem for me.
IIRC, a few months ago I was annotating a single file with all chromosomes and -R with the sites file failed to speed it up (as https://github.com/samtools/bcftools/issues/1199#issuecomment-656032075 describes). However that may have been an older version of bcftools so don't necessarily take my word on it.
Hi,
I'm trying to annotate a VCF with only 248 variants using the gnomad genomes VCF as annotation and it takes about 2 hours to run. Both the input VCF and the annotation VCF are indexed with tabix so I'm not sure why it would need to take such a long time. Oddly enough, when I annotate a VCF with 24K variants using the same gnomad VCF as annotation it also takes about 2 hours to run. And the same if I start with a VCF with only 8 variants. If each variant is looked up in the annotation VCF using the index, it seems like the speed would be proportional to the size of the input VCF. Is the entire annotation VCF stored in memory first before the annotation is done? Is there any way to speed up the annotation?
Here's my command:
I'm using bcftools 1.10. FYI, the gnomad genomes VCF is about 84GB and the tabix index is 2.7MB.
Thanks,
Andrew