Closed Hoeze closed 4 months ago
Hey @Hoeze, ten days seems excessive! I would not expect it to take this long.
How many variants and genotypes are in your VCF? Please checkpoint to parquet/delta after the initial read of the VCF. Performance will be much better if you treat ingest separately from any ETL steps (normalize / split).
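A minimal sketch of that ingest-then-checkpoint pattern (the function and paths are mine; it assumes a SparkSession with the Glow jar attached so the `vcf` data source is available, and Delta as the checkpoint format):

```python
def checkpoint_vcf(spark, vcf_path, delta_path):
    """Read a VCF once and checkpoint it to Delta before any ETL.

    Assumes a SparkSession with the Glow jar attached, so that the
    "vcf" data source is available.
    """
    df = spark.read.format("vcf").load(vcf_path)
    df.write.format("delta").mode("overwrite").save(delta_path)
    # Downstream steps (normalize / split) should read from this
    # checkpoint instead of re-parsing the gzipped VCF
    return spark.read.format("delta").load(delta_path)
```

Separating ingest from the ETL steps means the expensive VCF parse happens exactly once.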
Either way, these QC transformers are the most expensive step in the Glow pipeline; see the benchmarks on 500k samples / 250k variants: https://glow.readthedocs.io/en/latest/benchmarks.html
@kianfar77 what do you think? Is there a simple way to optimize this process?
I spoke to @kianfar77; best performance will be achieved when each step is run independently.
The bottleneck is most likely in split_multiallelics. It has a nested Spark transform function with a UDF, which is unavoidable, and UDFs are quite slow in Spark. This is an area we can look at improving. But the majority of variants are not multiallelic, so if you follow steps 1-4 for now you will get good results (not on par with bcftools, but not 10 days either).
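Run as separate steps, that could look like the sketch below. The function name and parameters are mine; `glow.transform` with the `split_multiallelics` and `normalize_variants` transformers is Glow's documented API, but the exact options may vary by version:

```python
def split_then_normalize(df, reference_genome_path):
    # Imported inside the function so this sketch can be loaded
    # outside a Spark/Glow environment
    import glow

    # Step 1: split multiallelic variants (the slow, UDF-backed transformer)
    split_df = glow.transform("split_multiallelics", df)

    # Step 2: normalize against the reference as a separate pass
    return glow.transform(
        "normalize_variants",
        split_df,
        reference_genome_path=reference_genome_path,
    )
```

Checkpointing the DataFrame between the two transforms keeps a failure in one step from forcing a re-run of the other.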
@Hoeze, I pulled together a notebook that breaks out each step: https://github.com/projectglow/glow/pull/455
It uses the append feature of Delta Lake, but can be adapted for Parquet.
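The append pattern mentioned above can be sketched like this (the `contigName` column is part of Glow's VCF schema; the function itself is my illustration, not the notebook's code):

```python
def append_contig(df, contig, delta_path):
    # Write one chromosome's variants, appending to the growing table;
    # for Parquet, swap format("delta") for format("parquet")
    (df.where(df.contigName == contig)
       .write.format("delta")
       .mode("append")
       .save(delta_path))
```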
Thanks a lot for putting so much effort into this issue @williambrandler! The notebook is really helpful :+1:
Another idea I had was to call bcftools with glow.transform. Do you think this creates any issues compared to the built-in transformers?
For the record, here are the stats of the VCF:
# bcftools index -s als/genomics/4_JointGenotyping/AnswerALS-866-G-v1-release5_joint-vcf-vqsr-annotated.vcf.gz
# name, length, nr. of records
chr1 248956422 3359080
chr2 242193529 3630901
chr3 198295559 3050269
chr4 190214555 2960827
chr5 181538259 2762725
chr6 170805979 2664879
chr7 159345973 2442933
chr8 145138636 2397787
chr9 138394717 1816266
chr10 133797422 2089000
chr11 135086622 2076529
chr12 133275309 2034996
chr13 114364328 1511920
chr14 107043718 1373426
chr15 101991189 1233256
chr16 90338345 1360226
chr17 83257441 1202475
chr18 80373285 1186463
chr19 58617616 954582
chr20 64444167 970779
chr21 46709983 557983
chr22 50818468 569685
chrX 156040895 1442073
That's a total of 43,649,060 records.
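As a quick sanity check, summing the per-chromosome record counts reproduces that total:

```python
records = [
    3359080, 3630901, 3050269, 2960827, 2762725, 2664879, 2442933,
    2397787, 1816266, 2089000, 2076529, 2034996, 1511920, 1373426,
    1233256, 1360226, 1202475, 1186463, 954582, 970779, 557983,
    569685, 1442073,
]  # chr1..chr22 + chrX, from the `bcftools index -s` output above

total = sum(records)
print(total)  # → 43649060
```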
You can use bcftools with the Glow Pipe Transformer
I actually have a notebook example of how to use bcftools to filter vcfs with the pipe transformer, which you can adapt, but not open sourced it yet. Drop an email to glow.contributors@gmail.com and I will send it over
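For reference, a pipe transformer call wrapping bcftools might look like the sketch below. The `cmd` contents and bcftools flags are my guesses, not the notebook's actual code; `input_formatter`, `output_formatter`, and `in_vcf_header` are documented pipe transformer options:

```python
import json

# Command the pipe transformer runs against each partition's VCF stream;
# the bcftools flags and reference path here are illustrative
cmd = json.dumps([
    "bcftools", "norm",
    "--multiallelics", "-any",        # split multiallelic records
    "--fasta-ref", "/ref/GRCh38.fa",  # placeholder reference path
    "-",                              # read VCF from stdin
])

def pipe_bcftools(df):
    import glow  # requires the Glow Python package on the cluster
    return glow.transform(
        "pipe",
        df,
        cmd=cmd,
        input_formatter="vcf",
        in_vcf_header="infer",
        output_formatter="vcf",
    )
```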
Cheers
@williambrandler - have you tried using bcftools with glow pipe to merge VCF files?
Hi, is it possible that the split and normalize transformers are significantly slower than bcftools norm? We just tried splitting a large 190GB VCF file with 800 samples from https://dataportal.answerals.org/home.
Running on 64 cores and 250GB RAM, this would have taken ~10 days, regardless of whether we disabled whole-stage codegen.
Then we tried the same with bcftools, split by chromosome: with 46 cores and 184GB RAM, this took around three hours.
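Since the exact invocation isn't shown, here is a hedged sketch of how such a per-chromosome run can be constructed (flags and file names are my assumptions about what a split-by-chromosome `bcftools norm` looks like):

```python
def norm_cmd(vcf, ref, contig, out):
    # One bcftools norm job per chromosome; the resulting commands can be
    # run in parallel with e.g. GNU parallel or a process pool
    return [
        "bcftools", "norm",
        "--multiallelics", "-any",  # split multiallelic records
        "--fasta-ref", ref,
        "--regions", contig,        # requires an indexed VCF
        "-Oz", "-o", out,
        vcf,
    ]

cmds = [
    norm_cmd("input.vcf.gz", "ref.fa", c, f"norm.{c}.vcf.gz")
    for c in [f"chr{i}" for i in range(1, 23)] + ["chrX"]
]
```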
Where does this difference come from? Am I making a mistake somewhere, or is there a way to improve this?