Closed dna0ff closed 2 years ago
Hey Dmitry, thanks for sharing these benchmarks. Wow, that does seem like a very long time to ingest a dataset of that size. Perhaps there is a simple fix.
To investigate further please rerun the analysis on chromosome 22 of the 1000 Genomes (ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz).
We can then run it on the same dataset and we can compare and contrast results before investigating the cause in your dataset. Thanks!
Hi William,
Thank you for looking at it.
With all due respect, I have a hard time understanding how an absolute time for processing an arbitrary single chromosome on arbitrary hardware will let you "investigate further" or understand the root cause of the performance regression.
Assume you get a time T1 for processing chr22 with N1 samples on the hardware above. This result will not be comparable with yours, as you will run on different hardware with different resources available. T1 also won't be meaningfully comparable with the whole-dataset results provided in the report.
It's not the absolute time that matters in performance regression tests. One would need to run tests in all the other configurations to make sense of the results. I've done that with other private datasets - that's how the issue was revealed.
I'm happy to share the dataset used in testing if you really need it - let me know. It's 1000 Genomes Project FASTQ for 51 WGS samples processed by a fastq2gvcf pipeline and then jointly called. That being said, having a particular dataset for testing is not essential unless you are unable to reproduce the performance regression with any other reasonable dataset, which is unlikely.
Can you please confirm that you are unable to reproduce the issue in your environment, or that some info required for reproducing it is missing from the report?
thanks, dmitry
Ah ok, please share the 51-genome dataset you tested on, thanks!
@williambrandler, I'd recommend checking how much of the performance change comes from the VCF parser versus the split_multiallelics transformer, which is more sophisticated than the original splitToBiallelic=true option and would therefore be more expensive. @kianfar77 would have more context; he also identified ways to improve performance, such as turning off whole-stage codegen.
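For context on why the transformer can dominate runtime: splitting a multiallelic site duplicates the row once per alternate allele and recodes every sample's genotype in each copy, so the work scales with alleles × samples. Below is a toy pure-Python sketch of that idea — it is not Glow's implementation, the record layout is invented for illustration, and the real transformer additionally rewrites INFO/AD/PL fields, which is where much of the extra cost lives:

```python
def split_multiallelic(variant):
    """Split one multiallelic site into one biallelic record per ALT allele.

    Toy model: each genotype is a tuple of allele indices (0 = REF,
    1..n = ALT alleles). For the record targeting ALT i, indices equal
    to i become 1 and all other indices are recoded to 0.
    """
    records = []
    for i, alt in enumerate(variant["alts"], start=1):
        recoded = [
            tuple((1 if a == i else 0) for a in gt)
            for gt in variant["genotypes"]
        ]
        records.append({
            "pos": variant["pos"],
            "ref": variant["ref"],
            "alt": alt,
            "genotypes": recoded,
        })
    return records

# A site with two ALT alleles and two samples produces two biallelic rows.
site = {"pos": 100, "ref": "A", "alts": ["G", "T"],
        "genotypes": [(0, 1), (1, 2)]}
for rec in split_multiallelic(site):
    print(rec["alt"], rec["genotypes"])
# → G [(0, 1), (1, 0)]
# → T [(0, 0), (0, 1)]
```

Even this stripped-down version touches every genotype once per alternate allele, which hints at why the full transformer is much heavier than a parse-time option.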
Certainly the effort that went into improving split multiallelics is part of the story, but I don't think it is the full explanation.
Having the same dataset to test on will help troubleshoot, so we can explore whether specific characteristics of this dataset are causing the problem, such as a skewed distribution of indels.
And it will help us make explicit recommendations to users in the future on how to improve ingest efficiency.
Karen,
It's a good point to isolate the split-multiallelics functionality for diagnostic purposes. The reason it wasn't done initially is that splitting is an essential part of the pipeline, and the final solution isn't very relevant without splitting.
For diagnostic purposes, I will run tests without split_multiallelics and will let you know.
I didn't mention it in the initial report, but there was a ~15x performance drop in 0.3 as well, while still using the splitToBiallelic=true option. This must in part be related to the single-threaded implementation, but it's unclear to what extent. (The 0.3-based pipeline was ETL to Kudu, with a 15x drop in performance compared to 0.2 to Kudu.) That drop is eclipsed by the 150x drop in 0.4/0.5 that comes with the split_multiallelics transformer.
William, I'll get in touch with you wrt VCF transfer.
Hi All,
Results without splitting multiallelics:

- Glow 0.2 without splitToBiallelic: 2 min 48 sec (vs 3 min 14 sec)
- Glow 1.0.1 without split_multiallelics, with default parser: 2 min 19 sec (vs 44 min 22 sec)
- Glow 1.0.1 without split_multiallelics, with htsjdk parser: 2 min 46 sec
Let me know if you would like me to run diagnostic binaries for split_multiallelics transformer.
Ah ok, thanks for the update. So this is primarily driven by the split_multiallelics transformer. A user could apply a filter on indels/STRs so this step is eliminated for 90% of variants.
What do you think of this workaround @dnafault?
It is important to have indels split in the resulting dataset, William. Throwing away the majority of variants that need splitting is not a solution.
What we really need is not a workaround. We need to figure out why performance is so terrible and fix the implementation.
By filter I mean: filter out the biallelic SNPs and apply this function only to the multiallelic variants, where len(alternateAlleles) > 1. Then you can merge the biallelic SNPs back in afterwards.
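The suggested workaround is a partition-and-union pattern: apply the expensive splitter only to rows with more than one alternate allele, pass everything else through, and concatenate the results. A minimal sketch over plain Python records (in Glow this would be two DataFrame filters on the size of `alternateAlleles` followed by a union; `split_one` here is a hypothetical stand-in for the split_multiallelics transformer):

```python
def split_only_multiallelics(variants, split_one):
    """Apply an expensive splitter only where it is needed.

    variants:  list of dicts, each with an "alts" list of ALT alleles.
    split_one: function mapping one multiallelic variant to a list of
               biallelic records (stand-in for the real transformer).
    """
    biallelic = [v for v in variants if len(v["alts"]) <= 1]
    multiallelic = [v for v in variants if len(v["alts"]) > 1]
    out = list(biallelic)           # passed through untouched
    for v in multiallelic:
        out.extend(split_one(v))    # only these pay the splitting cost
    return out

# Example with a trivial splitter that emits one record per ALT allele.
def naive_split(v):
    return [{"pos": v["pos"], "alts": [alt]} for alt in v["alts"]]

variants = [
    {"pos": 1, "alts": ["G"]},       # biallelic: skipped by the splitter
    {"pos": 2, "alts": ["C", "T"]},  # multiallelic: split into two rows
]
print(split_only_multiallelics(variants, naive_split))
```

Note that, as sketched, the pass-through rows lose their original ordering relative to the split rows, so a downstream sort by position may be needed if order matters.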
We are working on perf improvements for the next release. Do you have any ideas on how to improve this? That would be helpful.
In recent benchmarks, the entire ingest and QC workflow is only ~10x more compute than one run of linear regression.
So if you run GWAS 100+ times, the cost of VCF parsing and QC becomes trivial.
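The amortization argument above can be checked with back-of-the-envelope arithmetic. The 10x ratio is the figure quoted above; the run counts are illustrative:

```python
# If ingest+QC costs ~10x one linear-regression GWAS run, its share of
# total compute shrinks as the number of GWAS runs over the same
# ingested dataset grows.
INGEST_COST = 10.0  # measured in units of "one GWAS run"

def ingest_share(n_runs):
    """Fraction of total compute spent on ingest+QC after n_runs GWAS runs."""
    return INGEST_COST / (INGEST_COST + n_runs)

for n in (1, 10, 100):
    print(n, round(ingest_share(n), 3))
# → 1 0.909
# → 10 0.5
# → 100 0.091
```

This also makes the counter-argument visible: at 100+ runs ingest is under 10% of the compute, but for an ETL workload where n_runs is effectively zero, ingest is essentially the whole cost.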
We will continue to monitor the VCF parsing and QC steps via benchmarks and seek to optimize them with cluster configurations and Spark configuration tuning.
Thanks for raising this issue @dnafault, closing it now
Dear Glow Team,
The issue has been closed without a single code commit or a fix of any kind - with the only justification essentially saying "if you parse the data and then perform computations for a really, really long time, then the initial parsing time is negligible".
I don't quite understand how Databricks could even consider closing the issue with such a resolution.
There are other meaningful Glow use cases beyond running GWAS in a loop. In a significant fraction of ETL applications, the extract and transform parts take the vast majority of the time and the load part is tiny.
The ETL-to-Delta test case provided in the issue report demonstrates that the performance drop between 0.2 and 1.0.* is at least 15x. In our production workloads with ETL to Kudu, the performance drop is similar. Nothing has been fixed.
This remains a major issue, "which makes all releases after 0.2 unusable for processing medium to large cohorts", as highlighted in the first post. Unusable. Even in GWAS workloads that happen to require splitting multiallelics on large cohorts.
Karen (@karenfeng), Henry (@henrydavidge) - what's the Databricks engineers' position on this?
regards, dmitry
There is a huge performance regression in VCF parsing since release 0.2
Test dataset
Test bed
Test results

| Glow version | Splitting method | Time |
| --- | --- | --- |
| Glow 0.2 | splitToBiallelic=true as a read option | 3 min 14 sec |
| Glow 0.4 & 0.5 | split_multiallelics transformer to split multiallelic variants | 478 min = 7.96 hrs |
| Glow 0.6 & 1.0.1 | split_multiallelics transformer to split multiallelic variants | 44 min 22 sec |
Test sources
Running options
Notes
thanks, dmitry