projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

Huge (15x-150x) performance regression in VCF parsing after release 0.2 #400

Closed dna0ff closed 2 years ago

dna0ff commented 3 years ago

There is a huge performance regression in VCF parsing since release 0.2.

Test dataset

Test bed

Test results

Test sources

Running options

time ./spark-submit \
    --master local[*] \
    --driver-memory=400g \
    --packages=io.projectglow:glow-spark2_2.11:1.0.1,io.delta:delta-core_2.11:0.5.0 \
    --class org.annii.etl2delta \
    /path/to/jar \
    --path /path/to/vcf \
    --path2save /path/to/delta
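
For reference, here is a minimal PySpark sketch of what an etl2delta-style job plausibly does with the Glow API. The actual test source isn't reproduced in this report, so the paths and structure below are assumptions:

    # Hedged sketch only: read a VCF with Glow, split multiallelic sites, write to Delta.
    from pyspark.sql import SparkSession
    import glow

    spark = SparkSession.builder.appName("etl2delta-sketch").getOrCreate()
    spark = glow.register(spark)  # register Glow functions and transformers

    vcf_path = "/path/to/vcf"      # placeholder, matches --path above
    delta_path = "/path/to/delta"  # placeholder, matches --path2save above

    df = spark.read.format("vcf").load(vcf_path)

    # In Glow >= 0.4, splitting multiallelic sites is a separate transformer
    # rather than a reader option.
    split_df = glow.transform("split_multiallelics", df)

    split_df.write.format("delta").mode("overwrite").save(delta_path)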

Notes

thanks, dmitry

williambrandler commented 3 years ago

Hey dmitry, thanks for sharing these benchmarks. Wow, that does seem like a very long time to ingest a dataset of that size. Perhaps there is a simple fix.

To investigate further please rerun the analysis on chromosome 22 of the 1000 Genomes (ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz).

We can then run it on the same dataset and compare results before investigating the cause in your dataset. Thanks!

dna0ff commented 3 years ago

Hi William,

Thank you for looking at it.

With all due respect, I have a hard time understanding how an absolute time for processing an arbitrary single chromosome on arbitrary hardware will let you "investigate further" or understand the root cause of the performance regression.

Assume you have a result time T1 for processing chr22 with N1 samples on the hardware above. This result will not be comparable with yours, as you'll run on different hardware with different resources available. T1 also won't be meaningfully comparable with the whole-dataset results provided in the report.

It's not the absolute time that matters in performance regression tests. One would need to run tests in all the other configurations to make sense of the results. I've done that with other private datasets - that's how the issue was revealed.

I'm happy to share the dataset used in testing if you really need it - let me know. It's 1000 Genomes Project FASTQ data for 51 WGS samples, processed by a fastq2gvcf pipeline and then jointly called. That being said, having a particular dataset for testing is not essential unless you are unable to reproduce the performance regression with any other reasonable dataset, which is unlikely.

Can you please confirm that you are unable to reproduce the issue in your environment, or that any info required for reproducing it is missing from the report?

thanks, dmitry

williambrandler commented 3 years ago

ah ok, please share the 51-genome dataset you tested on, thanks!

karenfeng commented 3 years ago

@williambrandler, I'd recommend checking how much of the performance change is the result of the VCF parser versus the split_multiallelics transformer, which is more sophisticated than the original splitToBiallelic=true option and would therefore be more expensive. @kianfar77 would have more context; he also identified ways to improve performance, such as turning off whole-stage codegen.
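
A rough way to separate those two costs (a diagnostic sketch, not the benchmark used in this report; paths are placeholders, and spark.sql.codegen.wholeStage is the standard Spark conf for whole-stage codegen):

    # Hedged diagnostic sketch: time the VCF parse alone, then parse + split,
    # with whole-stage codegen toggled via the Spark conf.
    import time
    import glow
    from pyspark.sql import SparkSession

    spark = glow.register(SparkSession.builder.getOrCreate())
    spark.conf.set("spark.sql.codegen.wholeStage", "false")  # compare with "true"

    df = spark.read.format("vcf").load("/path/to/vcf")

    t0 = time.time()
    df.write.format("parquet").mode("overwrite").save("/tmp/parse_only")  # forces a full parse
    print("parse only:", time.time() - t0)

    t0 = time.time()
    glow.transform("split_multiallelics", df) \
        .write.format("parquet").mode("overwrite").save("/tmp/parse_and_split")
    print("parse + split:", time.time() - t0)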

williambrandler commented 3 years ago

certainly the effort that went into improving split_multiallelics is part of the story, but I don't think it is the full explanation.

Having the same dataset to test on will help troubleshoot, so we can explore whether there are specific characteristics of this dataset that are causing the problem, such as a skewed distribution of indels.

And it will help us make explicit recommendations to users in the future on how to improve ingest efficiency.

dna0ff commented 3 years ago

Karen,

It's a good point to isolate the split-multiallelics functionality for diagnostic purposes. The reason this wasn't done initially is that splitting is an essential part of the pipeline, and the final solution isn't very relevant without it.

For diagnostic purposes, I will run tests without split_multiallelics and will let you know.

I didn't mention it in the initial report, but there was a ~15x performance drop in 0.3 as well, while still using the splitToBiallelic=true option. That must be in part related to the single-threaded implementation, but it's unclear to what extent (the 0.3-based pipeline was ETL to Kudu, with a 15x drop in performance compared to 0.2 to Kudu). That drop is eclipsed by the 150x drop in 0.4/0.5 that comes with the split_multiallelics transformer.
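
For reference (a minimal illustration, not code from the pipelines above), the two generations of the splitting API look roughly like this:

    import glow
    from pyspark.sql import SparkSession

    spark = glow.register(SparkSession.builder.getOrCreate())

    # Glow <= 0.3: splitting was a VCF reader option
    old_style = spark.read.format("vcf") \
        .option("splitToBiallelic", True) \
        .load("/path/to/vcf")

    # Glow >= 0.4: splitting is a dedicated transformer applied after the read
    new_style = glow.transform("split_multiallelics",
                               spark.read.format("vcf").load("/path/to/vcf"))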

dna0ff commented 3 years ago

William, I'll get in touch with you wrt VCF transfer.

dna0ff commented 3 years ago

Hi All,

Results without splitting multiallelics:

Glow 0.2 without splitToBiallelic

Glow 1.0.1 without split_multiallelics with default parser

Glow 1.0.1 without split_multiallelics with htsjdk parser

Let me know if you would like me to run diagnostic binaries for split_multiallelics transformer.

williambrandler commented 3 years ago

ah ok, thanks for the update - so this is primarily driven by the split_multiallelics transformer. A user could apply a filter on indels/STRs so this step is eliminated for 90% of variants.

What do you think of this workaround @dnafault?

dna0ff commented 3 years ago

It is important to have indels split in the resulting dataset, William. Throwing away the majority of variants that need splitting is not a solution.

What we really need is not a workaround. We need to figure out why performance is so terrible and fix the implementation.

williambrandler commented 2 years ago

by filter I mean filter out the biallelic SNPs and apply this function only to the multiallelic variants, where len(alternateAlleles) > 1.

Then you can merge the biallelic SNPs back in afterwards.
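
A rough sketch of that workaround, assuming Spark SQL's size function on the alternateAlleles array column and that the split output keeps a union-compatible schema (worth verifying):

    # Hedged sketch: split only the multiallelic rows, then union the biallelic rows back in.
    import glow
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, size

    spark = glow.register(SparkSession.builder.getOrCreate())

    df = spark.read.format("vcf").load("/path/to/vcf")

    multiallelic = df.where(size(col("alternateAlleles")) > 1)
    biallelic = df.where(size(col("alternateAlleles")) <= 1)

    split_multi = glow.transform("split_multiallelics", multiallelic)

    # unionByName aligns columns by name; check that the schemas still match
    result = biallelic.unionByName(split_multi)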

We are working on perf improvements for the next release. Do you have any ideas on how to improve this? That would be helpful

williambrandler commented 2 years ago

In recent benchmarks, the entire ingest and QC workflow is only ~10x more compute than one run of linear regression.

So if you run GWAS 100+ times, the cost of VCF parsing and QC becomes trivial.

We will continue to monitor VCF parsing and QC steps via benchmarks and seek to optimize them with cluster configurations and spark configuration tuning.

Thanks for raising this issue @dnafault, closing it now

dna0ff commented 2 years ago

Dear Glow Team.

The issue has been closed without a single code commit or a fix of any kind - with the only justification essentially saying "if you parse the data and then perform computations for a really, really long time, then the initial parsing time is negligible".

I don't quite understand how Databricks could even consider closing the issue with such a resolution.

There are other meaningful Glow use cases beyond running GWAS in a loop. In a significant fraction of ETL applications, the extract and transform parts take the vast majority of the time and the load part is tiny.

The ETL-to-Delta test case provided in the issue report demonstrates that the performance drop between 0.2 and 1.0.* is at least 15x. In our production workloads with ETL to Kudu, the performance drop is similar. Nothing has been fixed.

This remains a major issue, one "which makes all releases after 0.2 unusable for processing medium to large cohorts", as highlighted in the first post. Unusable. Even in GWAS that happen to require splitting multiallelics on large cohorts.

Karen (@karenfeng), Henry (@henrydavidge) - what's the Databricks engineers' position on this?

regards, dmitry