projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0

Huge (15x-150x) performance regression in VCF parsing after release 0.2 #400

Closed dna0ff closed 2 years ago

dna0ff commented 3 years ago

There is a huge performance regression in VCF parsing since release 0.2.

Test dataset

Test bed

Test results

Test sources

Running options

time ./spark-submit \
    --master local[*] \
    --driver-memory=400g \
    --packages=io.projectglow:glow-spark2_2.11:1.0.1,io.delta:delta-core_2.11:0.5.0 \
    --class org.annii.etl2delta \
    /path/to/jar \
    --path /path/to/vcf \
    --path2save /path/to/delta
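
For reference, here is a minimal PySpark sketch of what an etl2delta-style job plausibly does with the Glow API. The actual test source isn't reproduced in this report, so the paths and structure below are assumptions:

    # Hedged sketch only: read a VCF with Glow, split multiallelic sites, write to Delta.
    from pyspark.sql import SparkSession
    import glow

    spark = SparkSession.builder.appName("etl2delta-sketch").getOrCreate()
    spark = glow.register(spark)  # register Glow functions and transformers

    vcf_path = "/path/to/vcf"      # placeholder, matches --path above
    delta_path = "/path/to/delta"  # placeholder, matches --path2save above

    df = spark.read.format("vcf").load(vcf_path)

    # In Glow >= 0.4, splitting multiallelic sites is a separate transformer
    # rather than a reader option.
    split_df = glow.transform("split_multiallelics", df)

    split_df.write.format("delta").mode("overwrite").save(delta_path)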

Notes

thanks, dmitry

williambrandler commented 3 years ago

Hey dmitry, thanks for sharing these benchmarks. Wow, that does seem like a very long time to ingest a dataset of that size. Perhaps there is a simple fix.

To investigate further please rerun the analysis on chromosome 22 of the 1000 Genomes (ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz).

We can then run it on the same dataset and compare results before investigating the cause in your dataset. Thanks!

dna0ff commented 3 years ago

Hi William,

Thank you for looking at it.

With all due respect, I have a hard time understanding how an absolute time for processing an arbitrary single chromosome on arbitrary hardware will let you "investigate further" or understand the root cause of the performance regression.

Assume you have a result time T1 for processing chr22 with N1 samples on the hardware above. This result will not be comparable with yours, as you'll run on different hardware with different resources available. T1 also won't be meaningfully comparable with the whole-dataset results provided in the report.

It's not the absolute time that matters in performance regression tests. One would need to run tests in all the other configurations to make sense of the results. I've done that with other private datasets - that's how the issue was revealed.

I'm happy to share the dataset used in testing if you really need it - let me know. It's 1000 Genomes Project FASTQ data for 51 WGS samples, processed by a fastq2gvcf pipeline and then jointly called. That being said, having a particular dataset for testing is not essential unless you are unable to reproduce the performance regression with any other reasonable dataset, which is unlikely.

Can you please confirm that you are unable to reproduce the issue in your environment, or that any info required for reproducing it is missing from the report?

thanks, dmitry

williambrandler commented 3 years ago

ah ok, please share the 51-genome dataset you tested on, thanks!

karenfeng commented 3 years ago

@williambrandler, I'd recommend checking how much of the performance change is the result of the VCF parser versus the split_multiallelics transformer, which is more sophisticated than the original splitToBiallelic=true option and would therefore be more expensive. @kianfar77 would have more context; he also identified ways to improve performance, such as turning off whole-stage codegen.
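
A rough way to separate those two costs (a diagnostic sketch, not the benchmark used in this report; paths are placeholders, and spark.sql.codegen.wholeStage is the standard Spark conf for whole-stage codegen):

    # Hedged diagnostic sketch: time the VCF parse alone, then parse + split,
    # with whole-stage codegen toggled via the Spark conf.
    import time
    import glow
    from pyspark.sql import SparkSession

    spark = glow.register(SparkSession.builder.getOrCreate())
    spark.conf.set("spark.sql.codegen.wholeStage", "false")  # compare with "true"

    df = spark.read.format("vcf").load("/path/to/vcf")

    t0 = time.time()
    df.write.format("parquet").mode("overwrite").save("/tmp/parse_only")  # forces a full parse
    print("parse only:", time.time() - t0)

    t0 = time.time()
    glow.transform("split_multiallelics", df) \
        .write.format("parquet").mode("overwrite").save("/tmp/parse_and_split")
    print("parse + split:", time.time() - t0)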

williambrandler commented 3 years ago

certainly the effort that went into improving split_multiallelics is part of the story, but I don't think it is the full explanation.

Having the same dataset to test on will help troubleshoot, so we can explore whether there are specific characteristics of this dataset that are causing the problem, such as a skewed distribution of indels.

And it will help us make explicit recommendations to users in the future on how to improve ingest efficiency.

dna0ff commented 3 years ago

Karen,

It's a good point to isolate the split-multiallelics functionality for diagnostic purposes. The reason this wasn't done initially is that splitting is an essential part of the pipeline, and the final solution isn't very relevant without it.

For diagnostic purposes, I will run tests without split_multiallelics and will let you know.

I didn't mention it in the initial report, but there was a ~15x performance drop in 0.3 as well, while still using the splitToBiallelic=true option. That must be in part related to the single-threaded implementation, but it's unclear to what extent (the 0.3-based pipeline was ETL to Kudu, with a 15x drop in performance compared to 0.2 to Kudu). That drop is eclipsed by the 150x drop in 0.4/0.5 that comes with the split_multiallelics transformer.
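
For reference (a minimal illustration, not code from the pipelines above), the two generations of the splitting API look roughly like this:

    import glow
    from pyspark.sql import SparkSession

    spark = glow.register(SparkSession.builder.getOrCreate())

    # Glow <= 0.3: splitting was a VCF reader option
    old_style = spark.read.format("vcf") \
        .option("splitToBiallelic", True) \
        .load("/path/to/vcf")

    # Glow >= 0.4: splitting is a dedicated transformer applied after the read
    new_style = glow.transform("split_multiallelics",
                               spark.read.format("vcf").load("/path/to/vcf"))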

dna0ff commented 3 years ago

William, I'll get in touch with you wrt VCF transfer.

dna0ff commented 3 years ago

Hi All,

Results without splitting multiallelics:

Glow 0.2 without splitToBiallelic

Glow 1.0.1 without split_multiallelics with default parser

Glow 1.0.1 without split_multiallelics with htsjdk parser

Let me know if you would like me to run diagnostic binaries for split_multiallelics transformer.

williambrandler commented 3 years ago

ah ok, thanks for the update - so this is primarily driven by the split_multiallelics transformer. A user could apply a filter on indels/STRs so this step is eliminated for 90% of variants.

What do you think of this workaround @dnafault?

dna0ff commented 3 years ago

It is important to have indels split in the resulting dataset, William. Throwing away the majority of variants that need splitting is not a solution.

What we really need is not a workaround. We need to figure out why performance is so terrible and fix the implementation.

williambrandler commented 2 years ago

by filter I mean filter out the biallelic SNPs and apply this function only to the multiallelic variants, where len(alternateAlleles) > 1.

Then you can merge the biallelic SNPs back in afterwards.
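
A rough sketch of that workaround, assuming Spark SQL's size function on the alternateAlleles array column and that the split output keeps a union-compatible schema (worth verifying):

    # Hedged sketch: split only the multiallelic rows, then union the biallelic rows back in.
    import glow
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, size

    spark = glow.register(SparkSession.builder.getOrCreate())

    df = spark.read.format("vcf").load("/path/to/vcf")

    multiallelic = df.where(size(col("alternateAlleles")) > 1)
    biallelic = df.where(size(col("alternateAlleles")) <= 1)

    split_multi = glow.transform("split_multiallelics", multiallelic)

    # unionByName aligns columns by name; check that the schemas still match
    result = biallelic.unionByName(split_multi)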

We are working on perf improvements for the next release. Do you have any ideas on how to improve this? That would be helpful

williambrandler commented 2 years ago

In recent benchmarks, the entire ingest and QC workflow is only ~10x more compute than one run of linear regression.

So if you run GWAS 100+ times, the cost of VCF parsing and QC becomes trivial.

We will continue to monitor VCF parsing and QC steps via benchmarks and seek to optimize them with cluster configurations and spark configuration tuning.

Thanks for raising this issue @dnafault, closing it now

dna0ff commented 2 years ago

Dear Glow Team.

The issue has been closed without a single code commit or a fix of any kind - with the only justification essentially saying "if you parse the data and then perform computations for a really, really long time, then the initial parsing time is negligible".

I don't quite understand how Databricks could even consider closing the issue with such a resolution.

There are other meaningful Glow use cases beyond running GWAS in a loop. In a significant fraction of ETL applications, the extract and transform parts take the vast majority of the time and the load part is tiny.

The ETL-to-Delta test case provided in the issue report demonstrates that the performance drop between 0.2 and 1.0.* is at least 15x. In our production workloads with ETL to Kudu, the performance drop is similar. Nothing has been fixed.

This remains a major issue, one "which makes all releases after 0.2 unusable for processing medium to large cohorts", as highlighted in the first post. Unusable. Even in GWAS that happen to require splitting multiallelics on large cohorts.

Karen (@karenfeng), Henry (@henrydavidge) - what's the Databricks engineers' position on this?

regards, dmitry