projectglow / glow

An open-source toolkit for large-scale genomic analysis
https://projectglow.io
Apache License 2.0
262 stars 106 forks source link

Left overlap join function #578

Closed henrydavidge closed 4 months ago

henrydavidge commented 4 months ago

What changes are proposed in this pull request?

Adds Scala and Python functions for joining two DataFrames on an interval overlap condition. The two language APIs have the same functionality, but there's a bit more convenience functionality in the Python API.

Since Databricks' range join optimization doesn't support left joins for interval overlaps, we separate SNPs (intervals with length 1) and longer intervals from the left side and join them separately. This approach can have major performance benefits when there are many SNPs, as is common in genetics workloads.

On a dataset with 1B left rows and 1M right rows and varying percentages of SNPs in the left table (tested with 1 4 core executor due to quota):

Inner range join + left join, all SNP percentages: 4h
Glow join, 0% SNPs: 4h
Glow join, 50% SNPs: 2h9m
Glow join, 90% SNPs: 0h42m

How is this patch tested?

(Details)