This PR leverages sqlframe, a new DataFrame library that targets SQL backends (e.g. DuckDB/BigQuery/Postgres) while exposing the PySpark DataFrame API... without the JVM or actually running Spark itself.
This has two major benefits for users:
Like Ibis, it allows users to leverage SQL platforms as an execution engine in addition to a storage engine. Approaches like our pandas.SQLTableDataset are naive in the sense that they don't use the SQL engine for processing, only for storage.
For users already accustomed to Spark syntax, or for brownfield projects already written in Spark, this provides a low-friction adoption route.
Development notes
This has been tested locally in the terminal; I've not yet written formal tests. Experimental mode, baby.
I've also done some funky OmegaConf resolver stuff so that the SQL connection can be lazily defined in YAML, without creating a super complicated dataset class, whilst still supporting dynamic switching of backends.
Checklist
[ ] Opened this PR as a 'Draft Pull Request' if it is work-in-progress
[x] Updated the documentation to reflect the code changes
[ ] Added a description of this change in the relevant RELEASE.md file