This PR leverages sqlframe, a new DataFrame library that targets SQL backends (e.g. DuckDB/BigQuery/Postgres) while exposing the PySpark DataFrame API... without the JVM or actually running Spark itself.
This has two major benefits for users:
Like Ibis, it allows users to leverage SQL platforms as an execution engine in addition to a storage engine. Approaches like our pandas.SQLTableDataset are naive in the sense that they don't use the SQL engine for processing, only for storage.
For users already accustomed to Spark syntax, or for brownfield projects already written in Spark, this provides a low-friction adoption route.
Development notes
This has been tested locally in the terminal; I've not yet written formal tests. Experimental mode, baby.
I've also done some funky OmegaConf resolver stuff so that the SQL connection can be lazily defined in YAML, without creating a super complicated dataset class, whilst still supporting dynamic switching of backends.
Checklist
[ ] Opened this PR as a 'Draft Pull Request' if it is work-in-progress
[x] Updated the documentation to reflect the code changes
[ ] Added a description of this change in the relevant RELEASE.md file