feat(datasets): Improved Dependency Management for Spark-based Datasets

MinuraPunchihewa commented 3 weeks ago

Description

This PR adds a _utils sub-package to house modules with common utility functions that are used across Spark-based datasets. This removes the need to install pyspark for datasets that run on Databricks.

Fixes https://github.com/kedro-org/kedro-plugins/issues/849

Development notes

The new _utils package organizes the utility functions into separate modules, including:

databricks_utils.py
spark_utils.py

Additional modules can be added to this sub-package to house code that is used in multiple datasets.
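
As a rough illustration of the idea (a minimal sketch, not the code in this PR), a shared utility module can defer importing pyspark until a function actually needs it, so that importing the module itself never requires pyspark. The names `lazy_import`, `running_on_databricks` and `get_spark_session` are hypothetical, and using the `DATABRICKS_RUNTIME_VERSION` environment variable to detect Databricks is an assumption made for this example.

```python
from __future__ import annotations

import importlib
import os
from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    # Only evaluated by type checkers, so pyspark is not needed at runtime.
    from pyspark.sql import SparkSession


def lazy_import(module_name: str) -> Any:
    """Import a module on demand and fail with a clear message if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required here but is not installed."
        ) from exc


def running_on_databricks() -> bool:
    """Assumption: Databricks runtimes set DATABRICKS_RUNTIME_VERSION."""
    return "DATABRICKS_RUNTIME_VERSION" in os.environ


def get_spark_session() -> SparkSession:
    """Hypothetical helper: pyspark is imported only when a session is requested."""
    sql = lazy_import("pyspark.sql")
    return sql.SparkSession.builder.getOrCreate()
```

With this layout, a dataset module can import the helpers unconditionally; a missing pyspark only surfaces as an error when `get_spark_session()` is called.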

These changes have been tested:

Manually, by running the code locally using ManagedTableDataset and ExternalTableDataset.
Via the existing unit tests.

Checklist

[X] Opened this PR as a 'Draft Pull Request' if it is work-in-progress
[ ] Updated the documentation to reflect the code changes
[ ] Added a description of this change in the relevant RELEASE.md file
[ ] Added tests to cover my changes
[ ] Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)

MinuraPunchihewa commented 3 weeks ago

Hey @noklam, I have to test this a little more thoroughly, but can you give me your opinion on the approach taken here?

MinuraPunchihewa commented 3 weeks ago

Hey @noklam, I've now had the opportunity to test out these changes and they seem to work fine. I've tested both ManagedTableDataset and ExternalTableDataset with the reduced dependencies without any issues.

MinuraPunchihewa commented 3 weeks ago

Some more comments I left but I only notice it wasn't sent properly 😅

Haha no problem. I've made the suggested improvements to the type hints, including a couple more involving DBUtils.

MinuraPunchihewa commented 1 week ago

Thanks for this contribution @MinuraPunchihewa ! ⭐ Can you update the release notes and add your change + your name to contributors?

Thanks, @merelcht. I've just updated the release notes.
