kedro-org / kedro-plugins

First-party plugins maintained by the Kedro team.
Apache License 2.0
94 stars 90 forks source link

feat(datasets): Added the Experimental ExternalTableDataset for Databricks #885

Closed MinuraPunchihewa closed 1 month ago

MinuraPunchihewa commented 1 month ago

Description

This PR adds the ExternalTableDataset to support interactions with external tables in Databricks (Unity Catalog).

Fixes https://github.com/kedro-org/kedro-plugins/issues/817

Development notes

The ExternalTableDataset has been implemented by extending the BaseTableDataset that was added here.

These changes have been tested,

  1. Manually, by running the code against a Unity Catalog enabled workspace.
  2. Via the existing and newly added unit tests.

Checklist

MinuraPunchihewa commented 1 month ago

Hey @noklam, @ankatiyar, I've added the ExternalTableDataset as an experimental dataset here. Is there anything else I need to do here, like tests etc.?

noklam commented 1 month ago

@MinuraPunchihewa Tests are super welcomed, although we don't have a databricks cluster to run these tests unless there are ways to run this locally.

MinuraPunchihewa commented 1 month ago

@MinuraPunchihewa Tests are super welcomed, although we don't have a databricks cluster to run these tests unless there are ways to run this locally.

@noklam I just added some tests to cover the logic specific to ExternalTableDataset (outside of BaseTableDataset). I also made a few other changes to the existing tstssuch as removing some duplicate code and moving all of the fixtures to conftest.py.

noklam commented 1 month ago

FAILED tests/databricks/test_base_table_dataset.py::TestBaseTableDataset::test_save_overwrite

Is this an expected fail?

MinuraPunchihewa commented 1 month ago

FAILED tests/databricks/test_base_table_dataset.py::TestBaseTableDataset::test_save_overwrite

Is this an expected fail?

@noklam It has been my experience that when overwriting data to an external table of a format other than Delta, the location has to be provided. Since I am an external location fixture here which requires an environment variable called DATABRICKS_EXTERNAL_LOCATION that does not exist in the test environment, yes, this is expected to fail.

I should have specified a different format though. I just made a commit with that change.

Can you give me an explanation of how these tests are running? There should be a Spark environment available for them to run at all, right?

noklam commented 1 month ago

Since this is an experimental dataset, test will go to here: https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets_experimental/tests

This mean tests are not going to be run per every commit in CI. If you look at other Spark tests we run them in a local mode. With Databricks it's unclear whether it possible to run it in a local mode, especially with the platform UnityCatalog (which is different from the open source version).

MinuraPunchihewa commented 1 month ago

Since this is an experimental dataset, test will go to here: https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets_experimental/tests

This mean tests are not going to be run per every commit in CI. If you look at other Spark tests we run them in a local mode. With Databricks it's unclear whether it possible to run it in a local mode, especially with the platform UnityCatalog (which is different from the open source version).

Got it. I just moved the tests; I had to copy the conftest.py from as well.

noklam commented 1 month ago

tests are now passing and looks good to me.

MinuraPunchihewa commented 1 month ago

@merelcht I've made the requested changes. Do let me know if I've updated the RELEASE document correctly.

merelcht commented 1 month ago

@merelcht I've made the requested changes. Do let me know if I've updated the RELEASE document correctly.

I've moved it to the section for upcoming release, because 5.1.0 is already out 🙂 Thank again! I'll make sure this gets merged.

merelcht commented 1 month ago

Oh actually @MinuraPunchihewa I just realised this dataset wasn't added to https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/pyproject.toml#L176 or https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/pyproject.toml#L285. Can you update the pyproject.toml with the necessary dependencies if any?

ankatiyar commented 1 month ago

Oh actually @MinuraPunchihewa I just realised this dataset wasn't added to https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/pyproject.toml#L176 or https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/pyproject.toml#L285. Can you update the pyproject.toml with the necessary dependencies if any?

Also here: https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/docs/source/api/kedro_datasets_experimental.rst :)

MinuraPunchihewa commented 1 month ago

Hey guys, Apologies, I've missed this. I've made the changes now. @merelcht @ankatiyar