kedro-org / kedro-plugins

First-party plugins maintained by the Kedro team.
Apache License 2.0

kedro-datasets: ExternalTableDataset for Databricks #817

Closed: MinuraPunchihewa closed this issue 3 weeks ago

MinuraPunchihewa commented 2 months ago

Description

I use Kedro on Databricks heavily, but in most of my projects the datasets are external tables (for a myriad of reasons), and I am currently unable to perform my IO operations on them easily.

Context

As I said above, my datasets are maintained as external tables, and I am certain there are many other users (and organizations) out there that follow a similar pattern, at least for some of their datasets.

Possible Implementation

It should be possible to implement an ExternalTableDataset for Kedro to support these IO operations.
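
For illustration, here is a minimal sketch of the core Spark IO such a dataset would wrap. The class does not exist yet, and the table and storage location names below are made up; the key point is that saving to an explicit path is what keeps a table external rather than managed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

table = "main.sales.orders"  # illustrative Unity Catalog table name
location = "abfss://container@account.dfs.core.windows.net/sales/orders"

# Loading an external table is no different from loading a managed one:
df = spark.read.table(table)

# Saving differs: writing to an explicit path keeps the table external,
# so the metastore only stores a pointer to `location`:
(
    df.write.format("delta")
    .mode("overwrite")
    .option("path", location)
    .saveAsTable(table)
)
```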

Possible Alternatives

I have considered converting all of my datasets on Databricks to Managed tables, but at the moment, this does not make sense for me from an architectural point of view.

MinuraPunchihewa commented 2 months ago

Hey @astrojuanlu and the Kedro team, I would love to work on this and make a contribution if you believe it would be useful.

astrojuanlu commented 2 months ago

Hi @MinuraPunchihewa , if you want to open a pull request please go ahead!

MinuraPunchihewa commented 2 months ago

Thanks, @astrojuanlu. Will do.

MigQ2 commented 1 month ago

Hi @MinuraPunchihewa, I'm a community user in a very similar situation to yours. Thanks for tackling this external tables implementation, great initiative!

I have some things in mind that I think would be good to consider if we are to refactor this dataset. I'll drop them here so that you and the Kedro maintainers can discuss the best way to tackle them. I'm happy to contribute as well.

  1. I think the writing parameters are a bit too strict. How about we use something like save_args in SparkDataset to make it more flexible? That way I could set things like mergeSchema=True or partitionOverwriteMode='dynamic' and customize write options more freely. Could this also remove the need for explicitly declaring partition_columns or any other parameter? (There is a sketch of this after the list.)

  2. As I mentioned here, the Spark session object is no longer a singleton in recent databricks-connect versions. This makes it impossible to pass Spark options, as the _get_spark() function does not accept any parameters. (See the second sketch after the list.)

  3. How about adding liquid clustering support? It is currently the way Databricks recommends optimizing reads and writes. (See the SQL sketch after the list.)

  4. How about we rename the owner_group parameter to something more general like table_owner? Users and service principals can own tables as well, not only groups. I also found that the docs say "if table access control is enabled in your workspace, specifying owner_group will transfer ownership of the table and database to this owner", but I did not find this logic anywhere in the code. Am I missing something, or is it not doing it? Also, I don't think it should change the database owner, just the table.

  5. Related to the previous point, there are some optional housekeeping tasks where it would be good to offer an opinionated approach for a Databricks+Kedro setup (e.g. managing table ownership, or running OPTIMIZE or VACUUM after every run). What do you think is the best way of tackling this? Adding parameters to the dataset like optimize_on_write=True and letting the dataset do all the logic? Or creating some ready-to-use hooks that people can take into their projects? (There is a hook sketch after the list.) Also, does it really make sense to have two different dataset classes for external and managed tables, or could a single UnityCatalogTableDataset class with a location parameter work, using managed tables when location is None and external tables otherwise?

  6. Should we rename kedro-datasets.databricks to kedro-datasets.unity_catalog considering Unity Catalog was recently open sourced?
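
A rough sketch of point 1, modelled on how SparkDataset forwards save_args to the writer; the function name and signature here are my own invention, not the current dataset code:

```python
from typing import Optional

from pyspark.sql import DataFrame


def save_table(df: DataFrame, table: str, save_args: Optional[dict] = None) -> None:
    """Forward user-supplied save_args straight to the Delta writer."""
    args = dict(save_args or {})
    mode = args.pop("mode", "overwrite")
    writer = df.write.format("delta").mode(mode)
    # Everything else (mergeSchema, partitionOverwriteMode, ...) is passed
    # through untouched, so no write option needs its own dataset parameter:
    for key, value in args.items():
        writer = writer.option(key, str(value))
    writer.saveAsTable(table)
```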
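
On point 2, this is roughly what session creation looks like with recent databricks-connect (v13+), where the session is built per configuration rather than fetched as a singleton, which is why a parameterless _get_spark() falls short:

```python
from databricks.connect import DatabricksSession

# getOrCreate() picks up the ambient Databricks configuration; connection
# details can also be supplied explicitly through .remote(...), which is
# exactly the kind of parameter _get_spark() cannot currently accept:
spark = DatabricksSession.builder.getOrCreate()
```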
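
For points 3 and 4, the underlying SQL is straightforward; the table and principal names below are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Point 3: liquid clustering is declared with CLUSTER BY at table creation,
# replacing static partition columns:
spark.sql(
    """
    CREATE TABLE IF NOT EXISTS main.sales.orders (
        order_id BIGINT,
        order_date DATE
    )
    USING DELTA
    CLUSTER BY (order_date)
    """
)

# Point 4: ownership can be transferred to any principal (a user, a
# service principal, or a group), hence table_owner over owner_group:
spark.sql("ALTER TABLE main.sales.orders SET OWNER TO `data-platform-team`")
```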
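
And for point 5, a minimal sketch of the hooks route. hook_impl and the after_dataset_saved spec are real Kedro APIs, but the class and the dataset-to-table mapping are hypothetical:

```python
from typing import Any

from kedro.framework.hooks import hook_impl
from pyspark.sql import SparkSession


class DeltaHousekeepingHooks:
    """Run OPTIMIZE on selected tables after Kedro saves them."""

    # Hypothetical mapping from catalog dataset names to physical tables:
    TABLES_TO_OPTIMIZE = {"orders": "main.sales.orders"}

    @hook_impl
    def after_dataset_saved(self, dataset_name: str, data: Any) -> None:
        table = self.TABLES_TO_OPTIMIZE.get(dataset_name)
        if table is not None:
            spark = SparkSession.builder.getOrCreate()
            spark.sql(f"OPTIMIZE {table}")
```

Registering it would be the usual HOOKS = (DeltaHousekeepingHooks(),) in the project's settings.py.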

If you think these points and this discussion are meaningful, I can keep posting ideas as they come to mind and contribute to them.

We can also take the detailed conversation to the Kedro Slack to be more agile; you can find me here.

noklam commented 1 month ago

@MigQ2

> Should we rename kedro-datasets.databricks to kedro-datasets.unity_catalog considering Unity Catalog was recently open sourced?

On this point, I would say it is not ready. We did some research recently looking at Unity Catalog (from Databricks) and Polaris (from Snowflake); they are far from mature, and we cannot expect interoperability.

MinuraPunchihewa commented 1 month ago

Hey @MigQ2, thank you very much for the insight. You make some great points. With the agreement of the Kedro team, I will definitely take these into account moving forward.