kedro-org / kedro-plugins

First-party plugins maintained by the Kedro team.
Apache License 2.0

Support `SparkDataset` authentication via Unity Catalog and Databricks external locations #836

Open MigQ2 opened 2 months ago

MigQ2 commented 2 months ago

Context

Currently, the preferred method of authenticating with a data lake or cloud storage when using Databricks is via Unity Catalog and external locations, rather than authenticating to the storage directly.

If properly configured, when using Databricks or databricks-connect, one should be able to use Spark to read from cloud storage without explicitly providing a key or any other direct authentication method for the storage. This is safer, more auditable, and allows more granular access control.

Description

When using Azure and abfss:// paths, the current SparkDataset implementation tries to connect to the storage directly using fsspec and a credential when initializing the Dataset.

Therefore, it forces me to give my kedro project a direct credential for the abfss:// ADLS storage.

I want my kedro project to be able to read and write with Spark using Unity Catalog external location authentication, without having direct access to the underlying storage.

I'm not clear on why SparkDataset needs to initialize the filesystem at all. It seems to be used later in _load_schema_from_file(), but the reason for that isn't obvious to me either.

Possible Implementation

Would it be possible to completely remove all fsspec interactions with the data and make it all via Spark?
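As a rough illustration of what I mean (a sketch, not a concrete proposal): load and save could go through the active SparkSession only, with no fsspec filesystem object, so access control stays entirely with Unity Catalog. The function names, defaults, and paths below are hypothetical:

```python
# Hypothetical sketch: all I/O goes through Spark, so access is governed by
# Unity Catalog external location grants rather than a storage credential.
# Names and defaults here are illustrative only.

def spark_load(spark, path, file_format="delta", load_args=None):
    """Read a DataFrame using only Spark APIs (no fsspec credential needed)."""
    return spark.read.load(path, format=file_format, **(load_args or {}))


def spark_save(df, path, file_format="delta", mode="overwrite", save_args=None):
    """Write a DataFrame using only Spark APIs."""
    df.write.save(path, format=file_format, mode=mode, **(save_args or {}))
```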

noklam commented 2 months ago

https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-4.1.0/api/kedro_datasets.databricks.ManagedTableDataset.html

Would this dataset help?

MigQ2 commented 2 months ago

Not directly, because I use external tables and dynamicPartitionOverwrite, which don't seem to be supported.

I could probably create a custom UnityCatalogTableDataset and make it work for me, but I feel my use case is common enough to make it worth building something everyone can use.
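For what it's worth, a minimal sketch of what such a custom dataset might look like, with all I/O going through Spark table APIs. The class name, constructor signature, and the per-write `partitionOverwriteMode` option are assumptions on my part, not a settled design; a real implementation would subclass kedro's AbstractDataset:

```python
class UnityCatalogTableDataset:
    """Hypothetical sketch: read/write Unity Catalog tables purely via Spark,
    so authorisation stays with Unity Catalog rather than a storage credential.
    Not a real kedro-datasets class; names and arguments are illustrative."""

    def __init__(self, table, write_mode="overwrite",
                 partition_columns=None, dynamic_partition_overwrite=False):
        self._table = table
        self._write_mode = write_mode
        self._partition_columns = partition_columns or []
        self._dynamic = dynamic_partition_overwrite

    def load(self, spark):
        # Reading by table name means no direct storage access is needed.
        return spark.read.table(self._table)

    def save(self, df):
        writer = df.write.mode(self._write_mode)
        if self._dynamic:
            # Replace only the partitions present in the incoming DataFrame.
            writer = writer.option("partitionOverwriteMode", "dynamic")
        if self._partition_columns:
            writer = writer.partitionBy(*self._partition_columns)
        writer.saveAsTable(self._table)
```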

I think it would be great to have an opinionated way of easily integrating kedro with the latest Databricks features (Unity Catalog, workflows, external locations, databricks-connect, databricks-hosted mlflow, etc.), as it is the most common ML platform used with kedro (used by 43% of kedro users)

If you have any ideas in mind I can try to help with discussions or implementation

MinuraPunchihewa commented 2 months ago

I think this PR will potentially resolve this? https://github.com/kedro-org/kedro-plugins/pull/827

noklam commented 2 months ago

@MigQ2, I think #827 is a good direction for a few reasons.

If #827 is merged, would that be enough to solve your problem?

MigQ2 commented 2 months ago

I agree, merging #827 would give me a working solution. Still, it would be nice to align both datasets in the future, but that wouldn't be a blocker.