MigQ2 opened 2 months ago
Not directly, because I use external tables and dynamicPartitionOverwrite, which don't seem to be supported.
I could probably create a custom UnityCatalogTableDataset
and make it work for me, but I feel my use case is common enough that it's worth building something everyone can use
I think it would be great to have an opinionated way of easily integrating kedro with the latest Databricks features (Unity Catalog, workflows, external locations, databricks-connect, databricks-hosted mlflow, etc.), as it is the most common ML platform used with kedro (used by 43% of kedro users)
If you have any ideas in mind I can try to help with discussions or implementation
I think this PR will potentially resolve this? https://github.com/kedro-org/kedro-plugins/pull/827
@MigQ2, I think #827 is a good direction for a few reasons: it changes the `databricks` datasets instead of modifying the generic `SparkDataset`. I agree there is room to align these datasets. If #827 is merged, would that be enough to solve your problem?
I agree, merging #827 would give me a working solution. It would still be nice to align both datasets in the future, but that wouldn't be a blocker.
Context
Currently, the preferred method of authentication with a datalake or cloud storage when using Databricks is via Unity Catalog and external locations, not directly authenticating to the storage.
If properly configured, when using Databricks or databricks-connect one should be able to use Spark to read from cloud storage without explicitly providing a key or other direct authentication method for the storage. This is safer, more auditable, and gives more granular access control.
Description
When using Azure and `abfss://` paths, the current `SparkDataset` implementation tries to connect to the storage directly using fsspec and a credential when initializing the dataset. Therefore, it forces me to give my kedro project a credential for the `abfss://` ADLS storage.

I want my kedro project to read and write with Spark using Unity Catalog external location authentication, without having direct access to the underlying storage.
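For illustration, a catalog entry like the following (the dataset name, account, and container are made up) is what triggers the problem: `SparkDataset` opens an fsspec filesystem for the `filepath` at init time, so a storage credential must be available even though Spark itself could read the path through the external location:

```yaml
# Hypothetical catalog.yml entry; names and path are illustrative only.
raw_events:
  type: spark.SparkDataset
  filepath: abfss://landing@myaccount.dfs.core.windows.net/raw/events
  file_format: delta
```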
I'm not clear on why `SparkDataset` needs to initialize the filesystem. It seems to be used later in `_load_schema_from_file()`, but I don't understand why that is needed.

Possible Implementation
Would it be possible to completely remove all fsspec interactions with the data and make it all via Spark?
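As a rough sketch of that direction, a table-based dataset could go through the Unity Catalog metastore only, never fsspec. This is illustrative, not an existing kedro-datasets API: the class name and parameters are assumptions, and a real implementation would subclass `kedro.io.AbstractDataset` (the base class is omitted here to keep the snippet self-contained):

```python
# Hypothetical Spark-only dataset sketch. A real version would subclass
# kedro.io.AbstractDataset; names here are illustrative, not an existing API.


class UnityCatalogTableDataset:
    def __init__(self, table: str, write_mode: str = "overwrite") -> None:
        self._table = table          # fully qualified: "catalog.schema.table"
        self._write_mode = write_mode

    def _get_spark(self):
        # An active session already exists on a Databricks cluster
        # and when using databricks-connect.
        from pyspark.sql import SparkSession

        return SparkSession.builder.getOrCreate()

    def _load(self):
        # Read via the metastore: Unity Catalog resolves the external
        # location and its credential, so no storage key is needed here.
        return self._get_spark().table(self._table)

    def _save(self, data) -> None:
        spark = self._get_spark()
        # Replace only the partitions present in `data`
        # (dynamicPartitionOverwrite) instead of the whole table.
        spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
        data.write.mode(self._write_mode).saveAsTable(self._table)

    def _describe(self) -> dict:
        return {"table": self._table, "write_mode": self._write_mode}
```

Because all I/O goes through `spark.table()` and `saveAsTable()`, the kedro project never needs a direct credential for the underlying storage.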