kedro-org / kedro-plugins

First-party plugins maintained by the Kedro team.
Apache License 2.0

External tables support for SparkHiveDataSet #163

Open DebanjanBanerjeeQB opened 1 year ago

DebanjanBanerjeeQB commented 1 year ago

Description

SparkHiveDataSet does not support external Hive tables at the moment. External tables are often encountered when the organisation's database lives outside Hive and the table needs to be hosted in Hive. More info is available at: https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/using-hiveql/content/hive_create_an_external_table.html

Context

This will broaden the scope of Hive datasets. Right now, any externally managed Hive dataset needs to be referenced via a custom dataset, and this happens quite often.

Possible Implementation

The implementation is super simple. The user needs to specify the keyword "External" in the DDL and a path for the table schema. Both can be supplied via the catalog. Based on this input, the dataset should be able to decide internally how to proceed and load/save data accordingly.
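To make the idea concrete, here is a minimal sketch of how the DDL could be assembled from catalog-style inputs. Everything in it is an assumption for illustration: `build_create_table_ddl`, its parameters (`external`, `location`, `file_format`) and the choice to require a path for external tables are hypothetical and are not part of the current SparkHiveDataSet API.

```python
from typing import Mapping, Optional


def build_create_table_ddl(
    database: str,
    table: str,
    columns: Mapping[str, str],
    external: bool = False,
    location: Optional[str] = None,
    file_format: str = "PARQUET",
) -> str:
    """Build a Hive CREATE TABLE statement.

    Hypothetical helper for illustration only; not part of Kedro.
    """
    # Assumption of this sketch: an external table must come with an
    # explicit path, mirroring the "keyword plus path" inputs in the proposal.
    if external and not location:
        raise ValueError("An external table needs a location.")

    # Only the keyword changes between managed and external tables.
    keyword = "EXTERNAL TABLE" if external else "TABLE"

    # Column names are back-quoted; types are passed through unchanged.
    column_spec = ", ".join(
        f"`{name}` {dtype}" for name, dtype in columns.items()
    )

    ddl = (
        f"CREATE {keyword} IF NOT EXISTS `{database}`.`{table}` "
        f"({column_spec}) STORED AS {file_format}"
    )

    # Append the path as Hive's LOCATION clause when one was given.
    if location:
        ddl += f" LOCATION '{location}'"
    return ddl
```

The resulting string could then be passed to `spark.sql(...)` on a Hive-enabled SparkSession before the save logic runs. With `external=False` and no `location`, the statement stays a plain managed-table DDL, so existing catalog entries would not need to change.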

Possible Alternatives

Accessing the Hive table via HQL. This again requires a custom dataset (e.g. a HiveQueryDataSet) that can access the metastore and run the query, which is a bit slow.

merelcht commented 1 year ago

Thanks for the suggestion @DebanjanBanerjeeQB! We would very much welcome a contribution for this. Since this is a datasets-related issue, please add any contributions in the new datasets repo: https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets

MinuraPunchihewa commented 2 weeks ago

Hey @merelcht, I would like to take this up.

merelcht commented 2 weeks ago

Thanks @MinuraPunchihewa ! Just go ahead and create a PR whenever you're ready 🙂