SparkHiveDataset is incompatible with Databricks Connect V2

alamastor commented 7 months ago

Description

SparkHiveDataset.exists raises when called using a Databricks Connect V2 SparkSession.

Using kedro-plugins commit f59e930, i.e. an unreleased version, downstream of https://github.com/kedro-org/kedro-plugins/pull/352 (which adds support for DB Connect V2).

This occurs because DB Connect V2 doesn't support accessing _jsparkSession on the SparkSession, but SparkHiveDataset.exists relies on it.

The obvious solution is to replace _get_spark()._jsparkSession.catalog().tableExists(self._database, self._table) with _get_spark().catalog.tableExists(self._database, self._table). However, there may be a reason _jsparkSession was used that I'm not aware of.
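A minimal sketch of what that replacement could look like, written as a standalone helper rather than the actual SparkHiveDataset method. The helper name table_exists and its parameters are hypothetical. One thing to watch: PySpark's public Catalog.tableExists takes (tableName, dbName), which is the reverse of the (database, table) order used in the JVM call, so the arguments are swapped here.

```python
def table_exists(spark, database: str, table: str) -> bool:
    """Check whether a table exists using only the public PySpark Catalog API.

    `spark` is assumed to be a SparkSession (classic or Spark Connect).
    Nothing here touches `_jsparkSession`, so no JVM access is needed.
    """
    # PySpark signature: Catalog.tableExists(tableName, dbName=None).
    # Note the order: table name first, database second.
    return spark.catalog.tableExists(table, database)
```

Inside the dataset this would presumably be called with self._database and self._table, but that wiring is an assumption about the class internals rather than something confirmed here.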

I'm happy to raise a PR with this change.

Context

Use SparkHiveDataset with Databricks Connect V2.

Steps to Reproduce

1. Install kedro-plugins from master / a commit downstream of https://github.com/kedro-org/kedro-plugins/pull/352
2. Set up Databricks Connect per https://docs.databricks.com/en/dev-tools/databricks-connect/python/install.html
3. Use a SparkHiveDataset

Expected Result

The dataset doesn't raise when calling _exists (this works with Databricks Connect V1).

Actual Result

[JVM_ATTRIBUTE_NOT_SUPPORTED] Attribute `_jsparkSession` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, do not use Spark Connect when creating your session.

sbrugman commented 6 months ago

I've also encountered this. catalog.tableExists was only introduced in Spark 3.3, so making this change would break some backwards compatibility (the current constraint is pyspark>=2.2). The datasets themselves require Python 3.9, which means the effective lower bound is already pyspark>3. I'm in favour of upgrading.
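If raising the lower bound turned out to be undesirable, one possible middle ground is a feature check with a fallback. The sketch below is only an illustration under assumptions: the helper name table_exists_compat is hypothetical, and the fallback relies on Catalog.listTables, which exists in older PySpark versions but does an exact-name comparison here and may raise if the database itself is missing.

```python
def table_exists_compat(spark, database: str, table: str) -> bool:
    """Check for a table without `_jsparkSession`, tolerating older PySpark.

    `spark` is assumed to be a SparkSession. Uses Catalog.tableExists when
    the installed PySpark provides it (3.3+), otherwise scans listTables.
    """
    catalog = spark.catalog
    if hasattr(catalog, "tableExists"):
        # PySpark order is (tableName, dbName).
        return catalog.tableExists(table, database)
    # Fallback for PySpark < 3.3: compare against the names in the database.
    return any(t.name == table for t in catalog.listTables(database))
```

The fallback costs a full listing of the database, so simply requiring pyspark>=3.3 as suggested above is probably the simpler route.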

merelcht commented 1 day ago

We welcome PR contributions to fix this!
