Open alamastor opened 7 months ago
I've also encountered this. `catalog.tableExists` was only introduced in Spark 3.3, so making this change will break some backwards compatibility (the current constraint is `pyspark>=2.2`). The datasets themselves require Python 3.9, which means the effective lower bound is already `pyspark>3`. I'm in favour of upgrading.
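If dropping older PySpark versions turned out to be undesirable, one option would be to guard the call. The sketch below is only an illustration of that idea: `table_exists_compat` is a hypothetical helper (not part of kedro-datasets), `spark` is assumed to be any session object exposing the public `catalog` attribute, and the `listTables` fallback is an assumption about what such a shim could look like.

```python
from typing import Any


def table_exists_compat(spark: Any, database: str, table: str) -> bool:
    """Hypothetical helper: check for a table using only the public catalog API."""
    catalog = spark.catalog
    if hasattr(catalog, "tableExists"):
        # PySpark >= 3.3: the Python signature is (tableName, dbName).
        return bool(catalog.tableExists(table, database))
    # Older PySpark: fall back to scanning the tables listed for the database.
    return any(t.name == table for t in catalog.listTables(database))
```

The fallback lists every table in the database, so it may be slower on large catalogs; raising the version constraint avoids the branch entirely.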
We welcome PR contributions to fix this!
Description
`SparkHiveDataset.exists` raises when called using a Databricks Connect V2 `SparkSession`.

Using `kedro-plugins` commit `f59e930`, i.e. an unreleased version, downstream of https://github.com/kedro-org/kedro-plugins/pull/352 (which adds support for DB Connect V2).

This occurs because DB Connect V2 doesn't support accessing `_jsparkSession` on the `SparkSession`, but it's used in `SparkHiveDataset.exists`.

The obvious solution is to replace `_get_spark()._jsparkSession.catalog().tableExists(self._database, self._table)` with `_get_spark().catalog.tableExists(self._database, self._table)`, however there may be a reason `_jsparkSession` was used that I'm not aware of.

I'm happy to raise a PR with this change.
Context
Use `SparkHiveDataset` with Databricks Connect V2.

Steps to Reproduce
1. `kedro-plugins` from master / a commit downstream of https://github.com/kedro-org/kedro-plugins/pull/352
2. `SparkHiveDataset`

Expected Result
The dataset doesn't raise when calling `_exists` (works with Databricks Connect V1).

Actual Result