josephd000 opened this issue 3 years ago
@josephd000 I think that is a good question for the Databricks folks.
My understanding is that there are some extra levels of indirection in the Spark connection when working with a Databricks cluster, plus some jar-loading logic built into the Databricks runtime that is entirely proprietary, so you will need some additional steps to make it work on a Databricks cluster.
Meanwhile, if I do find something simple that makes the Databricks use case work, I'll let you know.
@yitao-li, I went digging through the sparklyr.flint code and found the non-exported function `sparklyr.flint:::spark_dependencies()`. Running it returned:
```r
sparklyr.flint:::spark_dependencies(spark_version = "3.0.1", scala_version = "2.12")
$jars
NULL

$packages
[1] "org.clapper:grizzled-slf4j_2.12:1.3.4"      "org.sparklyr:sparklyr-flint_3-0_2-12:0.7.0"

$initializer
NULL

$catalog
NULL

$repositories
[1] "https://github.com/org-sparklyr/sparklyr.flint/raw/maven2"

attr(,"class")
[1] "spark_dependency"
```
I then created those "Libraries" on Databricks by entering the "packages" and "repositories" values in the "Coordinates" and "Repository" fields of the Databricks Library GUI, respectively (the exact values are shown below). After installing these two Libraries on my cluster, I was able to successfully use `sparklyr.flint::from_sdf()`! :)
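For reference, the values entered in the Library GUI, taken from the `spark_dependencies()` output above, would look like this (my assumption is that the custom repository is only needed for the sparklyr-flint artifact, since grizzled-slf4j is published on Maven Central):

```
# Maven library 1
Coordinates: org.clapper:grizzled-slf4j_2.12:1.3.4

# Maven library 2
Coordinates: org.sparklyr:sparklyr-flint_3-0_2-12:0.7.0
Repository:  https://github.com/org-sparklyr/sparklyr.flint/raw/maven2
```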
@josephd000 Good to know! :+1:
I guess I can look into whether those things can be streamlined a bit for Databricks clusters. In all other scenarios (e.g., working with an EMR cluster or running Spark in local mode) all dependencies are taken care of automatically based on what `sparklyr.flint:::spark_dependencies()` returns. I think `sparklyr` is trying to do the same with a Databricks connection as well, but probably installed the jar files to the wrong location somehow.
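For comparison, a minimal sketch of the non-Databricks path that works out of the box (local mode; the Spark version here is chosen to match the output above):

```r
library(sparklyr)
library(sparklyr.flint)

# In local mode (or on EMR), connecting is enough: sparklyr consults the
# extension's spark_dependencies() and fetches the listed packages from the
# listed repositories before launching the Spark session.
sc <- spark_connect(master = "local", version = "3.0.1")
```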
I have the same issue with Spark 3.1.1, Scala 2.12, sparklyr 1.7.1, and sparklyr.flint 0.2.1. I don't think I can install libraries on the cluster, so I hope there will be a smooth solution soon. Thank you for the great-looking package!
@kehldaniel Did you also create a `sparklyr` connection using

```r
sc <- spark_connect(method = "databricks")
```

or similar?
Yes. After first trying with my own code (which runs fine on my own laptop), I am now running the exact same lines of code as in the original post by josephd000, and I get the same error.
### Error

### Expectation

That I can use basic `sparklyr.flint` functions on Azure Databricks without classpath errors by using `install.packages("sparklyr.flint")`.

### Details

I've created a "Library" with `flint-0.6.0` from Maven and installed it onto my cluster, detached and reattached my notebook, called `library(sparklyr.flint)` before `spark_connect()`, and it still can't find the library.

### Config

### Reproducible code
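The reproducible code from the original report was not captured here. A hypothetical minimal reproduction based on the steps described above might look like the following; the data frame, column names, and `from_sdf()` arguments are illustrative, not the original poster's:

```r
install.packages("sparklyr.flint")

library(sparklyr)
library(sparklyr.flint)

# Connect to the existing Databricks cluster from a notebook.
sc <- spark_connect(method = "databricks")

# Hypothetical example data standing in for the original repro.
example_sdf <- sdf_copy_to(sc, data.frame(t = 1:10, v = rnorm(10)))

# On an affected cluster this is where the classpath error surfaces,
# because the sparklyr-flint jars were never made available to Spark.
ts_rdd <- from_sdf(example_sdf, is_sorted = TRUE,
                   time_unit = "SECONDS", time_column = "t")
```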