lucidworks / spark-solr

Tools for reading data from Solr as a Spark RDD and indexing objects from Spark into Solr using SolrJ.
Apache License 2.0

Support for Spark 3.x and Scala 2.12.x #322

Closed rajeshwrn closed 2 years ago

rajeshwrn commented 3 years ago
  1. Support for the latest Spark and Scala versions: Spark 3.0.1 and Scala 2.12.12.
  2. Updated the Hadoop dependency to 3.2.2.
  3. Upgraded Maven to 3.
  4. Fixed deprecation warnings under Scala 2.12 and Spark 3, along with some minor code-quality warnings.

Build from source: `mvn clean package -DskipTests`
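
After a successful build, the shaded jar under target/ can be passed to spark-shell. A sketch of the invocation; the exact jar name is an assumption and depends on the version being built:

```
spark-shell --jars target/spark-solr-4.0.0-SNAPSHOT-shaded.jar
```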

falloutdurham commented 3 years ago

Oh my, we were hoping to get around to this soon, but this is great! Let me check it out and we'll try to get a new major release out in June :)

falloutdurham commented 3 years ago

@rajeshwrn Having some trouble testing this out via the unit tests and the integration tests right now. Will continue working on it to see where I'm going wrong :)

avenherak commented 3 years ago

> Oh my, we were hoping to get around to this soon, but this is great! Let me check it out and we'll try to get a new major release out in June :)

@falloutdurham When will the new release of spark-solr be out, and what is the current version support table? (The README.md seems to be out of date.)

viktor-klymenko commented 3 years ago

Hi guys, do you know the status of this PR? Is it planned to be released soon?

viktor-klymenko commented 3 years ago

Btw, @rajeshwrn great work!!

barcac commented 3 years ago

Today I tried building this PR locally and using it in a flow that dumps one of the collections from Solr v8, e.g.:

```scala
val containersMetadataDF = spark.read.format("solr")
  .options(Map(
    "zkhost" -> "zookeeper1:2181,zookeeper2:2181,zookeeper3:2181/solr-green",
    "collection" -> "containerMetadata"))
  .load()
```

This basically failed with:

```
2021-08-24 12:19:02 WARN  BaseHttpClusterStateProvider - Attempt to fetch cluster state from zookeeper1:2181,zookeeper2:2181,zookeeper3:2181/solr-green failed.
org.apache.solr.client.solrj.SolrServerException: IOException occurred when talking to server at: zookeeper1:2181,zookeeper2:2181,zookeeper3:2181/solr-green
    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:695)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:266)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
    at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290)
    at org.apache.solr.client.solrj.impl.BaseHttpClusterStateProvider.fetchLiveNodes(BaseHttpClusterStateProvider.java:190)
    at org.apache.solr.client.solrj.impl.BaseHttpClusterStateProvider.init(BaseHttpClusterStateProvider.java:64)
    at org.apache.solr.client.solrj.impl.HttpClusterStateProvider.<init>(HttpClusterStateProvider.java:34)
    at org.apache.solr.client.solrj.impl.CloudSolrClient$Builder.build(CloudSolrClient.java:464)
    at com.lucidworks.spark.util.SolrSupport$.getSolrCloudClient(SolrSupport.scala:220)
    at com.lucidworks.spark.util.SolrSupport$.getNewSolrCloudClient(SolrSupport.scala:242)
    at com.lucidworks.spark.util.CacheCloudSolrClient$$anon$1.load(SolrSupport.scala:38)
    at com.lucidworks.spark.util.CacheCloudSolrClient$$anon$1.load(SolrSupport.scala:36)
    at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
    at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
    at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
    at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
    at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
    at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
    at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
    at com.lucidworks.spark.util.SolrSupport$.getCachedCloudClient(SolrSupport.scala:250)
    at com.lucidworks.spark.util.SolrSupport$.getSolrBaseUrl(SolrSupport.scala:254)
    at com.lucidworks.spark.SolrRelation.dynamicSuffixes$lzycompute(SolrRelation.scala:99)
    at com.lucidworks.spark.SolrRelation.dynamicSuffixes(SolrRelation.scala:97)
    at com.lucidworks.spark.SolrRelation.getBaseSchemaFromConfig(SolrRelation.scala:298)
    at com.lucidworks.spark.SolrRelation.querySchema$lzycompute(SolrRelation.scala:238)
    at com.lucidworks.spark.SolrRelation.querySchema(SolrRelation.scala:107)
    at com.lucidworks.spark.SolrRelation.schema(SolrRelation.scala:427)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:448)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:326)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:308)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:308)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:226)
    at $line26.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:23)
    at $line26.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:27)
    at $line26.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(<console>:29)
    at $line26.$read$$iw$$iw$$iw$$iw$$iw.<init>(<console>:31)
    at $line26.$read$$iw$$iw$$iw$$iw.<init>(<console>:33)
    at $line26.$read$$iw$$iw$$iw.<init>(<console>:35)
    at $line26.$read$$iw$$iw.<init>(<console>:37)
    at $line26.$read$$iw.<init>(<console>:39)
    at $line26.$read.<init>(<console>:41)
    at $line26.$read$.<init>(<console>:45)
    at $line26.$read$.<clinit>(<console>)
    at $line26.$eval$.$print$lzycompute(<console>:7)
    at $line26.$eval$.$print(<console>:6)
    at $line26.$eval.$print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:745)
    at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1021)
    at scala.tools.nsc.interpreter.IMain.$anonfun$interpret$1(IMain.scala:574)
    at scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:41)
    at scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:37)
    at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:41)
    at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:573)
    at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:600)
    at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:570)
    at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:894)
    at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:762)
    at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:464)
    at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:485)
    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:239)
    at org.apache.spark.repl.Main$.doMain(Main.scala:78)
    at org.apache.spark.repl.Main$.main(Main.scala:58)
    at org.apache.spark.repl.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:959)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1038)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1047)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: shaded.apache.http.client.ClientProtocolException: URI does not specify a valid host name: zookeeper1:2181,zookeeper2:2181,zookeeper3:2181/solr-green/admin/collections?action=CLUSTERSTATUS&wt=javabin&version=2
    at shaded.apache.http.impl.client.CloseableHttpClient.determineTarget(CloseableHttpClient.java:95)
    at shaded.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
    at shaded.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:571)
    ... 79 more
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: Couldn't initialize a HttpClusterStateProvider (is/are the Solr server(s), [zookeeper1:2181,zookeeper2:2181,zookeeper3:2181/solr-green], down?)
  at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2263)
  at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
  at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
  at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
  at com.lucidworks.spark.util.SolrSupport$.getCachedCloudClient(SolrSupport.scala:250)
  at com.lucidworks.spark.util.SolrSupport$.getSolrBaseUrl(SolrSupport.scala:254)
  at com.lucidworks.spark.SolrRelation.dynamicSuffixes$lzycompute(SolrRelation.scala:99)
  at com.lucidworks.spark.SolrRelation.dynamicSuffixes(SolrRelation.scala:97)
  at com.lucidworks.spark.SolrRelation.getBaseSchemaFromConfig(SolrRelation.scala:298)
  at com.lucidworks.spark.SolrRelation.querySchema$lzycompute(SolrRelation.scala:238)
  at com.lucidworks.spark.SolrRelation.querySchema(SolrRelation.scala:107)
  at com.lucidworks.spark.SolrRelation.schema(SolrRelation.scala:427)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:448)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:326)
  at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:308)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:308)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:226)
  ... 47 elided
Caused by: java.lang.RuntimeException: Couldn't initialize a HttpClusterStateProvider (is/are the Solr server(s), [zookeeper1:2181,zookeeper2:2181,zookeeper3:2181/solr-green], down?)
  at org.apache.solr.client.solrj.impl.CloudSolrClient$Builder.build(CloudSolrClient.java:466)
  at com.lucidworks.spark.util.SolrSupport$.getSolrCloudClient(SolrSupport.scala:220)
  at com.lucidworks.spark.util.SolrSupport$.getNewSolrCloudClient(SolrSupport.scala:242)
  at com.lucidworks.spark.util.CacheCloudSolrClient$$anon$1.load(SolrSupport.scala:38)
  at com.lucidworks.spark.util.CacheCloudSolrClient$$anon$1.load(SolrSupport.scala:36)
  at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
  at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
  at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
  at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
  ... 64 more
Caused by: java.lang.RuntimeException: Tried fetching live_nodes using Solr URLs provided, i.e. [zookeeper1:2181,zookeeper2:2181,zookeeper3:2181/solr-green]. However, succeeded in obtaining the cluster state from none of them.If you think your Solr cluster is up and is accessible, you could try re-creating a new CloudSolrClient using working solrUrl(s) or zkHost(s).
  at org.apache.solr.client.solrj.impl.BaseHttpClusterStateProvider.init(BaseHttpClusterStateProvider.java:73)
  at org.apache.solr.client.solrj.impl.HttpClusterStateProvider.<init>(HttpClusterStateProvider.java:34)
  at org.apache.solr.client.solrj.impl.CloudSolrClient$Builder.build(CloudSolrClient.java:464)
  ... 72 more
```

I know this is not yet merged/released, but I was wondering whether you ran into the same issue during your tests. I've checked both Solr and ZK, and both can be accessed from the cluster, so there are no connectivity issues.

rajeshwrn commented 3 years ago

@barcac I've updated the code. It works now.

rajeshwrn commented 3 years ago

There was an issue in the Solr client creation: while updating the deprecated SolrJ method, I missed adding one parameter.

https://github.com/rajeshwrn/spark-solr/blob/bf4345d2a179247b606563160ce1631afd947246/src/main/scala/com/lucidworks/spark/util/SolrSupport.scala#L194-L200

From solrj7.3, the withZkHost() method was deprecated. I tried to change this with Builder constructor but in code solrConfig options only has zkhost. So I reverted this back to the older method. It is working now. Since it is solr version dependent, for spark3 this change is not required.
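
To make the failure mode concrete, here is a minimal sketch (not the actual SolrSupport code) of the two CloudSolrClient.Builder constructors involved. The single-list constructor treats its entries as Solr HTTP URLs, which is why the combined ZooKeeper connect string ended up in HttpClusterStateProvider and failed with "URI does not specify a valid host name" above:

```scala
import java.util.{Arrays, Optional}
import org.apache.solr.client.solrj.impl.CloudSolrClient

val zkConnect = "zookeeper1:2181,zookeeper2:2181,zookeeper3:2181/solr-green"

// Wrong for ZK strings: this constructor expects Solr HTTP URLs, so SolrJ
// initializes an HttpClusterStateProvider and fails as in the trace above.
val broken = new CloudSolrClient.Builder(Arrays.asList(zkConnect)).build()

// Correct: ZooKeeper hosts and chroot passed separately, so SolrJ reads the
// cluster state from ZooKeeper.
val working = new CloudSolrClient.Builder(
  Arrays.asList("zookeeper1:2181", "zookeeper2:2181", "zookeeper3:2181"),
  Optional.of("/solr-green")
).build()
```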

Looks like doing this properly requires some effort and design changes: we need to find a way to pass the zkhosts and the zkchroot separately in the options, as below:

```scala
val options = Map(
  "collection" -> "{solr_collection_name}",
  "zkhost" -> "{zkhost1,zkhost2,zkhost3}",
  "zkchroot" -> "{zkchroot string}")
val df = spark.read.format("solr")
  .options(options)
  .load
```

and

```scala
val zkServers = List(cloudClientParams.zkhost1, cloudClientParams.zkhost2, cloudClientParams.zkhost3)
val solrClientBuilder = new CloudSolrClient.Builder(zkServers, Optional.of(cloudClientParams.zkchroot))
```
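
A rough idea for the splitting (splitZkConnectString is a hypothetical helper, not existing spark-solr code), so both the old combined zkhost form and separate options could be supported:

```scala
import java.util.Optional
import scala.collection.JavaConverters._
import org.apache.solr.client.solrj.impl.CloudSolrClient

// Hypothetical helper: split "host1:2181,host2:2181/chroot" into the host
// list and optional chroot that CloudSolrClient.Builder expects.
def splitZkConnectString(zkHost: String): (java.util.List[String], Optional[String]) = {
  val slash = zkHost.indexOf('/')
  if (slash < 0) (zkHost.split(",").toList.asJava, Optional.empty[String]())
  else (zkHost.substring(0, slash).split(",").toList.asJava, Optional.of(zkHost.substring(slash)))
}

val (zkHosts, zkChroot) =
  splitZkConnectString("zookeeper1:2181,zookeeper2:2181,zookeeper3:2181/solr-green")
val client = new CloudSolrClient.Builder(zkHosts, zkChroot).build()
```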

rajeshwrn commented 3 years ago

@falloutdurham this may be the issue that you encountered while executing the unit test cases.

barcac commented 3 years ago

Hi @rajeshwrn! Thanks for the quick fix; I'll give it a try.

Looking forward to the release.

falloutdurham commented 2 years ago

@rajeshwrn I will give it another whirl this week! I notice that it's still failing on the test run. If it builds with -DskipTests, though, I'll see if I can look into those test failures and get them resolved.

falloutdurham commented 2 years ago

The good news is I've built the new version and it seems promising! I've fixed up some of the failing tests - just one more to go now, but that seems to be an issue with SolrRelation's interaction with Catalyst, so it may take me a little longer to track down and eliminate.

falloutdurham commented 2 years ago

Just got a complete clean run! 💃

Let me get some ducks in a row internally over the next few days - I'll try to merge some of the outstanding PRs and rebase this on top of them before cutting a 4.0.0 release.

eswara-prasad-tm commented 2 years ago

Awesome, Rajesh!

barcac commented 2 years ago

@rajeshwrn Tried it just now; with this PR's latest changes the snapshot build works in our case. No more connectivity issues.