neo4j / neo4j-spark-connector

Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs
https://neo4j.com/developer/spark/
Apache License 2.0

Neo4j Spark Connector integration with pyspark #273

Closed JArma19 closed 3 years ago

JArma19 commented 3 years ago

Hi, I'm trying to read nodes from my local Neo4j DB for practice purposes using PySpark and the Neo4j connector. I've already downloaded the latest version of neo4j-connector-apache-spark (2.12) and integrated it into PySpark as explained in the README. However, when I try to perform a read using:

```python
from pyspark.sql import SparkSession
import os

os.environ["JAVA_HOME"] = "C:\Program Files\Java\jdk-15.0.1"
os.environ["HADOOP_HOME"] = "C:\Users\arman\Desktop\winutils"
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars file:///C:\Users\arman\Desktop\prova\venv\Lib\site-packages\pyspark\jars\neo4j-connector-apache-spark_2.12-4.0.0.jar pyspark-shell'

spark = SparkSession.builder \
    .config('spark.jars', 'C:\Users\arman\Desktop\prova\venv\Lib\site-packages\pyspark\jars\neo4j-connector-apache-spark_2.12-4.0.0.jar') \
    .config('spark.jars.packages', 'neo4j-contrib:neo4j-connector-apache-spark_2.12:4.0.0') \
    .getOrCreate()

spark.read.format("org.neo4j.spark.DataSource") \
    .option("url", "bolt://localhost:7687") \
    .option("authentication.basic.username", "neo4j") \
    .option("authentication.basic.password", "justin") \
    .option("labels", "Person") \
    .load() \
    .show()
```
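[Editor's note] Independent of the connector error below, the Windows paths as pasted would trip over Python's string escapes: in Python 3, `\U` inside a plain literal such as `"C:\Users\..."` is parsed as a unicode escape and raises a `SyntaxError`. A minimal sketch of two safer spellings, using a hypothetical path:

```python
# Backslashes in Windows paths clash with Python escape sequences
# (\U, \n, \t, ...). Raw strings (r"...") keep backslashes literal;
# forward slashes also work for Spark/JVM file paths on Windows.
raw_path = r"C:\Users\example\jars\neo4j-connector-apache-spark_2.12-4.0.0.jar"
fwd_path = "C:/Users/example/jars/neo4j-connector-apache-spark_2.12-4.0.0.jar"

# Both spellings name the same file; the raw string preserves the
# backslashes instead of interpreting \U as a unicode escape.
assert "\\Users" in raw_path
assert raw_path.replace("\\", "/") == fwd_path
```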

I get the following error:

```
py4j.protocol.Py4JJavaError: An error occurred while calling o38.load.
: java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/ReadSupport
```

I think it could be related to the format string "org.neo4j.spark.DataSource", but I don't know how to fix it.

I think I'm doing something wrong during configuration. Could you please suggest a guide or tutorial on how to set up PySpark properly to run the Neo4j connector? Thanks for your attention.

conker84 commented 3 years ago

@JArma19 are you sure that you're using Spark with Scala 2.12? Can you please share your Spark artifact name?

JArma19 commented 3 years ago

@conker84 Yes, I'm sure. I'm using PySpark 3.0.1, which runs on Scala 2.12 according to the Spark docs (sorry, I don't know where I can find the Spark artifact name).
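[Editor's note] The Scala binary version being discussed here is encoded in the artifact name suffix (e.g. `_2.12` in `neo4j-connector-apache-spark_2.12-4.0.0.jar`). A small sketch, with a hypothetical helper, for pulling that suffix out so it can be compared against the Scala build of your Spark installation:

```python
import re

def scala_build(artifact_name: str) -> str:
    """Extract the Scala binary version (e.g. '2.12') from a Spark
    artifact name like 'neo4j-connector-apache-spark_2.12-4.0.0.jar'."""
    match = re.search(r"_(\d+\.\d+)(?:-|$)", artifact_name)
    if match is None:
        raise ValueError(f"no Scala version suffix in {artifact_name!r}")
    return match.group(1)

# The connector jar from the question is a Scala 2.12 build:
print(scala_build("neo4j-connector-apache-spark_2.12-4.0.0.jar"))  # 2.12
```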

conker84 commented 3 years ago

OK, so that's the problem: we don't support Spark 3.x yet. We plan to add that support during this month; in the meantime you should use Spark >= 2.4.5 (the 2.4.x line).
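[Editor's note] With a Spark 2.4.x installation, letting `--packages` resolve the connector (instead of hand-wiring jar paths and `PYSPARK_SUBMIT_ARGS`) keeps the Scala suffix and connector version in one place. A sketch, assuming the same `neo4j-contrib` coordinates used in the question:

```shell
# Fetch the Scala 2.12 build of the connector from Maven Central
# and start a PySpark shell against a Spark 2.4.x installation.
pyspark --packages neo4j-contrib:neo4j-connector-apache-spark_2.12:4.0.0
```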

conker84 commented 3 years ago

I'm closing this since we found the solution; feel free to reopen if you need to.

JArma19 commented 3 years ago

OK, you actually solved my problem: I downgraded to 2.4.5 and finally it worked. Thanks for your help!