memsql / singlestore-spark-connector

A connector for SingleStore and Spark
Apache License 2.0

issue with number of partitions when we use spark.read #65

Closed · samminen closed this issue 3 years ago

samminen commented 3 years ago

Hello Team, here is my use case with the memsql Spark connector READ operation:

1. I have a table in MemSQL, and I read it with the connector's read operation:

   ```scala
   val DF = spark.read.format("memsql")
     .option("ddlEndpoint", "url value")
     .option("user", "username")
     .option("password", "pwd")
     .load("memsql table name")
   ```

2. Once it is in a Spark DataFrame, I use the Spark write API below to write the MemSQL data into a Hive table:

   ```scala
   DF.write.mode("overwrite").format("orc").saveAsTable("hive table name")
   ```

3. In this case, in my cluster, I see only one executor and one task running actively, even though I specified 20 cores and 120 executors.

The actual issue is that spark.read always creates the DataFrame as a single partition, so the job cannot use the available cores and executors.
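For example, checking with the standard Spark API confirms the single partition (a minimal diagnostic sketch):

```scala
// Standard Spark check: how many partitions back the DataFrame
println(DF.rdd.getNumPartitions)  // prints 1 in this case
```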

I tried DF.repartition(50) before the write operation, but no luck.

How can we make the memsql spark.read() API create the DataFrame with multiple partitions, so that our job uses the cluster resources effectively?

Because of this issue, it took 28 hours to read 100M records from the MemSQL table and write them to a Hive external table.

Any solution?

blinov-ivan commented 3 years ago

Hi @samminen, you could try enabling the enableParallelRead option; this allows Spark to read directly from the leaf nodes and increases read performance. Try enabling it like this:

```scala
val DF = spark.read.format("memsql")
  .option("ddlEndpoint", "url value")
  .option("user", "username")
  .option("password", "pwd")
  .option("enableParallelRead", "true")
  .load("memsql table name")
```

Please also read the explanation and notes about the enableParallelRead option in our docs. Feel free to ask further questions if that doesn't help.

P.S. Did you reassign your df after repartitioning? df.repartition(50) returns a new Dataset, so you need to assign the result to a variable and use that for the subsequent write.
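For example, a minimal sketch (variable names are illustrative):

```scala
// repartition returns a NEW DataFrame; the original DF is unchanged
val repartitionedDF = DF.repartition(50)
repartitionedDF.write.mode("overwrite").format("orc").saveAsTable("hive table name")
```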

samminen commented 3 years ago

Thank you for the update. When I try this option, I see the error below:

```
java.lang.NoSuchMethodError: spray.json.package$.enrichString(Ljava/lang/String;)Lspray/json/RichString;
	at com.memsql.spark.MemsqlQueryHelpers$.GetPartitions(MemsqlQueryHelpers.scala:40)
	at com.memsql.spark.MemsqlRDD.getPartitions(MemsqlRDD.scala:22)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
```

- spark connector jar: memsql-spark-connector_2.11-3.0.5-spark-2.3.4.jar
- spray-json jar: spray-json_2.11-1.3.2.jar
- mariadb jar: mariadb-java-client-2.7.1.jar

If I remove the enableParallelRead option, I do not see the above error. How can I use this feature to read in parallel?

What is missing?

blinov-ivan commented 3 years ago

@samminen, we are using spray-json 1.3.5. It seems your jar version is older. Please use version 1.3.5 or newer.
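For example, if you manage dependencies with sbt, the upgrade would look something like this (a sketch, assuming the standard io.spray coordinates that match the Scala 2.11 jars listed above):

```scala
// sbt build definition: pin spray-json to 1.3.5
libraryDependencies += "io.spray" %% "spray-json" % "1.3.5"
```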

samminen commented 3 years ago

I still see the same issue after upgrading to spray-json 1.3.5. There is just one partition, using one core and one executor in the cluster. This is a big issue when we try to read millions of records from MemSQL and create a Hive table.

Kindly check and provide an update.

blinov-ivan commented 3 years ago

Hi @samminen,

> I still see same issue after upgrade spray-json-1.3.5

Do you mean this issue?

```
java.lang.NoSuchMethodError: spray.json.package$.enrichString(Ljava/lang/String;)Lspray/json/RichString;
	at com.memsql.spark.MemsqlQueryHelpers$.GetPartitions(MemsqlQueryHelpers.scala:40)
	at com.memsql.spark.MemsqlRDD.getPartitions(MemsqlRDD.scala:22)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
```

samminen commented 3 years ago

No. The spray-json issue is resolved. But with enableParallelRead, I still see just one partition from the memsql Spark connector read operation.

Has this worked for anyone?

blinov-ivan commented 3 years ago

@samminen , could you please provide some examples of your code so we could investigate it and propose a working solution for you?

blinov-ivan commented 3 years ago

@samminen, as there's been no response from you, I'll close this issue. Please feel free to reopen it if you have any additional questions.