Hi @samminen,
you could try enabling the `enableParallelRead` option; this allows Spark to read directly from the leaf nodes and increases your read performance.
Try enabling it like this:
```scala
val DF = spark.read
  .format("memsql")
  .option("ddlEndpoint", "url value")
  .option("user", "username")
  .option("password", "pwd")
  .option("enableParallelRead", "true")
  .load("memsql table name")
```
Please also read the explanation and notes about the `enableParallelRead` option in our docs.
Feel free to ask any other questions if that doesn't help.
P.S. Did you reassign your `df` after repartitioning? `df.repartition(50)` returns a new Dataset, so you need to reassign the result to work with the repartitioned dataset properly; see the sketch below.
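For illustration, a minimal sketch of the repartition pitfall (the `df` name is taken from the discussion above; the partition count is just an example):

```scala
// No effect on df itself: repartition() returns a new Dataset
// and leaves the original unchanged.
df.repartition(50)

// Correct: capture the repartitioned Dataset and use it from here on.
val repartitionedDf = df.repartition(50)
println(repartitionedDf.rdd.getNumPartitions) // prints 50
```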
Thank you for the update. When I try this option, I see the error below:
```
java.lang.NoSuchMethodError: spray.json.package$.enrichString(Ljava/lang/String;)Lspray/json/RichString;
  at com.memsql.spark.MemsqlQueryHelpers$.GetPartitions(MemsqlQueryHelpers.scala:40)
  at com.memsql.spark.MemsqlRDD.getPartitions(MemsqlRDD.scala:22)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
```
Spark connector jar: memsql-spark-connector_2.11-3.0.5-spark-2.3.4.jar
spray-json jar: spray-json_2.11-1.3.2.jar
MariaDB jar: mariadb-java-client-2.7.1.jar
If I remove the `enableParallelRead` option, I no longer see the error above. How can I use this feature to read in parallel? What is missing?
@samminen,
we are using spray-json version 1.3.5; it seems your jar version is lower. Please use version 1.3.5 or newer.
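For example, if you build with sbt, pinning the newer version might look like the sketch below (coordinates are assumed from the jar names above; adjust for your build tool and Scala version):

```scala
// build.sbt: use spray-json 1.3.5 so it overrides the older 1.3.2 jar
libraryDependencies ++= Seq(
  "com.memsql" %% "memsql-spark-connector" % "3.0.5-spark-2.3.4",
  "io.spray"   %% "spray-json"             % "1.3.5"
)
```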
I still see the same issue after upgrading to spray-json-1.3.5. The DataFrame is still a single partition, using one core and one executor in the cluster. This is a big issue when we try to read millions of records from MemSQL and create a Hive table.
Kindly check and provide an update.
Hi @samminen,
> I still see the same issue after upgrading to spray-json-1.3.5

you mean this issue?
```
java.lang.NoSuchMethodError: spray.json.package$.enrichString(Ljava/lang/String;)Lspray/json/RichString;
  at com.memsql.spark.MemsqlQueryHelpers$.GetPartitions(MemsqlQueryHelpers.scala:40)
  at com.memsql.spark.MemsqlRDD.getPartitions(MemsqlRDD.scala:22)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
```
No, the spray-json issue is resolved. But even with `enableParallelRead`, I still see just one partition from the MemSQL Spark connector read operation.
Has this worked for anyone?
@samminen , could you please provide some examples of your code so we could investigate it and propose a working solution for you?
@samminen, as there's no response from you, I'll close this issue. Please feel free to reopen it if you have any additional questions.
Hello Team, here is my use case with the MemSQL Spark connector READ operation:
The actual issue is that spark.read always creates the DataFrame as a single partition, and because of that the job cannot use the available cores and executors.
I tried `DF.repartition(50)` and then applied the write operation, but no luck.
How can we make the MemSQL `spark.read()` API create multiple partitions in the DataFrame, so that our job uses cluster resources effectively?
Because of this issue, it took 28 hours to read 100mn records from the MemSQL table and write them to a Hive external table.
Any solution?
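For reference, a minimal sketch to confirm how many partitions the read actually produces (endpoint, credentials, and table name are placeholders):

```scala
val df = spark.read
  .format("memsql")
  .option("ddlEndpoint", "memsql-host:3306") // placeholder
  .option("user", "username")
  .option("password", "pwd")
  .option("enableParallelRead", "true")
  .load("db.table")

// With parallel read working, this should report one Spark partition
// per MemSQL database partition rather than 1.
println(s"partitions: ${df.rdd.getNumPartitions}")
```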