memsql / singlestore-spark-connector

A connector for SingleStore and Spark
Apache License 2.0

control max number of concurrent connections on save #61

Closed bolcman closed 4 years ago

bolcman commented 4 years ago

Hello,

Is there a way to control the maximum number of concurrent connections from the job side on a save action?

The use case is the following:

The computation we are doing is pretty intensive, but the number of output rows is not that large. We are now able to increase parallelism in the Spark cluster by increasing minNumPostShufflePartitions.

On the other side, though, I now see a lot of pending connections to MemSQL:

LOAD DATA LOCAL INFILE '###.lz4' INTO TABLE.... ...

In theory we could create a user queue in the MemSQL cluster, but I wanted to see if there is a way to control this from the connector side.

I tried increasing insertBatchSize to a large number, but that didn't help.

thanks, Aleks

AdalbertMemSQL commented 4 years ago

Hello, Aleks.

The connector runs one LOAD DATA query per DataFrame partition, so the number of concurrent queries can be controlled by repartitioning the DataFrame before saving.
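A minimal PySpark-style sketch of that idea, assuming the write goes through the standard DataFrame writer with the connector's "memsql" format name (newer connector releases use a different name, so it is left as a parameter). The helper names `capped_partition_count` and `save_with_connection_cap` are hypothetical and not part of the connector.

```python
def capped_partition_count(current_partitions, max_connections):
    """Return the partition count to write with, never above max_connections."""
    if current_partitions < 1 or max_connections < 1:
        raise ValueError("partition and connection counts must be positive")
    return min(current_partitions, max_connections)


def save_with_connection_cap(df, table, max_connections, source_format="memsql"):
    """Repartition df so that at most max_connections partitions are written.

    Assumption: the connector issues one LOAD DATA query per partition,
    as described in this thread. `source_format` is an assumption too;
    check the format name for your connector version.
    """
    current = df.rdd.getNumPartitions()
    target = capped_partition_count(current, max_connections)
    if target < current:
        # repartition() adds a shuffle, so the expensive upstream stage keeps
        # its high parallelism and only the write stage is narrowed.
        # coalesce() would avoid the shuffle but can also narrow the
        # upstream stage, which defeats the purpose here.
        df = df.repartition(target)
    df.write.format(source_format).mode("append").save(table)
    return target
```

For example, `save_with_connection_cap(result_df, "db.table", 8)` would write with at most 8 partitions, where `result_df` and `"db.table"` are placeholders for your own DataFrame and target table.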

carlsverre commented 4 years ago

Closing this for now - please reopen if you are unable to control write parallelism by managing the number of partitions. Spark has a repartition operation you can use to change the number of partitions before doing the write to MemSQL.

You can also increase the number of child aggregators in your cluster and list them in the dmlEndpoints configuration. We will load-balance write operations across all of the listed aggregators.
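As a rough sketch of what that configuration could look like, the helper below builds the Spark settings as a plain dictionary. The `spark.datasource.memsql` prefix is an assumption based on the connector release of that time (later releases use a different prefix), and the helper name and host names are hypothetical placeholders.

```python
def dml_endpoint_conf(ddl_endpoint, child_aggregators,
                      prefix="spark.datasource.memsql"):
    """Build connector settings that spread writes over several aggregators.

    Assumptions: `prefix` matches your connector version, and dmlEndpoints
    takes a comma-separated list of host[:port] entries.
    """
    if not child_aggregators:
        raise ValueError("list at least one aggregator for dmlEndpoints")
    return {
        f"{prefix}.ddlEndpoint": ddl_endpoint,
        f"{prefix}.dmlEndpoints": ",".join(child_aggregators),
    }


# Hypothetical host names, for illustration only.
conf = dml_endpoint_conf(
    "master-agg.example.internal",
    ["child-agg-1.example.internal:3306", "child-agg-2.example.internal:3306"],
)
```

Each entry could then be applied to a running session with `spark.conf.set(key, value)`. Note that this spreads the load across aggregators; it does not reduce the total number of connections the way repartitioning does.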