Closed bolcman closed 4 years ago
Hello, Aleks.
The connector runs one LOAD DATA query per DataFrame partition. You can control the number of queries by repartitioning the DataFrame before saving.
Closing this for now - please reopen if you are unable to control write parallelism by managing the number of partitions. Spark has a repartition operation you can use to change the number of partitions before doing the write to MemSQL.
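For illustration, here is a minimal PySpark-style sketch of capping the partition count right before the write. The helper name `cap_write_partitions` and the limit of 8 are hypothetical. The `"memsql"` format name and the `"db.table"` target in the usage comment are assumptions about your connector version and schema, so adjust them to your setup.

```python
def cap_write_partitions(df, max_connections):
    """Return df with at most max_connections partitions.

    Since the connector issues one LOAD DATA query per partition,
    the partition count at write time bounds the number of
    concurrent LOAD DATA connections.
    """
    if max_connections < 1:
        raise ValueError("max_connections must be at least 1")
    current = df.rdd.getNumPartitions()
    if current <= max_connections:
        # Already within the limit, nothing to do.
        return df
    # repartition() adds a shuffle boundary, so the expensive upstream
    # computation keeps its own (higher) parallelism.
    return df.repartition(max_connections)


# Hypothetical usage (format name and table are assumptions):
# capped = cap_write_partitions(result_df, 8)
# capped.write.format("memsql").mode("append").save("db.table")
```

`repartition` is used here rather than `coalesce` because `coalesce` avoids a shuffle and can therefore reduce the parallelism of the stage that produces the data, which is the opposite of what a compute-heavy job with small output wants.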
You can also increase the number of child aggregators in your cluster and list them in the dmlEndpoints configuration. We will load balance the write operation over all of the listed aggregators.
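As a rough sketch of the configuration side, the helper below builds the write options with several aggregators listed. The helper name `memsql_write_options` and the host names are hypothetical. The `ddlEndpoint` option key and the comma-separated `host:port` format are assumptions based on the connector's usual configuration, so check them against the version you run.

```python
def memsql_write_options(ddl_endpoint, dml_endpoints):
    """Build connector options listing every aggregator to write through."""
    if not dml_endpoints:
        raise ValueError("at least one DML endpoint is required")
    return {
        # Assumed key: the master aggregator used for DDL.
        "ddlEndpoint": ddl_endpoint,
        # Assumed format: comma-separated host:port pairs.
        "dmlEndpoints": ",".join(dml_endpoints),
    }


opts = memsql_write_options(
    "master-agg:3306",                       # hypothetical host
    ["child-agg-1:3306", "child-agg-2:3306"],  # hypothetical hosts
)
# Hypothetical usage with a DataFrame named result_df:
# result_df.write.format("memsql").options(**opts).mode("append").save("db.table")
```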
Hello,
Is there a way to control the maximum number of concurrent connections from the job side on a save action?
The use case is the following:
The computation we are doing is pretty intensive, but the number of output rows is not that big. We are now able to increase parallelism in the Spark cluster by increasing minNumPostShufflePartitions.
But on the other side, I now see a lot of pending connections to MemSQL:
LOAD DATA LOCAL INFILE '###.lz4' INTO TABLE.... ...
Now, in theory we can create a user queue in the MemSQL cluster, but I was trying to see if there is a way to control this from the connector side.
I tried increasing insertBatchSize to some big number, but that didn't help.
Thanks, Aleks