pingcap / tispark

TiSpark is built for running Apache Spark on top of TiDB/TiKV
Apache License 2.0

TiSpark: read data from TiDB to HDFS #908

Closed wangdabin1216 closed 5 years ago

wangdabin1216 commented 5 years ago

I am trying to use TiSpark instead of Sqoop to export data from TiDB to HDFS. Is there an equivalent of Sqoop's split-by option, and if so, how do I control it? Thank you.

marsishandsome commented 5 years ago

As far as I know, Spark has no split-by option. Instead, you can use coalesce; see https://medium.com/@mrpowers/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4

wangdabin1216 commented 5 years ago

@marsishandsome

I tried coalesce with 32 partitions, but that only controls the partitioning of the program's output. My goal is to read the data evenly from TiDB, and I am worried about data skew or OOM problems on the read side.
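To illustrate the concern, here is a minimal sketch in plain Python (not the Spark or TiSpark API) of why coalescing does not rebalance data that was already read unevenly. The functions `coalesce`, `repartition`, and `sizes` are hypothetical stand-ins written for this example: `coalesce` merges neighbouring partitions without moving rows between groups, while `repartition` models a full shuffle as a simple round-robin deal. Real Spark behaviour is more involved; this is only a simplified model.

```python
from typing import List


def coalesce(partitions: List[List[int]], n: int) -> List[List[int]]:
    """Merge neighbouring partitions into at most n groups (no shuffle model)."""
    total = len(partitions)
    if n >= total:
        # Coalescing never increases the number of partitions.
        return [list(p) for p in partitions]
    merged = []
    for i in range(n):
        # Each output group takes a contiguous range of input partitions.
        start = i * total // n
        end = (i + 1) * total // n
        group: List[int] = []
        for p in partitions[start:end]:
            group.extend(p)
        merged.append(group)
    return merged


def repartition(partitions: List[List[int]], n: int) -> List[List[int]]:
    """Deal every row round-robin into n partitions (full shuffle model)."""
    out: List[List[int]] = [[] for _ in range(n)]
    i = 0
    for p in partitions:
        for row in p:
            out[i % n].append(row)
            i += 1
    return out


def sizes(partitions: List[List[int]]) -> List[int]:
    """Return the row count of each partition."""
    return [len(p) for p in partitions]


# Hypothetical skewed read: the first of four partitions holds 90 of 100 rows.
skewed = [
    list(range(0, 90)),
    list(range(90, 94)),
    list(range(94, 97)),
    list(range(97, 100)),
]

print(sizes(coalesce(skewed, 2)))     # [94, 6]
print(sizes(repartition(skewed, 2)))  # [50, 50]
```

In this model the skew survives coalescing and disappears only after a full redistribution, which is why the partitioning chosen at read time matters for the question above.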


marsishandsome commented 5 years ago

TiSpark already handles the data skew problem when reading, so you do not need to do anything extra; just use it.

wangdabin1216 commented 5 years ago

Thanks, I will give it a try.