Closed: kokosing closed this issue 5 years ago
The Spark way looks good. The code it uses to build the partition queries is here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
Notes:
Following this approach we could have configuration (e.g. mysql.properties) that specifies which column to use for partitioning, the number of partitions, and the lower and upper bounds of the partitioning column.
Potentially those things can be determined automatically. Manual configuration should be considered fine-tuning, and things should work out of the box, if possible. Following Sqoop's example, we could use the primary key for partitioning and pull the low/high values from the target DB's statistics (we want to do this anyway, for the CBO).
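For illustration, here is a minimal sketch of looking up the low/high values for a key column with a plain MIN/MAX query, which is how Sqoop determines its split boundaries by default. It runs against an in-memory SQLite database only so that it is runnable; the table `orders`, the column `id`, and the helper `fetch_bounds` are hypothetical names, and a real connector would go through JDBC and might read the values from the database's statistics instead.

```python
import sqlite3


def fetch_bounds(conn, table, column):
    """Return (low, high) for `column`, or (None, None) if the table is empty.

    Identifiers are interpolated directly into the SQL text, so this sketch
    assumes trusted table and column names.
    """
    query = f"SELECT MIN({column}), MAX({column}) FROM {table}"
    return conn.execute(query).fetchone()


# Hypothetical table with an integer primary key, used only for the demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany(
    "INSERT INTO orders (id, total) VALUES (?, ?)",
    [(i, i * 1.5) for i in range(1, 101)],
)

bounds = fetch_bounds(conn, "orders", "id")
print(bounds)
```

A full MIN/MAX scan can be expensive on large tables without a suitable index, which is one reason to prefer statistics where the RDBMS exposes them.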
I was thinking about that too. I am only a bit concerned that pulling the low/high values might not be trivial, depending on the RDBMS. This could be a great extension, but let's start with something simpler and more straightforward.
I have no doubts this needs to be RDBMS-specific. But we need those extension points anyway, quite soon.
Currently, reads from JDBC-based tables use a single connection, which can be slow. However, other data engines are able to read a table in parallel. See below.
In Sqoop - https://sqoop.apache.org/docs/1.4.7/SqoopUserGuide.html
In Spark - http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
Related: - https://www.percona.com/blog/2016/08/17/apache-spark-makes-slow-mysql-queries-10x-faster/
I like the approach Spark uses. Following this approach we could have configuration (e.g. mysql.properties) that specifies which column to use for partitioning, the number of partitions, and the lower and upper bounds of the partitioning column. Reads from that table could then be easily parallelized.
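To make the idea concrete, here is a rough sketch of how the four settings could be turned into one WHERE clause per parallel read. It is loosely modeled on the stride-based splitting in Spark's JDBCRelation, not a copy of it; the function name `partition_predicates` and its parameters are hypothetical, and it assumes an integer partitioning column.

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Build one WHERE-clause predicate per partition.

    Returns [None] when the range cannot be split, meaning a single
    unpartitioned scan. The bounds only shape the strides: rows below
    `lower` or above `upper` still land in the first or last partition.
    """
    # Never create more partitions than there are distinct integer values.
    n = min(num_partitions, upper - lower)
    if n <= 1:
        return [None]

    stride = (upper - lower) // n
    # n - 1 split points separate the n partitions.
    splits = [lower + stride * i for i in range(1, n)]

    # First partition is open-ended below and also picks up NULL keys.
    predicates = [f"{column} < {splits[0]} OR {column} IS NULL"]
    for lo, hi in zip(splits, splits[1:]):
        predicates.append(f"{column} >= {lo} AND {column} < {hi}")
    # Last partition is open-ended above.
    predicates.append(f"{column} >= {splits[-1]}")
    return predicates


preds = partition_predicates("id", 0, 100, 4)
for p in preds:
    print(f"SELECT * FROM orders WHERE {p}")
```

Each predicate would back one split, so the connector could open one connection per split and read them concurrently. Keeping the first and last ranges open-ended means stale or approximate bounds affect only balance, not correctness.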