qbicsoftware / scark-cli

Examine numPartitions #33

Open Zethson opened 5 years ago

Zethson commented 5 years ago

I am not sure whether all executors are doing work on the real network, or just a single one.

I shall investigate the numPartitions parameter of spark.read.jdbc.

More info: https://stackoverflow.com/questions/41085238/what-is-the-meaning-of-partitioncolumn-lowerbound-upperbound-numpartitions-pa

Zethson commented 5 years ago
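        // Partitioned read: Spark splits the range of the numeric column "id" between
        // lowerBound and upperBound into numPartitions strides, one query per partition.
        val dfs = for {
          table <- tables
        } yield (table, spark.read.jdbc(databaseProperties.jdbcURL, table, columnName = "id",
          lowerBound = 1L, upperBound = 100000L, numPartitions = 3, connectionProperties))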

This works, but only when doing a simple query on a single table, for example:

Select * from Consequence

The issue is that not every table may have an id column.
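
For reference, a rough sketch of what the bounds do, as far as I understand the Spark documentation: lowerBound and upperBound only decide the partition stride and do not filter rows, so every id above upperBound ends up in the last partition. The object and function names below are made up for illustration, and the stride arithmetic is a simplification; the exact calculation differs between Spark versions.

```scala
// Simplified sketch, NOT Spark's actual code: shows how a range read on a
// numeric column is turned into one WHERE clause per partition.
object StrideSketch {
  def partitionClauses(column: String, lowerBound: Long, upperBound: Long,
                       numPartitions: Int): Seq[String] = {
    require(numPartitions >= 2, "need at least two partitions to split the range")
    require(upperBound - lowerBound >= numPartitions, "range too small for numPartitions")
    // Width of each partition's slice of the column range.
    val stride = (upperBound - lowerBound) / numPartitions
    // numPartitions - 1 cut points between neighbouring partitions.
    val cuts = (1 until numPartitions).map(i => lowerBound + i * stride)
    (0 until numPartitions).map {
      // First partition is open below and also collects NULLs.
      case 0 => s"$column < ${cuts.head} OR $column IS NULL"
      // Last partition is open above: ids beyond upperBound all land here.
      case i if i == numPartitions - 1 => s"$column >= ${cuts.last}"
      case i => s"$column >= ${cuts(i - 1)} AND $column < ${cuts(i)}"
    }
  }
}

object StrideSketchDemo extends App {
  // Same values as in the read above.
  StrideSketch.partitionClauses("id", 1L, 100000L, 3).foreach(println)
  // id < 33334 OR id IS NULL
  // id >= 33334 AND id < 66667
  // id >= 66667
}
```

So with upperBound=100000 on a table with millions of rows, most of the rows would be read by a single executor, which is why the bounds should roughly match the real minimum and maximum of the column.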

Zethson commented 5 years ago

https://stackoverflow.com/questions/56534189/jdbc-to-spark-dataframe-how-to-ensure-even-partitioning

The last comment there, about even partitioning, may help.

Zethson commented 5 years ago

https://stackoverflow.com/questions/52530171/is-there-a-way-to-define-partitioncolumn-in-option-partitioncolumn-colname

Predicates can be used if there are only string columns to partition by.
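
A minimal sketch of how such predicates could be built, assuming the distinct values of the string column are known up front (for example from a SELECT DISTINCT). The column name "chromosome" and the helper names are hypothetical and not taken from scark-cli. Each predicate becomes one partition, so the predicates must cover every row exactly once; NULLs get their own predicate.

```scala
// Sketch under stated assumptions: builds one "column IN (...)" predicate per
// group of distinct values, plus one predicate for NULLs.
object PredicateSketch {
  // Quote a value as a SQL string literal by doubling single quotes.
  private def quote(v: String): String = "'" + v.replace("'", "''") + "'"

  def inListPredicates(column: String, values: Seq[String],
                       numGroups: Int): Array[String] = {
    require(numGroups >= 1, "need at least one group")
    val distinct = values.distinct.sorted
    // Spread the distinct values over at most numGroups groups.
    val groupSize = math.max(1, math.ceil(distinct.size.toDouble / numGroups).toInt)
    val groups = distinct.grouped(groupSize)
      .map(g => s"$column IN (${g.map(quote).mkString(", ")})")
      .toArray
    // Rows with NULL in the column would otherwise be dropped silently.
    groups :+ s"$column IS NULL"
  }
}

object PredicateSketchDemo extends App {
  val predicates = PredicateSketch.inListPredicates(
    "chromosome", Seq("chr1", "chr2", "chr3", "chrX", "chr1"), numGroups = 2)
  predicates.foreach(println)
  // The array can then be passed to the predicates overload of the reader:
  //   spark.read.jdbc(databaseProperties.jdbcURL, table, predicates, connectionProperties)
}
```

This only balances the number of distinct values per partition, not the number of rows, so partitions can still be skewed if some values are much more frequent than others.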

Zethson commented 5 years ago

Maybe this will help with fetching all primary keys?

https://dzone.com/articles/the-right-way-to-use-spark-and-jdbc