shashir / jdbc2parquet

Use Spark to dump SQL tables into Parquet files.

Sample Configuration #1

Closed cryptickp closed 8 years ago

cryptickp commented 8 years ago

Could you upload an example of what a sample configuration looks like?

shashir commented 8 years ago

I can commit an example, but here is a quick one (this whole job is just a thin wrapper on top of JdbcRDD):

Run the job with:

  spark-submit \
    --class com.imgur.spark.jdbc2parquet.JDBC2Parquet \
    --master $SPARK_MASTER \
    --driver-memory 3072m \
    --executor-memory 3072m \
    --conf "spark.cores.max=20" \
    /path/to/jdbc2parquet/target/jdbc2parquet-0.0.1-jar-with-dependencies.jar \
    -c configs/clickstream_table_etl_config.json

The config file is passed to the job with the -c flag.

E.g., clickstream_table_etl_config.json:

{
  "runLocal" : false,
  "jdbcConnection" : {
    "driverClass" : "com.mysql.jdbc.Driver",
    "connectionPath" : "jdbc:mysql://mysql/db?autoReconnect=true",
    "user" : "user",
    "password" : "password"
  },
  "parquetOutputPath" : "hdfs://hdfs/clickstream",
  "table": "clickstream",
  "partitions" : 10000,
  "indexColumn" : "id",
  "minIndex" : 0,
  "maxIndex" : 30000000000,
  "schema" : [
    {
      "readColumn" : "id",
      "writeType" : "Long"
    },
    {
      "readColumn" : "user_id",
      "writeType" : "String"
    },
    {
      "readColumn" : "item_id",
      "writeType" : "String"
    },
    {
      "readColumn" : "time_stamp",
      "writeType" : "Epoch",
      "writeColumn" : "ts"
    }
  ]
}
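
Roughly, the config fields map onto JdbcRDD as in the sketch below. This is not the actual job source; the object name and the Spark 1.4-era DataFrame calls are illustrative, and the connection details are just the ones from the config above.

  // Illustrative sketch only: shows how the config fields plausibly map onto
  // JdbcRDD. The real job reads these values from the -c JSON file instead.
  import java.sql.{DriverManager, ResultSet}
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.rdd.JdbcRDD
  import org.apache.spark.sql.{Row, SQLContext}
  import org.apache.spark.sql.types._

  object Jdbc2ParquetSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("jdbc2parquet-sketch"))
      val sqlContext = new SQLContext(sc)

      // "partitions"/"indexColumn"/"minIndex"/"maxIndex" drive JdbcRDD's
      // range-partitioned reads: each partition pulls one slice of `id`.
      val rows = new JdbcRDD(
        sc,
        () => {
          Class.forName("com.mysql.jdbc.Driver")  // "driverClass"
          DriverManager.getConnection(            // "jdbcConnection"
            "jdbc:mysql://mysql/db?autoReconnect=true", "user", "password")
        },
        "SELECT id, user_id, item_id, time_stamp FROM clickstream " +
          "WHERE id >= ? AND id <= ?",  // the two ?s are bound per partition
        0L,            // "minIndex"
        30000000000L,  // "maxIndex"
        10000,         // "partitions"
        (rs: ResultSet) => Row(
          rs.getLong("id"),
          rs.getString("user_id"),
          rs.getString("item_id"),
          rs.getTimestamp("time_stamp").getTime))  // "Epoch": epoch millis

      // The "schema" section becomes the Parquet column names and types;
      // note time_stamp is renamed to "ts" via "writeColumn".
      val schema = StructType(Seq(
        StructField("id", LongType),
        StructField("user_id", StringType),
        StructField("item_id", StringType),
        StructField("ts", LongType)))

      sqlContext.createDataFrame(rows, schema)
        .write.parquet("hdfs://hdfs/clickstream")  // "parquetOutputPath"
    }
  }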
shashir commented 8 years ago

If you are using this, you should probably update the pom.xml with your version of Spark. Some things might break in newer versions of Spark, but they should be easy to fix (read up on how Spark and Spark SQL work with Parquet).
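
From memory, the Spark entries in the pom look something like the snippet below (the artifact suffix must match your Scala version, and the version should match your cluster's Spark):

  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version><!-- your cluster's Spark version --></version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version><!-- your cluster's Spark version --></version>
  </dependency>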

cryptickp commented 8 years ago

Sure, thanks. I'll give it a try.

cryptickp commented 8 years ago

@shashir this did indeed work, although I had to make some changes for my cluster. Do you know how to do this for an entire DB, rather than manually specifying the schema for every table? I can enumerate the tables easily enough (see the sketch below); it's the per-table schema that is the manual part.
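
  // Hypothetical helper, not part of jdbc2parquet: lists every table in the
  // DB via JDBC metadata, as a starting point for generating per-table configs.
  import java.sql.DriverManager

  object ListTables {
    def main(args: Array[String]): Unit = {
      Class.forName("com.mysql.jdbc.Driver")
      val conn = DriverManager.getConnection(
        "jdbc:mysql://mysql/db?autoReconnect=true", "user", "password")
      val tables = conn.getMetaData.getTables(null, null, "%", Array("TABLE"))
      while (tables.next()) println(tables.getString("TABLE_NAME"))
      conn.close()
    }
  }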