hortonworks-spark / shc

The Apache Spark - Apache HBase Connector is a library that supports Spark in accessing HBase tables as an external data source or sink.
Apache License 2.0

Numeric Keys are Not Split Properly #330

Open davidov541 opened 4 years ago

davidov541 commented 4 years ago

If you have keys that start exclusively with numeric values, all rows will be put into a single region. That region will be split as necessary based on the settings in HBase, but the other regions will remain empty. For example, if I request 100 regions, it will create 100 regions between "aaaaaa" and "zzzzzz". If I then add 1 million rows with numeric keys, all of them will go into the first region. If I add more rows, then the HBase cluster may decide to split the first region, but the other regions will remain empty.
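The behavior follows from HBase's byte-wise row-key ordering: the ASCII digits '0'..'9' (0x30..0x39) all compare smaller than 'a' (0x61), so every purely numeric key sorts before the first split point of an "aaaaaa".."zzzzzz" pre-split range. A minimal sketch (plain Scala, no HBase dependency; `compareKeys` mimics unsigned byte comparison as HBase does it):

object NumericKeyOrder {
  // Unsigned lexicographic byte comparison, like HBase uses for row keys.
  def compareKeys(a: Array[Byte], b: Array[Byte]): Int = {
    val len = math.min(a.length, b.length)
    var i = 0
    while (i < len) {
      val diff = (a(i) & 0xff) - (b(i) & 0xff)
      if (diff != 0) return diff
      i += 1
    }
    a.length - b.length
  }

  def main(args: Array[String]): Unit = {
    val firstSplit = "aaaaaa".getBytes("UTF-8")
    // Sample keys like the ones the repro below generates.
    Seq("48000000", "48999999", "99999999").foreach { k =>
      val cmp = compareKeys(k.getBytes("UTF-8"), firstSplit)
      // Every numeric key prints "before": all rows land in the first region.
      println(s"$k sorts ${if (cmp < 0) "before" else "after"} the first split key")
    }
  }
}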

I've included source code below that demonstrates this issue. Increasing the number of samples lets you see that new regions get created, but even at 1 million rows you wouldn't expect the minimum and maximum keys to land in the same region if you asked for 100 regions.

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog
import spark.implicits._

val numValues = 1000000
val startValue = 48000000

// Generate purely numeric row keys: "48000000" .. "48999999"
val df = spark.range(startValue, startValue + numValues).select(
  $"id".cast("string") as "pat_id",
  lit("RUL_ID_VALUE") as "rul_id",
  lit("ROW_KEY_VALUE") as "rowkey")

/* Cast DataTypes */

val cir_fitered_data_DF = df.select(
  $"pat_id" as "fullrowkey",
  $"rul_id",
  $"rowkey" as "ROW_KEY")

// Schema to load into HBase perf table

def mbr_cir_load_hbase_catalog =
  s"""{
     |"table":{"namespace": "u405317_sandbox", "name":"PerfTest"},
     |"rowkey":"key",
     |"columns":{
     |"fullrowkey":{"cf":"rowkey", "col":"key", "type":"string"},
     |"rul_id":{"cf":"P1", "col":"rul_id", "type":"string"},
     |"ROW_KEY":{"cf":"P1", "col":"rowkey", "type":"string"}
     |}
     |}""".stripMargin

cir_fitered_data_DF.show(false)

// Request 100 pre-split regions when the table is created
cir_fitered_data_DF.write
  .options(Map(
    HBaseTableCatalog.tableCatalog -> mbr_cir_load_hbase_catalog,
    HBaseTableCatalog.newTable -> "100"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()
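A possible workaround, sketched under an assumption I haven't verified: SHC's HBaseTableCatalog object appears to define regionStart/regionEnd options (with lowercase-letter defaults matching the "aaaaaa".."zzzzzz" range described above), so overriding them to bracket the numeric keyspace should make the pre-split points fall where the data is actually written. The literal option strings below stand in for the HBaseTableCatalog constants and are assumptions, not confirmed API:

object NumericSplitOptions {
  // Sketch of write options for a numeric keyspace. The keys "catalog",
  // "newtable", "regionStart", and "regionEnd" are assumed to match the
  // string constants in SHC's HBaseTableCatalog object.
  def options(catalog: String): Map[String, String] = Map(
    "catalog"     -> catalog,
    "newtable"    -> "100",       // number of pre-split regions
    "regionStart" -> "00000000",  // smallest expected numeric key
    "regionEnd"   -> "99999999"   // largest expected numeric key
  )
}

// Usage with the repro above (hypothetical, untested):
// cir_fitered_data_DF.write
//   .options(NumericSplitOptions.options(mbr_cir_load_hbase_catalog))
//   .format("org.apache.spark.sql.execution.datasources.hbase")
//   .save()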