hortonworks-spark / shc

The Apache Spark - Apache HBase Connector is a library that lets Spark access HBase tables as an external data source or sink.
Apache License 2.0

Dataframe integer columns are not loading properly #254

Open psorari opened 6 years ago

psorari commented 6 years ago

I am using shc "1.1.1-2.1-s_2.11" with Spark 2.3 and scala 2.11.8

While loading a Spark DataFrame into HBase, the integer columns are not stored as readable values and some weird byte sequences show up in HBase. For example, below is `scan 'test'` from the HBase shell (note that the row key is an integer column):

```
\x00\x0FB\xE3     column=td:buysell,        timestamp=1526288094752, value=BUY
\x00\x0FB\xE3     column=td:eoddate,        timestamp=1526288094752, value=20180420
\x00\x0FB\xE3     column=td:exclegalentity, timestamp=1526288094752, value=ICE E - LE
\x00\x0FB\xE3     column=td:intaccbook,     timestamp=1526288094752, value=MARKET
\x00\x0FB\xE3     column=td:intaccountid,   timestamp=1526288094752, value=\x00\x00\x5Cb
\x00\x0FB\xE3     column=td:trader,         timestamp=1526288094752, value=uk40154
\x00\x0F\x9B\xF4  column=td:buysell,        timestamp=1526288094752, value=BUY
\x00\x0F\x9B\xF4  column=td:eoddate,        timestamp=1526288094752, value=20180321
\x00\x0F\x9B\xF4  column=td:exclegalentity, timestamp=1526288094752, value=ICE E - LE
\x00\x0F\x9B\xF4  column=td:intaccbook,     timestamp=1526288094752, value=FUTURES
\x00\x0F\x9B\xF4  column=td:intaccountid,   timestamp=1526288094752, value=\x00\x0C\x89G
\x00\x0F\x9B\xF4  column=td:trader,         timestamp=1526288094752, value=uk40240
\x00\x10\x0E\x1D  column=td:buysell,        timestamp=1526288094752, value=BUY
\x00\x10\x0E\x1D  column=td:eoddate,        timestamp=1526288094752, value=20180328
\x00\x10\x0E\x1D  column=td:exclegalentity, timestamp=1526288094752, value=ICE E - LE
\x00\x10\x0E\x1D  column=td:intaccbook,     timestamp=1526288094752, value=FUTURES
```

While they should have been:

```
ROW      COLUMN+CELL
1000163  column=td:buysell,        timestamp=1526290263982, value=BUY
1000163  column=td:eoddate,        timestamp=1526290263982, value=20180420
1000163  column=td:exclegalentity, timestamp=1526290263982, value=ICE E - LE
1000163  column=td:intaccbook,     timestamp=1526290263982, value=MARKET
1000163  column=td:intaccountid,   timestamp=1526290263982, value=23650
1000163  column=td:trader,         timestamp=1526290263982, value=uk40154
1022964  column=td:buysell,        timestamp=1526290263982, value=BUY
1022964  column=td:eoddate,        timestamp=1526290263982, value=20180321
1022964  column=td:exclegalentity, timestamp=1526290263982, value=ICE E - LE
1022964  column=td:intaccbook,     timestamp=1526290263982, value=FUTURES
1022964  column=td:intaccountid,   timestamp=1526290263982, value=821575
```
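For what it's worth, the "weird" bytes are a deterministic encoding of the expected integers: a 4-byte big-endian `int`, which is what HBase's `Bytes.toBytes(int)` produces (the shell shows printable bytes as ASCII, e.g. `B` is `0x42` and `b` is `0x62`). A standalone Python sketch, independent of SHC, confirming the scanned bytes decode to the expected values:

```python
import struct

# Row keys from the scan are 4-byte big-endian ints.
assert struct.unpack(">i", b"\x00\x0F\x42\xE3")[0] == 1000163  # shown as \x00\x0FB\xE3
assert struct.unpack(">i", b"\x00\x0F\x9B\xF4")[0] == 1022964

# Integer column values are encoded the same way.
assert struct.unpack(">i", b"\x00\x00\x5C\x62")[0] == 23650    # td:intaccountid, shown as \x00\x00\x5Cb
assert struct.unpack(">i", b"\x00\x0C\x89\x47")[0] == 821575   # td:intaccountid, shown as \x00\x0C\x89G
```

So the data is stored correctly; it is only the shell's default rendering that makes it look wrong.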

Below is the table catalog:

```scala
def catalog = s"""{
    |"table":{"namespace":"default", "name":"test", "tableCoder":"PrimitiveType"},
    |"rowkey":"tradeid",
    |"columns":{
      |"tradeid":{"cf":"rowkey", "col":"tradeid", "type":"int"},
      |"eoddate":{"cf":"td", "col":"eoddate", "type":"int"},
      |"buysell":{"cf":"td", "col":"buysell", "type":"string"},
      |"trader":{"cf":"td", "col":"trader", "type":"string"},
      |"exclegalentity":{"cf":"td", "col":"exclegalentity", "type":"string"},
      |"intaccountid":{"cf":"td", "col":"intaccountid", "type":"int"},
      |"intaccbook":{"cf":"td", "col":"intaccbook", "type":"string"}
    |}
  |}""".stripMargin
```

and below is the DataFrame schema:

```
root
 |-- tradeid: integer (nullable = true)
 |-- eoddate: string (nullable = true)
 |-- buysell: string (nullable = true)
 |-- trader: string (nullable = true)
 |-- exclegalentity: string (nullable = true)
 |-- intaccountid: integer (nullable = true)
 |-- intaccbook: string (nullable = true)
```

And I am simply writing it using the method below, as explained in the examples:

```scala
df.write
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()
```

When all columns are declared as string, the values show up correctly. Could someone please point out where I am going wrong?

Many Thanks

sbarnoud commented 6 years ago

Everything seems correct. Just specify a formatter in your HBase scan to print the integer values as integers instead of raw bytes.
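For example, the HBase shell lets you append a formatter to a column spec; the short name resolves to a method of `org.apache.hadoop.hbase.util.Bytes`. Using the table and column from this issue (note this formats column values only, not the binary row key):

```
hbase> scan 'test', {COLUMNS => ['td:intaccountid:toInt']}
```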

vineelavelagapudi commented 6 years ago

Is there any option to keep integer values as integers while writing a DataFrame to HBase through PySpark? By default, when writing a DataFrame to HBase, integer values are converted to bytes in the HBase table.

Below is the code:

```python
catalog2 = {
    "table": {"namespace": "default", "name": "trip_test1"},
    "rowkey": "key1",
    "columns": {
        "serial_no": {"cf": "rowkey", "col": "key1", "type": "string"},
        "payment_type": {"cf": "sales", "col": "payment_type", "type": "string"},
        "fare_amount": {"cf": "sales", "col": "fare_amount", "type": "string"},
        "surcharge": {"cf": "sales", "col": "surcharge", "type": "string"},
        "mta_tax": {"cf": "sales", "col": "mta_tax", "type": "string"},
        "tip_amount": {"cf": "sales", "col": "tip_amount", "type": "string"},
        "tolls_amount": {"cf": "sales", "col": "tolls_amount", "type": "string"},
        "total_amount": {"cf": "sales", "col": "total_amount", "type": "string"}
    }
}

import json
cat2 = json.dumps(catalog2)

df.write.option("catalog", cat2).option("newtable", "5") \
    .format("org.apache.spark.sql.execution.datasources.hbase").save()
```

Output:

```
\x00\x00\x03\xE7  column=sales:payment_type, timestamp=1529495930994, value=CSH
\x00\x00\x03\xE7  column=sales:surcharge,    timestamp=1529495930994, value=\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x03\xE7  column=sales:tip_amount,   timestamp=1529495930994, value=\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x03\xE7  column=sales:tolls_amount, timestamp=1529495930994, value=\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x03\xE7  column=sales:total_amount, timestamp=1529495930994, value=@!\x00\x00\x00\x00\x00\x00
\x00\x00\x03\xE8  column=sales:fare_amount,  timestamp=1529495930994, value=@\x18\x00\x00\x00\x00\x00\x00
\x00\x00\x03\xE8  column=sales:mta_tax,      timestamp=1529495930994, value=?\xE0\x00\x00\x00\x00\x00\x00
```

Expected output:

```
999  column=sales:fare_amount,  timestamp=1529392479358, value=8.0
999  column=sales:mta_tax,      timestamp=1529392479358, value=0.5
999  column=sales:payment_type, timestamp=1529392479358, value=CSH
999  column=sales:surcharge,    timestamp=1529392479358, value=0.0
999  column=sales:tip_amount,   timestamp=1529392479358, value=0.0
999  column=sales:tolls_amount, timestamp=1529392479358, value=0.0
999  column=sales:total_amount, timestamp=1529392479358, value=8.5
```
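As in the earlier comment, these bytes are the expected values in HBase's native binary encoding: the row key is a 4-byte big-endian int, and the numeric columns are 8-byte big-endian IEEE-754 doubles (the shell shows printable bytes as ASCII, e.g. `@` is `0x40`, `!` is `0x21`, `?` is `0x3F`). A standalone Python sketch decoding the scanned bytes:

```python
import struct

# Row key \x00\x00\x03\xE7 is the 4-byte big-endian int 999.
assert struct.unpack(">i", b"\x00\x00\x03\xE7")[0] == 999

# Column values are 8-byte big-endian doubles.
assert struct.unpack(">d", b"\x40\x21\x00\x00\x00\x00\x00\x00")[0] == 8.5  # sales:total_amount, shown as @!...
assert struct.unpack(">d", b"\x3F\xE0\x00\x00\x00\x00\x00\x00")[0] == 0.5  # sales:mta_tax, shown as ?\xE0...
```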

srini3463 commented 6 years ago

Hello @vineelavelagapudi @psorari, I am also getting the same output after loading the data into HBase: all int values are showing in byte format. Could you please let me know if you have found a solution?

jyothirmai2309 commented 6 years ago

Just convert each column's datatype to string before writing:

```python
from pyspark.sql.functions import col

for i in df.columns:
    df = df.withColumn(i, col(i).cast("string"))
```
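Casting works because string columns are written as UTF-8 text, which the HBase shell prints as-is, instead of the fixed-width binary encoding. A standalone sketch of the difference (plain Python, no Spark needed):

```python
import struct

value = 8.5
# Written as a typed double: 8 opaque bytes the shell renders in hex.
assert struct.pack(">d", value) == b"\x40\x21\x00\x00\x00\x00\x00\x00"
# Written after casting to string: readable UTF-8 bytes.
assert str(value).encode("utf-8") == b"8.5"
```

The trade-off is that numeric row keys and values then sort and compare lexicographically rather than numerically.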