Open psorari opened 6 years ago
Everything seems correct. Just specify a formatter in your HBase scan to print integer values as integers rather than raw bytes.
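For what it's worth, the HBase shell lets you attach a per-column formatter so the scan itself decodes binary-encoded numbers. A sketch (not tested here; table and column names are taken from the question below, and the formatter method must match the actual encoding):

```
hbase> scan 'trip_test1', {COLUMNS => ['sales:payment_type',
         'sales:total_amount:c(org.apache.hadoop.hbase.util.Bytes).toDouble']}
```

The `cf:qualifier:format` suffix accepts either a `toXxx` shortcut or the fully qualified `c(Class).method` form shown here.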
Is there an option to keep integer values as integers while writing a DataFrame to HBase through PySpark? By default, when writing a DataFrame to HBase, integer values are converted to byte representations in the HBase table.
Below is the code:

```python
catalog2 = {
    "table": {"namespace": "default", "name": "trip_test1"},
    "rowkey": "key1",
    "columns": {
        "serial_no": {"cf": "rowkey", "col": "key1", "type": "string"},
        "payment_type": {"cf": "sales", "col": "payment_type", "type": "string"},
        "fare_amount": {"cf": "sales", "col": "fare_amount", "type": "string"},
        "surcharge": {"cf": "sales", "col": "surcharge", "type": "string"},
        "mta_tax": {"cf": "sales", "col": "mta_tax", "type": "string"},
        "tip_amount": {"cf": "sales", "col": "tip_amount", "type": "string"},
        "tolls_amount": {"cf": "sales", "col": "tolls_amount", "type": "string"},
        "total_amount": {"cf": "sales", "col": "total_amount", "type": "string"}
    }
}

import json
cat2 = json.dumps(catalog2)

df.write.option("catalog", cat2) \
    .option("newtable", "5") \
    .format("org.apache.spark.sql.execution.datasources.hbase") \
    .save()
```
output:

```
\x00\x00\x03\xE7  column=sales:payment_type, timestamp=1529495930994, value=CSH
\x00\x00\x03\xE7  column=sales:surcharge, timestamp=1529495930994, value=\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x03\xE7  column=sales:tip_amount, timestamp=1529495930994, value=\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x03\xE7  column=sales:tolls_amount, timestamp=1529495930994, value=\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x03\xE7  column=sales:total_amount, timestamp=1529495930994, value=@!\x00\x00\x00\x00\x00\x00
\x00\x00\x03\xE8  column=sales:fare_amount, timestamp=1529495930994, value=@\x18\x00\x00\x00\x00\x00\x00
\x00\x00\x03\xE8  column=sales:mta_tax, timestamp=1529495930994, value=?\xE0\x00\x00\x00\x00\x00\x00
```
expected output:

```
999  column=sales:fare_amount, timestamp=1529392479358, value=8.0
999  column=sales:mta_tax, timestamp=1529392479358, value=0.5
999  column=sales:payment_type, timestamp=1529392479358, value=CSH
999  column=sales:surcharge, timestamp=1529392479358, value=0.0
999  column=sales:tip_amount, timestamp=1529392479358, value=0.0
999  column=sales:tolls_amount, timestamp=1529392479358, value=0.0
999  column=sales:total_amount, timestamp=1529392479358, value=8.5
```
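For the record, the escaped values in the scan above are not corrupted data; they are the big-endian binary encodings that HBase stores for numeric types. A quick check in plain Python (no Spark needed) shows that the bytes in the actual output decode to the numbers in the expected output:

```python
import struct

# Row key \x00\x00\x03\xE7 is a 4-byte big-endian int: 0x3E7 == 999
assert struct.unpack(">i", b"\x00\x00\x03\xE7")[0] == 999

# Numeric cells are 8-byte IEEE-754 doubles.
# total_amount: @!\x00... is 0x4021000000000000, i.e. 8.5
assert struct.unpack(">d", b"\x40\x21\x00\x00\x00\x00\x00\x00")[0] == 8.5
# mta_tax: ?\xE0\x00... is 0x3FE0000000000000, i.e. 0.5
assert struct.unpack(">d", b"\x3F\xE0\x00\x00\x00\x00\x00\x00")[0] == 0.5
```

So the data round-trips correctly; only the shell's rendering differs.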
Hello, @vineelavelagapudi @psorari I am also getting the same output after loading the data into Hbase. All Int values are showing in Bytes format. Could you please let me know if you find out any solution.
Just convert the datatypes to string:

```python
from pyspark.sql.functions import col

for i in df.columns:
    df = df.withColumn(i, col(i).cast("string"))
```
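Why this works: once a column is a string, the writer stores the UTF-8 text of each value, which the HBase shell prints verbatim, instead of the 8-byte binary encoding. A plain-Python illustration of the difference (stdlib only, no Spark):

```python
import struct

fare = 8.5

# What a double-typed column stores: 8 raw IEEE-754 bytes
binary = struct.pack(">d", fare)
assert binary == b"\x40\x21\x00\x00\x00\x00\x00\x00"  # shell renders this as @!\x00\x00\x00\x00\x00\x00

# What a string-typed column stores: readable UTF-8 text
text = str(fare).encode("utf-8")
assert text == b"8.5"
```

The trade-off is that values stored as strings lose numeric sort order and compactness in HBase.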
I am using shc 1.1.1-2.1-s_2.11 with Spark 2.3 and Scala 2.11.8.
While loading a Spark DataFrame into HBase, integer columns are not loading properly and some weird values show up in HBase. For example, below is the output of `scan 'test'` from the HBase shell (note that the row key is an integer type):

```
\x00\x0FB\xE3     column=td:buysell, timestamp=1526288094752, value=BUY
\x00\x0FB\xE3     column=td:eoddate, timestamp=1526288094752, value=20180420
\x00\x0FB\xE3     column=td:exclegalentity, timestamp=1526288094752, value=ICE E - LE
\x00\x0FB\xE3     column=td:intaccbook, timestamp=1526288094752, value=MARKET
\x00\x0FB\xE3     column=td:intaccountid, timestamp=1526288094752, value=\x00\x00\x5Cb
\x00\x0FB\xE3     column=td:trader, timestamp=1526288094752, value=uk40154
\x00\x0F\x9B\xF4  column=td:buysell, timestamp=1526288094752, value=BUY
\x00\x0F\x9B\xF4  column=td:eoddate, timestamp=1526288094752, value=20180321
\x00\x0F\x9B\xF4  column=td:exclegalentity, timestamp=1526288094752, value=ICE E - LE
\x00\x0F\x9B\xF4  column=td:intaccbook, timestamp=1526288094752, value=FUTURES
\x00\x0F\x9B\xF4  column=td:intaccountid, timestamp=1526288094752, value=\x00\x0C\x89G
\x00\x0F\x9B\xF4  column=td:trader, timestamp=1526288094752, value=uk40240
\x00\x10\x0E\x1D  column=td:buysell, timestamp=1526288094752, value=BUY
\x00\x10\x0E\x1D  column=td:eoddate, timestamp=1526288094752, value=20180328
\x00\x10\x0E\x1D  column=td:exclegalentity, timestamp=1526288094752, value=ICE E - LE
\x00\x10\x0E\x1D  column=td:intaccbook, timestamp=1526288094752, value=FUTURES
```
While they should have been:

```
ROW       COLUMN+CELL
1000163   column=td:buysell, timestamp=1526290263982, value=BUY
1000163   column=td:eoddate, timestamp=1526290263982, value=20180420
1000163   column=td:exclegalentity, timestamp=1526290263982, value=ICE E - LE
1000163   column=td:intaccbook, timestamp=1526290263982, value=MARKET
1000163   column=td:intaccountid, timestamp=1526290263982, value=23650
1000163   column=td:trader, timestamp=1526290263982, value=uk40154
1022964   column=td:buysell, timestamp=1526290263982, value=BUY
1022964   column=td:eoddate, timestamp=1526290263982, value=20180321
1022964   column=td:exclegalentity, timestamp=1526290263982, value=ICE E - LE
1022964   column=td:intaccbook, timestamp=1526290263982, value=FUTURES
1022964   column=td:intaccountid, timestamp=1526290263982, value=821575
```
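Note that the two scans actually describe the same rows: the first one just shows the 4-byte big-endian encoding that HBase uses for int values. Decoding a couple of them in plain Python (used here only to avoid spinning up Spark; remember `B` is byte 0x42 and `G` is 0x47) confirms the match:

```python
import struct

# Row key \x00\x0FB\xE3 -> 0x000F42E3 -> 1000163
assert struct.unpack(">i", b"\x00\x0f\x42\xe3")[0] == 1000163

# intaccountid \x00\x0C\x89G -> 0x000C8947 -> 821575
assert struct.unpack(">i", b"\x00\x0c\x89\x47")[0] == 821575
```

So the write itself is lossless; the shell simply has no schema and prints non-printable bytes escaped.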
Below is the table catalog:

```scala
def catalog = s"""{
  |"table":{"namespace":"default", "name":"test", "tableCoder":"PrimitiveType"},
  |"rowkey":"tradeid",
  |"columns":{
  |"tradeid":{"cf":"rowkey", "col":"tradeid", "type":"int"},
  |"eoddate":{"cf":"td", "col":"eoddate", "type":"int"},
  |"buysell":{"cf":"td", "col":"buysell", "type":"string"},
  |"trader":{"cf":"td", "col":"trader", "type":"string"},
  |"exclegalentity":{"cf":"td", "col":"exclegalentity", "type":"string"},
  |"intaccountid":{"cf":"td", "col":"intaccountid", "type":"int"},
  |"intaccbook":{"cf":"td", "col":"intaccbook", "type":"string"}
  |}
  |}""".stripMargin
```

Below is the DataFrame schema:

```
root
 |-- tradeid: integer (nullable = true)
 |-- eoddate: string (nullable = true)
 |-- buysell: string (nullable = true)
 |-- trader: string (nullable = true)
 |-- exclegalentity: string (nullable = true)
 |-- intaccountid: integer (nullable = true)
 |-- intaccbook: string (nullable = true)
```
And I am simply loading it using the method below, as explained in the examples:
```scala
df.write
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()
```
When I declare all columns as string, the values show up correctly. Could someone please point out where I am going wrong?
Many Thanks