hortonworks-spark / shc

The Apache Spark - Apache HBase Connector is a library that lets Spark access HBase tables as an external data source or sink.
Apache License 2.0

Integer values as bytes while Writing dataframe to hbase #265

Closed vineelavelagapudi closed 6 years ago

vineelavelagapudi commented 6 years ago

Is there an option to keep integer values stored as readable integers when writing a DataFrame to HBase through PySpark? By default, integer values are converted to raw bytes in the HBase table.

Below is the code:

import json

catalog2 = {
    "table": {"namespace": "default", "name": "trip_test1"},
    "rowkey": "key1",
    "columns": {
        "serial_no": {"cf": "rowkey", "col": "key1", "type": "string"},
        "payment_type": {"cf": "sales", "col": "payment_type", "type": "string"},
        "fare_amount": {"cf": "sales", "col": "fare_amount", "type": "string"},
        "surcharge": {"cf": "sales", "col": "surcharge", "type": "string"},
        "mta_tax": {"cf": "sales", "col": "mta_tax", "type": "string"},
        "tip_amount": {"cf": "sales", "col": "tip_amount", "type": "string"},
        "tolls_amount": {"cf": "sales", "col": "tolls_amount", "type": "string"},
        "total_amount": {"cf": "sales", "col": "total_amount", "type": "string"}
    }
}

cat2 = json.dumps(catalog2)

df.write.option("catalog", cat2).option("newtable", "5").format("org.apache.spark.sql.execution.datasources.hbase").save()

output:

\x00\x00\x03\xE7 column=sales:payment_type, timestamp=1529495930994, value=CSH
\x00\x00\x03\xE7 column=sales:surcharge, timestamp=1529495930994, value=\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x03\xE7 column=sales:tip_amount, timestamp=1529495930994, value=\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x03\xE7 column=sales:tolls_amount, timestamp=1529495930994, value=\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x03\xE7 column=sales:total_amount, timestamp=1529495930994, value=@!\x00\x00\x00\x00\x00\x00
\x00\x00\x03\xE8 column=sales:fare_amount, timestamp=1529495930994, value=@\x18\x00\x00\x00\x00\x00\x00
\x00\x00\x03\xE8 column=sales:mta_tax, timestamp=1529495930994, value=?\xE0\x00\x00\x00\x00\x00\x00

expected output:

999 column=sales:fare_amount, timestamp=1529392479358, value=8.0
999 column=sales:mta_tax, timestamp=1529392479358, value=0.5
999 column=sales:payment_type, timestamp=1529392479358, value=CSH
999 column=sales:surcharge, timestamp=1529392479358, value=0.0
999 column=sales:tip_amount, timestamp=1529392479358, value=0.0
999 column=sales:tolls_amount, timestamp=1529392479358, value=0.0
999 column=sales:total_amount, timestamp=1529392479358, value=8.5
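The byte strings in the actual output appear to be the binary form of the same numbers rather than corrupted data. The sketch below decodes a few of them with Python's standard `struct` module, assuming the row key is a big-endian 4-byte integer and the amounts are big-endian 8-byte IEEE 754 doubles. The helper names `decode_int` and `decode_double` are illustrative only and are not part of shc.

```python
import struct

# Raw values copied from the HBase shell output above.
row_key = b"\x00\x00\x03\xE7"
total_amount = b"@!\x00\x00\x00\x00\x00\x00"
mta_tax = b"?\xE0\x00\x00\x00\x00\x00\x00"


def decode_int(raw: bytes) -> int:
    """Interpret 4 bytes as a big-endian signed integer (assumed encoding)."""
    return struct.unpack(">i", raw)[0]


def decode_double(raw: bytes) -> float:
    """Interpret 8 bytes as a big-endian IEEE 754 double (assumed encoding)."""
    return struct.unpack(">d", raw)[0]


print(decode_int(row_key))          # 999
print(decode_double(total_amount))  # 8.5
print(decode_double(mta_tax))       # 0.5
```

Under that assumption the data is intact: the HBase shell is displaying numeric bytes rather than text, which is why the values look unreadable.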

jyothirmai2309 commented 6 years ago

You can cast the fields to the string data type in PySpark before writing to HBase.
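One way this suggestion might look in practice is sketched below. It relies only on the standard PySpark DataFrame API (`df.columns`, `Column.cast`, `Column.alias`, `DataFrame.select`). The helper name `cast_all_to_string` is hypothetical, and `df` and `cat2` are assumed to be the DataFrame and catalog JSON from the question.

```python
def cast_all_to_string(df):
    """Return a DataFrame in which every column is cast to string.

    `df` is assumed to be a pyspark.sql.DataFrame; the column names
    are preserved so they still match the catalog.
    """
    return df.select(*[df[c].cast("string").alias(c) for c in df.columns])


# Assumed usage with the objects from the question (not executed here):
# string_df = cast_all_to_string(df)
# string_df.write.option("catalog", cat2).option("newtable", "5") \
#     .format("org.apache.spark.sql.execution.datasources.hbase").save()
```

Casting changes how the values are stored, so anything that reads the table afterwards would need to parse the numbers back from text. String row keys also sort lexicographically in HBase, so "1000" would come before "999".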