hortonworks-spark / spark-llap

Apache License 2.0

'\' in spark dataframe row is treated as metacharacter by hive warehouse connector #265

Open gpjalocha opened 5 years ago

gpjalocha commented 5 years ago

Hello,

As in the title: when I have a Spark DataFrame with a field containing a backslash, then after writing it to Hive the backslash is treated as an escape character. And when a backslash is the last character of a field, the Hive Warehouse Connector reads it together with the following delimiter as '\,', so the comma is escaped and the fields are merged. Example:

Firstly create example hive table:

create database traffic;
create table traffic.example (field_1 string,field_2 string,field_3 string);
alter table traffic.example set tblproperties (escape.delim='\\');

then in PySpark (note the differences between the 'same' Spark and Hive DataFrames):


from pyspark_llap.sql.session import HiveWarehouseSession
hive = HiveWarehouseSession.session(spark).build()
hive.setDatabase('traffic')

l = [('delimiter','test','normal'),
    ('deli\\miter','test','backslash_inside'),
    ('delimi,ter,','te,st','commas'),
    ('delimiter\\','test\\','escape')]

df=spark.createDataFrame(l)
df.show()

#+-----------+-----+----------------+
#|         _1|   _2|              _3|
#+-----------+-----+----------------+
#|  delimiter| test|          normal|
#| deli\miter| test|backslash_inside|
#|delimi,ter,|te,st|          commas|
#| delimiter\|test\|          escape|
#+-----------+-----+----------------+

df.write.format(HiveWarehouseSession().DATAFRAME_TO_STREAM).option('table','example').save()
hive.table('example').show()
#+--------------------+-------+----------------+
#|             field_1|field_2|         field_3|
#+--------------------+-------+----------------+
#|         delimi,ter,|  te,st|          commas|
#|delimiter,test,es...|   null|            null|
#|           delimiter|   test|          normal|
#|           delimiter|   test|backslash_inside|
#+--------------------+-------+----------------+
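The merged row can be reproduced outside Hive with a small simulation (a sketch, not HWC's actual serialization path): if the writer joins raw field values with ',' without escaping a trailing backslash, a reader configured with escape.delim='\\' then interprets each '\,' as a literal comma and collapses the three fields into one, matching the 'delimiter,test,es...' row above. Python's csv module with escapechar stands in for the Hive-side reader here.

```python
import csv
import io

# Fields from the 'escape' row, each ending in a backslash.
row = ['delimiter\\', 'test\\', 'escape']

# Naive write: join with the delimiter, no escaping of the backslash.
serialized = ','.join(row)
# serialized is: delimiter\,test\,escape

# Read it back honoring '\' as the escape character, as a reader
# configured with escape.delim='\\' would.
reader = csv.reader(io.StringIO(serialized),
                    delimiter=',',
                    escapechar='\\',
                    quoting=csv.QUOTE_NONE)
parsed = next(reader)
# The three fields collapse into one: ['delimiter,test,escape']
print(parsed)
```

A possible workaround (untested against HWC) would be to double the backslashes on the Spark side before writing, e.g. with regexp_replace, so the reader's unescaping restores the original value.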