ogrodnek / csv-serde

Hive SerDe for CSV
Apache License 2.0
141 stars 80 forks source link

Specifying nulls #15

Open parautenbach opened 10 years ago

parautenbach commented 10 years ago

Is there a way to indicate a NULL value? The default Hive value is \N (capitalised). When I upload a CSV file containing that value, I simply get the value N.

According the the Hive DDL it is not possible to specify a SerDe and NULL terminator. Is there perhaps a class property for this SerDe that can be set (similar to how separator, quote and escape characters are set)? I could not find such a property in the source code.

It is vital that one can distinguish between NULL and empty values and it would be very helpful if this could be added as a feature.

ogrodnek commented 10 years ago

Hi, there's no property at the moment... I'm wondering how other serdes handle this?....

parautenbach commented 10 years ago

I guess the problem has more to do with an ambiguity within CSVs, rather than a SerDe. For JSON SerDes this won't exist, as there is a clear distinction. Assuming a three column CSV including header:

A,B,C foo,,bar

What does column B imply? A NULL or an empty string, assuming the Hive table's column B is of type STRING? I'm not sure what would be appropriate. If the type is anything but STRING, it becomes obvious, as the absence of a value for the source column B would indicate NULL (e.g. if the type was INT).

My suggestion (and I can't think of a better one right now), would be to interpret an empty value (as above) as an empty string when the Hive type is STRING. If the value is \N for any type, return Java null for that column's value (for non-STRING types an empty value would also imply NULL).

The onus would then be on the user loading data to explicitly encode values where it is desired to have NULL instead of empty string by setting the value to \N. It does, unfortunately, burden the user (consumer) with the internals behind the interface, in this case a specific convention used by Hive, which is seldom a good practice. Then again, the SerDe is very specific to Hive and won't be re-usable elsewhere.

ashrowty commented 8 years ago

I agree with @parautenbach . I have this issue currently. I don't really see the burden .. there needs to be some explicit way to specify what is NULL. Don't mind encoding NULLs as '\N' by default. Maybe this can be an option?

huozhanfeng commented 8 years ago

I meet this problem also.

bchiud commented 7 years ago

+1

prabhacloud commented 6 years ago

+1

sriniprash commented 6 years ago

+1

maximvl commented 6 years ago

+1

lkudyba commented 6 years ago

Hey @ogrodnek - I know it's an old issue but are you by any chance planning to introduce such a change?

ogrodnek commented 6 years ago

Hi @lkudyba ,

This serde code was contributed to Hive back in 2014, and is being maintained there.

See: https://cwiki.apache.org/confluence/display/Hive/CSV+Serde and https://issues.apache.org/jira/browse/HIVE-7777

(specific code is: https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/OpenCSVSerde.java )

As an aside, I am wondering if it's possible to handle this using an if conditional?

i.e. for reading select if(v == '\N', null, v), for writing select if(v == null, '\N', v) ?