woodrad commented 9 years ago

This request is rather large, sorry about that. Before submitting this pull request, I ran tests using Hive versions 0.11.0 through 0.14.0 in the pom--all passed. Note Hive 0.14.0 dependencies do not resolve automatically.

Here is a summary of the small changes I made.

Updated pom to use latest version of opencsv. Before this change version 2.3 from 2011 was being used.
Removed extraneous spaces at the end of lines.
Replaced org.apache.hadoop.hive.serde.Constants and org.apache.hadoop.hive.serde2.SerDe with their current equivalents. Both have been deprecated since Hive 0.11.0.
Added testDeserializeCustomSeparatorCustomEscape(), which shows using the same escape and quote chars does not result in an exception. (Line 77 of CSVSerdeTest). Removed comments around Line 154 of CSVSerde.

Finally, a summary of why I am submitting this pull request in the first place. opencsv does an alright job of managing embedded line breaks in csv files (it strips carriage returns and breaks that are not \n), but using this SerDe with Hive results in NULLs after every row containing a line break. I've included tests and code that will take \n, \r, and \r\n and output them as , , and respectively. I've singled out these breaks because they're the only ones defined in the csv standard.

Let me know what you think. Maybe we can put our heads together and solve things like #18 and #3.

woodrad commented 9 years ago

I took some time to test this using command-line hive on my local machine and learned some more things. As demonstrated by the tests I added, when given multiline Text, the SerDe correctly returns text with newlines stripped. However, the Text is split on line breaks before it ever reaches the SerDe. Check it out.

hive -f test.hive

Logging initialized using configuration in jar:file:/usr/local/Cellar/hive/0.13.1/libexec/lib/hive-common-0.13.1.jar!/hive-log4j.properties
Added ~/csv-serde/target/csv-serde-1.1.2-0.11.0-all.jar to class path
Added resource: ~/csv-serde/target/csv-serde-1.1.2-0.11.0-all.jar
OK
Time taken: 45.311 seconds
OK
I was given: "hello","yes, ok","1","new
I returned: [hello, yes, ok, 1, null]
I was given: line"
I returned: [null, null, null, null]
hello yes, ok 1 NULL
NULL NULL NULL NULL
Time taken: 0.288 seconds, Fetched: 2 row(s)
OK
Time taken: 0.361 seconds
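To make the failure mode in the log concrete, here is a minimal sketch in plain Java that mimics a line-oriented reader cutting the record before any CSV parsing happens. It does not use Hive or opencsv; the class and method names (LineSplitDemo, splitOnNewlines, hasBalancedQuotes) are made up for illustration, and the input string is the record shown in the log.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of csv-serde or Hive.
final class LineSplitDemo {

    // Splits raw input on '\n' with no awareness of CSV quoting,
    // which is roughly what a line-oriented reader does.
    static List<String> splitOnNewlines(String raw) {
        List<String> fragments = new ArrayList<>();
        for (String fragment : raw.split("\n")) {
            fragments.add(fragment);
        }
        return fragments;
    }

    // A fragment with an odd number of quote characters was cut
    // in the middle of a quoted field.
    static boolean hasBalancedQuotes(String fragment) {
        int quotes = 0;
        for (int i = 0; i < fragment.length(); i++) {
            if (fragment.charAt(i) == '"') {
                quotes++;
            }
        }
        return quotes % 2 == 0;
    }

    public static void main(String[] args) {
        // The record from the log: the last field contains a line break.
        String record = "\"hello\",\"yes, ok\",\"1\",\"new\nline\"";
        for (String fragment : splitOnNewlines(record)) {
            System.out.println("fragment: " + fragment
                    + " | balanced quotes: " + hasBalancedQuotes(fragment));
        }
    }
}
```

Both fragments come out with unbalanced quotes, so a parser that sees them one at a time cannot rebuild the original four-column row.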

Is there anything we can do to force Hive to give us the entire csv file as Text?

Supporting gists: test.csv and test.hive.

woodrad commented 9 years ago

Cleaning the deprecated parts is a dupe of #8.

woodrad commented 9 years ago

It looks like this project is dead, so I'll maintain my changes in my fork. I'll leave these notes in closing, however.

SET textinputformat.record.delimiter = 'myDelimiter' makes the mappers Hive spawns split the input into records on that delimiter instead of on line breaks, so a record that spans multiple lines reaches the SerDe in one piece. This SerDe will read the multiline input and return a row separated by \n in the right places.
tblproperties("skip.header.line.count"="1") in the CREATE statement will skip the first row (or the first n rows when set to n), which is great for reading CSV files that include headers.
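As a rough illustration of why these two settings help, the sketch below simulates them in plain Java: it splits input on a custom record delimiter rather than on line breaks, then drops a number of leading records. This is only a simulation under my own assumptions, not how Hive implements either setting; RecordDelimiterDemo and readRecords are hypothetical names, and the sample data is invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class; it imitates the effect of the settings, not Hive's code.
final class RecordDelimiterDemo {

    // Splits raw input on a literal delimiter (not on '\n') and skips
    // the first skipHeaderLines records, similar in spirit to
    // textinputformat.record.delimiter and skip.header.line.count.
    static List<String> readRecords(String raw, String delimiter, int skipHeaderLines) {
        // Pattern.quote treats the delimiter as plain text, not a regex.
        String[] parts = raw.split(Pattern.quote(delimiter));
        List<String> records = new ArrayList<>();
        for (int i = Math.max(0, skipHeaderLines); i < parts.length; i++) {
            records.add(parts[i]);
        }
        return records;
    }

    public static void main(String[] args) {
        String raw = "id,text" + "myDelimiter"
                + "1,\"new\nline\"" + "myDelimiter"
                + "2,plain";
        // The embedded line break stays inside the first data record.
        for (String record : readRecords(raw, "myDelimiter", 1)) {
            System.out.println("record: " + record.replace("\n", "\\n"));
        }
    }
}
```

With the delimiter moved off the line break, the quoted field with the embedded newline arrives whole, which is the input shape this SerDe needs.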

ogrodnek / csv-serde

Use current version of opencsv, update deprecated code, clean lint, and add support for newlines. #24
