ogrodnek / csv-serde

Hive SerDe for CSV
Apache License 2.0
141 stars, 80 forks

Use current version of opencsv, update deprecated code, clean lint, and add support for newlines. #24

Closed woodrad closed 9 years ago

woodrad commented 9 years ago

This request is rather large, sorry about that. Before submitting this pull request, I ran the tests with Hive versions 0.11.0 through 0.14.0 set in the pom, and all passed. Note that the Hive 0.14.0 dependencies do not resolve automatically.

Here is a summary of the small changes I made.

Finally, here is why I am submitting this pull request in the first place. opencsv does an alright job of managing embedded line breaks in CSV files (it strips carriage returns and breaks that are not \n), but using this SerDe with Hive results in NULLs after every row containing a line break. I've included tests and code that will take \n, \r, and \r\n and output them as , , and respectively. I've singled out these breaks because they're the only ones defined in the CSV standard.
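To make the line-break handling concrete, here is a minimal sketch of replacing embedded \r\n, \r, and \n in a single field value. It is not the code from this pull request: the class `LineBreaks` and the method `replaceLineBreaks` are hypothetical names, and the replacement string is a parameter rather than a fixed value. It uses only the Java standard library, not opencsv.

```java
import java.util.regex.Matcher;

// Hypothetical helper for illustration; not part of csv-serde or opencsv.
final class LineBreaks {
    private LineBreaks() {}

    /**
     * Replaces every embedded \r\n, \r, or \n in a field value with the
     * given replacement. Returns null for a null field.
     */
    static String replaceLineBreaks(String field, String replacement) {
        if (field == null) {
            return null;
        }
        // "\r\n" comes first in the alternation so a Windows-style break
        // is replaced once rather than as two separate breaks.
        return field.replaceAll("\r\n|\r|\n", Matcher.quoteReplacement(replacement));
    }
}
```

For example, `LineBreaks.replaceLineBreaks("new\r\nline", " ")` gives `new line`. Passing an empty replacement strips the breaks entirely.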

Let me know what you think. Maybe we can put our heads together and solve things like #18 and #3.

woodrad commented 9 years ago

I took some time to test this using command-line hive on my local machine and learned some more things. As demonstrated by the tests I added, when given multiline Text, the SerDe correctly returns text with newlines stripped. However, the input is split into lines before it reaches the SerDe, so the SerDe never sees the whole multiline record. Check it out.

hive -f test.hive

Logging initialized using configuration in jar:file:/usr/local/Cellar/hive/0.13.1/libexec/lib/hive-common-0.13.1.jar!/hive-log4j.properties
Added ~/csv-serde/target/csv-serde-1.1.2-0.11.0-all.jar to class path
Added resource: ~/csv-serde/target/csv-serde-1.1.2-0.11.0-all.jar
OK
Time taken: 45.311 seconds
OK
I was given: "hello","yes, ok","1","new
I returned: [hello, yes, ok, 1, null]
I was given: line"
I returned: [null, null, null, null]
hello yes, ok 1 NULL
NULL NULL NULL NULL
Time taken: 0.288 seconds, Fetched: 2 row(s)
OK
Time taken: 0.361 seconds

Is there anything we can do to force Hive to give us the entire csv file as Text?

Supporting gists: test.csv and test.hive.
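One possible direction, sketched here under assumptions rather than as a working Hive integration: the piece that reads the file would have to keep appending physical lines while a quoted field is still open, and only then hand the record to the SerDe. In Hive that logic would presumably live in a custom input format or record reader; the sketch below shows only the joining step on plain strings. `QuoteAwareRecords` and its methods are hypothetical names. It assumes quotes are escaped by doubling (""), not with a backslash, and it always restores the break as \n.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of quote-aware record joining; not Hive or csv-serde API.
final class QuoteAwareRecords {
    private QuoteAwareRecords() {}

    /**
     * Returns true when the text contains an odd number of double quotes.
     * A doubled quote ("") toggles twice, so it leaves the state unchanged.
     */
    static boolean hasOddQuoteCount(String text) {
        boolean odd = false;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '"') {
                odd = !odd;
            }
        }
        return odd;
    }

    /** Joins physical lines into logical CSV records. */
    static List<String> joinPhysicalLines(List<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean insideQuotes = false;
        for (String line : lines) {
            if (insideQuotes) {
                // Restore the break that the line-based reader consumed.
                current.append('\n');
            }
            current.append(line);
            if (hasOddQuoteCount(line)) {
                insideQuotes = !insideQuotes;
            }
            if (!insideQuotes) {
                records.add(current.toString());
                current.setLength(0);
            }
        }
        if (insideQuotes) {
            // Unterminated quote at end of input: emit what was collected.
            records.add(current.toString());
        }
        return records;
    }
}
```

Given the two fragments from the log above, `"hello","yes, ok","1","new` and `line"`, `joinPhysicalLines` returns a single record with the line break back inside the last field, which is what the SerDe would need to receive.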

woodrad commented 9 years ago

Cleaning the deprecated parts is a dupe of #8.

woodrad commented 9 years ago

It looks like this project is dead, so I'll maintain my changes in my fork. I'll leave these notes in closing, however.