woodrad closed this pull request 9 years ago
I took some time to test this using command-line Hive on my local machine and learned some more things. As demonstrated by the tests I added, when given multiline Text, the SerDe correctly returns text with newlines stripped. However, Hive splits the input on line breaks before it ever reaches the SerDe, so the SerDe only sees one fragment of the record at a time. Check it out.
Logging initialized using configuration in jar:file:/usr/local/Cellar/hive/0.13.1/libexec/lib/hive-common-0.13.1.jar!/hive-log4j.properties
Added ~/csv-serde/target/csv-serde-1.1.2-0.11.0-all.jar to class path
Added resource: ~/csv-serde/target/csv-serde-1.1.2-0.11.0-all.jar
OK
Time taken: 45.311 seconds
OK
I was given: "hello","yes, ok","1","new
I returned: [hello, yes, ok, 1, null]
I was given: line"
I returned: [null, null, null, null]
hello yes, ok 1 NULL
NULL NULL NULL NULL
Time taken: 0.288 seconds, Fetched: 2 row(s)
OK
Time taken: 0.361 seconds
Is there anything we can do to force Hive to give us the entire csv file as Text?
Cleaning the deprecated parts is a dupe of #8.
It looks like this project is dead, so I'll maintain my changes in my fork. I'll leave these notes in closing, however.
SET textinputformat.record.delimiter = 'myDelimiter'
will make the mappers Hive spawns pass multiple lines to the SerDe as a single record. This SerDe will read multiline input and return a row separated by \n in the right places.
tblproperties("skip.header.line.count"="1")
in the CREATE statement will skip the first row (or the first n rows, for a count of n), which is great for reading CSV files that include headers.
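As a rough sketch, the two settings above might be combined in a single Hive session as follows. The table name, the column names, and the SerDe class name are assumptions that do not appear in this thread; the jar path is the one from the log output, and the delimiter is a placeholder, written exactly as in the note above, that you would replace with a string that never occurs inside a record.

```sql
-- Jar path taken from the log output earlier in this thread.
ADD JAR ~/csv-serde/target/csv-serde-1.1.2-0.11.0-all.jar;

-- Placeholder delimiter, copied as written in the note above.
-- Pick a record separator that cannot appear inside a field.
SET textinputformat.record.delimiter = 'myDelimiter';

-- Hypothetical table: name and columns are illustrative only.
-- The SerDe class name is an assumption; use the class your jar provides.
CREATE TABLE my_csv_table (a STRING, b STRING, c STRING, d STRING)
ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count"="1");  -- skip one header row

SELECT * FROM my_csv_table;
```

Note that the table property only takes effect on Hive versions that support skip.header.line.count, so it is worth checking against the Hive version you target.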
This request is rather large, sorry about that. Before submitting this pull request, I ran the tests with each Hive version from 0.11.0 through 0.14.0 set in the pom, and all passed. Note that the Hive 0.14.0 dependencies do not resolve automatically.
Here is a summary of the small changes I made.
Finally, a summary of why I am submitting this pull request in the first place. opencsv does an alright job of managing embedded line breaks in csv files (it strips carriage returns and breaks that are not \n), but using this SerDe with Hive results in NULLs after every row containing a line break. I've included tests and code that will take \n, \r, and \r\n and output them as, , and respectively. I've singled out these breaks because they're the only ones defined in the csv standard.
Let me know what you think. Maybe we can put our heads together and solve things like #18 and #3.