ogrodnek / csv-serde

Hive SerDe for CSV
Apache License 2.0

Getting issue when we have CR (Carriage Return) in between fields in the input #18

Open nats82 opened 10 years ago

nats82 commented 10 years ago

I am facing an issue while parsing records that have a CR inside a field.

Example:

"18","Agent System Review","To identify the calls where the agents have taken time to research or check information in the system tools or CRM.^M ","b5b553d2-81ab-4ec3-83e0-71ae3cf4afab","1","8.63","9.58","10.49","70","NAEnglish TeleUniversal 8.0.1.91034"

There is a CR (^M) after "CRM" which causes the serde to treat what follows as a new record, even though this should be one record. Is there a way to handle this kind of input data issue in this serde?

billou2k commented 10 years ago

Same here. It looks like the issue is due to the InputFormat (TextInputFormat) splitting the text into records every time a line break exists (it doesn't care whether it's between quotes or not). The serde gets those partial records as input and cannot recreate your original record...
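
To illustrate the difference, here is a rough Python sketch (not Hive or serde code). The sample data is made up, and `split_like_line_reader` is only a hypothetical stand-in for a reader that breaks on every CR, LF, or CRLF without looking at quotes. Python's `csv` module plays the role of a quote-aware parser that sees the whole input.

```python
import csv
import io
import re

# Made-up input: two records, the first with a bare CR inside a quoted field.
raw = '"18","check the CRM.\r ","1"\n"19","ok","2"\n'

def split_like_line_reader(text):
    # Hypothetical stand-in for a line-based reader: break on CR, LF or CRLF,
    # ignoring whether the break sits inside quotes.
    return [line for line in re.split(r"\r\n|\r|\n", text) if line]

def parse_quote_aware(text):
    # A quote-aware parser that sees the whole stream keeps the CR in the field.
    return list(csv.reader(io.StringIO(text, newline="")))

naive = split_like_line_reader(raw)   # three fragments, two of them partial
aware = parse_quote_aware(raw)        # two complete records

print(len(naive), len(aware))
```

Once the input has been cut into the fragments in `naive`, no per-line parser can put the first record back together, which matches what the serde is seeing.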

nats82 commented 10 years ago

Thank you. I created some sed scripts to remove those CRs in between lines.
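
The sed scripts themselves are not shown in this thread. As an illustration only, the same kind of cleanup could look roughly like this in Python; `strip_breaks_in_quotes` is a hypothetical helper, and it assumes fields are quoted with `"` and that embedded quotes are doubled (`""`), not backslash-escaped.

```python
def strip_breaks_in_quotes(text, replacement=""):
    # Hypothetical helper: drop (or replace) CR/LF characters that occur
    # inside double-quoted fields, leaving record-ending line breaks alone.
    out = []
    in_quotes = False
    for ch in text:
        if ch == '"':
            # A doubled quote ("") toggles twice, so the state stays correct.
            in_quotes = not in_quotes
            out.append(ch)
        elif ch in "\r\n" and in_quotes:
            out.append(replacement)
        else:
            out.append(ch)
    return "".join(out)

# Made-up sample: a CR inside a quoted field, plus a doubled quote.
raw = '"18","check the CRM.\r ","1"\n"19","say ""hi""","2"\n'
cleaned = strip_breaks_in_quotes(raw)
```

After a pass like this, each record sits on one physical line, so a line-based input format hands the serde complete records.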
