lukasmartinelli / pgfutter

Import CSV and JSON into PostgreSQL the easy way
MIT License

Does not accept escaped " in CSV #7

Open houshuang opened 8 years ago

houshuang commented 8 years ago

I don't know if this is part of the official CSV specification (if there is one), but it would be useful to handle escaped quotation marks. For example, pgfutter chokes on this line:

"cV7QpZd-EeSKzSIAC0cT7w@2","cV7QpZd-EeSKzSIAC0cT7w",2,5,"{\"typeName\":\"cml\",\"definition\":{\"dtdId\":\"assess/1\",\"value\":\"<co-content><text>For the gene At3g59490, retrieve the corresponding protein sequence from TAIR</text><text>(http://www.arabidopsis.org/tools/bulk/sequences/index.jsp). Remember to choose the correct dataset and output option.</text><text>Now, navigate to BLASTP at NCBI and paste your genes sequence into the “query sequence” box. Set the database to “non-redundant protein sequences (nr)”, keep all settings at default, and click BLAST.</text><text>Take note of the top match (ortholog) for each of the other species, for 20 different species excluding your query species. Which species’ gene is most closely related to your query gene?</text></co-content>\"}}",2015-02-17 22:14:27.187

However, when I remove all \" with sed, it imports beautifully.

lukasmartinelli commented 8 years ago

That's the Golang CSV reader, which has a weird escaping rule compared to the rest of the world. https://golang.org/pkg/encoding/csv

"the ""word"" is true","a ""quoted-field"""

results in

{`the "word" is true`, `a "quoted-field"`}
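
For reference, the behaviour described above can be reproduced with a small standalone program. This is only a minimal sketch against the standard `encoding/csv` package; `parseLine` is a hypothetical helper written for this illustration, not part of pgfutter, and the second input is a shortened stand-in for the line from the report.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parseLine is a hypothetical helper: it reads one CSV record from a
// string using Go's standard reader with its default settings.
func parseLine(line string) ([]string, error) {
	r := csv.NewReader(strings.NewReader(line))
	return r.Read()
}

func main() {
	// Doubled quotes inside a quoted field are accepted and unescaped.
	fields, err := parseLine(`"the ""word"" is true","a ""quoted-field"""`)
	fmt.Printf("%q %v\n", fields, err)

	// Backslash-escaped quotes are not treated as an escape sequence,
	// so the reader reports an error for this record.
	_, err = parseLine(`"id","{\"typeName\":\"cml\"}"`)
	fmt.Println("backslash-escaped input rejected:", err != nil)
}
```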

I will look into whether the Go CSV reader can be configured to support custom escape characters.

leofidus commented 8 years ago

I wouldn't call Golang's CSV reader weird, that's just the weird CSV format. RFC 4180 says

If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example: "aaa","b""bb","ccc"
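
One possible workaround, given that rule, is to rewrite backslash-escaped quotes into doubled quotes before the data reaches the reader, so the quotes survive instead of being stripped. The sketch below is only an illustration under stated assumptions: `unescapeBackslashQuotes` and `parseBackslashEscaped` are hypothetical names, not pgfutter functions, and the plain string replacement assumes every `\"` in the input is meant as an escaped quote (a field ending in a literal backslash would be mangled).

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// unescapeBackslashQuotes is a hypothetical helper that turns every \"
// into "" so the line follows the RFC 4180 doubling rule.
// Assumption: \" never occurs where a literal backslash ends a field.
func unescapeBackslashQuotes(line string) string {
	return strings.ReplaceAll(line, `\"`, `""`)
}

// parseBackslashEscaped converts the line and then parses it with the
// standard library reader.
func parseBackslashEscaped(line string) ([]string, error) {
	r := csv.NewReader(strings.NewReader(unescapeBackslashQuotes(line)))
	return r.Read()
}

func main() {
	// Shortened stand-in for the kind of line in the original report.
	rec, err := parseBackslashEscaped(`"a","{\"k\":\"v\",\"n\":1}",2`)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for i, f := range rec {
		fmt.Printf("field %d: %s\n", i, f)
	}
}
```

Unlike removing `\"` with sed, this keeps the embedded JSON intact, at the cost of an extra pass over the input.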

lukasmartinelli commented 8 years ago

I wouldn't call Golang's CSV reader weird, that's just the weird CSV format. RFC 4180 says

Okay, that makes sense. But it would be cool if it were configurable (perhaps it is and I just haven't found out how).