ohmage / server

The ohmage server application.
37 stars 25 forks source link

escaping CRLF in csv output #460

Closed jeroen closed 11 years ago

jeroen commented 11 years ago

I am using the csv output to make a live snack dashboard, but it is breaking because some of the text responses contain newline characters which are not properly escaped. The result is that csv parsers think that it is the beginning of a new record, and then choke.

I verified with several clients, and even MS excel doesn't do it right. Example can be found on http://lausd.mobilizingcs.org, export campaign Snack P5 Jefferson. The 14th row.

jojenki commented 11 years ago

As stated in 4180:

Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all). If fields are not enclosed with double quotes, then double quotes may not appear inside the fields.

Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.

Because we have some fields that may have double quotes (i.e. user-generated strings), we must allow for our CSV fields to contain double quotes. While it states that Excel doesn't adhere to double quotes, it has since changed and does adhere to them now.

If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.

This was actually the issue in this particular case. There was a user-generated string with quotes, which was confusing the parser. This has been fixed by "sanitizing" all strings by replacing double quotes with a pair of double quotes.

Also, because some fields are quoted, it makes sense to quote all fields. This will make the output uniform across all fields and not require the parser to sometimes remove surrounding quotes and other times not. Again, any sane parser should be able to handle this. The specification explicitly states that this is acceptable but that some software (i.e. Microsoft Excel) may not adhere to this; however, we should program to the specification, not the exceptions. Also, Microsoft has fixed this.

jeroen commented 11 years ago

Unfortunately 4180 does not seem to be very well adopted. It might be safer for CSV output to replace double quotes appearing inside text with single quotes, and also strip any reserved characters?

jojenki commented 11 years ago

I am not a big fan of modifying a user's response. If a user uses the newlines as a logical delimiter, then removing or replacing the newlines with other characters could distort the user's original idea. For example:

"midnight" munchies, snack while i read and  do homework, 
tasty
healthy
tea relaxes me
craving cookies

If we made this one line, it would read:

"midnight" munchies, snack while i read and  do homework, tasty healthy tea relaxes me craving cookies

To me, this has a different connotation and sounds slightly nonsensical, but the original is relatively clear. Even simply replacing the double quotes with single quotes might distort the original thought if the user happened to use both single and double quotes in their response.

jeroen commented 11 years ago

So what did you end up doing? If there are reserved characters inside the text strings, my web front-ends, along with many other clients, will break. I.e. a csv file like this:

user, password
"dirk", "apple"
"peter", "bana
na"
"john", "pear"

A lot of clients will read 4 records instead of 3. Many will throw an error at the parsing that record 2 and 3 have an invalid format. Even worse, some will have a user na" in the dataset. The latter is what will happen with the snackboard, due to the csv parser in D3.

CSV is not a standardized format; rfc4180 is a memo that proposes one way of doing things which is not honored by all software. If you want it to work in most software, you want to be on the safe side and somehow escape the reserved characters (double quote and newline). Excel might be smart, but D3 and open office for example are not.

Our csv output is already highly pragmatic. The way multi_choice and custom_choice elements are encoced as little json objects inside the csv, which is in no way standard and also alters the interpretation. I really think that producing output that breaks many csv readers does not weigh up to the alternative interpretation of replacing \n with \t or "" with '.