qt4cg / qtspecs

QT4 specifications
https://qt4cg.org/

CSV Parsing - handling line ending normalization #1103

Open michaelhkay opened 6 months ago

michaelhkay commented 6 months ago

During discussion of PR #1066 there was much debate about how best to handle normalization of (typically CRLF) line endings.

Perhaps it's very unlikely that CRLF line endings will make it as far as the parse-csv() function, because they will already have been normalized, for example by unparsed-text(). But data can also be read in other ways, for example via the bin:read-binary() or sql:query() extension functions, or passed in as a string-valued parameter to a transformation.

Perhaps we should have a separate mechanism for normalizing line endings in any data, independent of CSV parsing? (But perhaps it's important to retain CRLF in quoted strings?)
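
To make the trade-off in the parenthetical concrete, here is a minimal sketch in Python (not XPath, and not any function from the spec) of a standalone normalizer that rewrites CRLF and lone CR to LF only outside quoted fields. The function name `normalize_outside_quotes` and the `quote` parameter are hypothetical, chosen for illustration; the sketch assumes a single-character quote and the usual CSV convention that a doubled quote inside a quoted field is an escaped quote.

```python
def normalize_outside_quotes(text: str, quote: str = '"') -> str:
    """Rewrite CRLF and lone CR to LF, but leave quoted fields untouched.

    Hypothetical helper for illustration only. A doubled quote ("")
    toggles the state twice, so escaped quotes do not break tracking.
    """
    out = []
    in_quotes = False
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == quote:
            # Entering or leaving a quoted field.
            in_quotes = not in_quotes
            out.append(ch)
        elif ch == "\r" and not in_quotes:
            # Row delimiter: emit LF and swallow the LF of a CRLF pair.
            out.append("\n")
            if i + 1 < len(text) and text[i + 1] == "\n":
                i += 1
        else:
            # Ordinary character, or a CR/LF inside a quoted field.
            out.append(ch)
        i += 1
    return "".join(out)


sample = 'a,"x\r\ny"\r\nb,c\r'
result = normalize_outside_quotes(sample)
```

In this sketch the CRLF inside `"x\r\ny"` survives while the row delimiters become LF. Note that such a mechanism needs to know the CSV quoting rules, which weakens the case for making it fully independent of CSV parsing.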

Perhaps CSV parsing should normalise CRLF unconditionally, without needing to set a special option for it?

ChristianGruen commented 6 months ago

> Perhaps CSV parsing should normalise CRLF unconditionally, without needing to set a special option for it?

I favor this option (it will probably be difficult or impossible to consider all other input channels). In many cases, it will simply pass the input unchanged.
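
The "pass the input unchanged" point can be illustrated with a short Python sketch (an illustration only, not the spec's algorithm; the name `normalize_line_endings` is hypothetical). It assumes that normalization means mapping CRLF and lone CR to a single LF.

```python
import re


def normalize_line_endings(text: str) -> str:
    """Map CRLF and lone CR to LF (hypothetical helper, for illustration)."""
    # "\r\n?" matches a CR optionally followed by LF, so a CRLF pair
    # is replaced as one unit rather than producing two newlines.
    return re.sub(r"\r\n?", "\n", text)


windows_text = "a,b\r\n1,2\r\n"
unix_text = "a,b\n1,2\n"

from_windows = normalize_line_endings(windows_text)
from_unix = normalize_line_endings(unix_text)
```

Input that contains no CR comes back unchanged, and applying the function twice gives the same result as applying it once, so running it unconditionally is harmless for data that has already been normalized upstream.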

michaelhkay commented 5 months ago

I propose to close this issue without action; I think the current spec is acceptable.

ChristianGruen commented 5 months ago

I would still love to get rid of the option. The normalization can easily be automated, and I don't see any advantage for the user to control this manually.

In addition, the current default is not satisfying, as it would require users to specify normalize-newlines either for this function or for unparsed-text whenever a Windows text file is parsed.

The normalization should also happen automatically for the pending fn:csv-doc function.

Moreover, the note for fn:csv-to-arrays should be revised:

> The default row delimiter is a single newline character U+000A (NEWLINE). If the content is read using the unparsed-text function, alternative line endings such as CR and CRLF will have been normalized to a single newline. […]