osiegmar / FastCSV

CSV library for Java that is fast, RFC-compliant and dependency-free.
https://fastcsv.org/
MIT License
542 stars 93 forks source link

Make the escape character configurable #103

Closed HauserV closed 7 months ago

HauserV commented 8 months ago

Is your feature request related to a problem? Please describe. I would like to import the CSVs into a database using DBeaver, which is impossible because it requires that "the separator, quote, and escape characters must be different".

Describe the solution you'd like I would like to be able to configure the escape character to something other than the double quote.

Describe alternatives you've considered I'm using a different CSV writer library at the moment. This issue prevents me from migrating to FastCSV.

RFC 4180 compliance Non-compliant. It complements the non-compliant CsvWriter.CsvWriterBuilder#quoteCharacter(char) rather well, though.

osiegmar commented 8 months ago

Given a field separator of ;, a quote character of ', an escape character of _ and the default quote strategy (quote only when necessary) – what would be the expected output of the following code?

// CsvWriter csv = ...
csv
  .writeRecord("foo", "bar")
  .writeRecord("foo,bar")
  .writeRecord("foo;bar")
  .writeRecord("with \" char")
  .writeRecord("with ' char")
  .writeRecord("with _ char");
HauserV commented 8 months ago

Thanks for looking into this! To make my use-case work, I would need it to match what OpenCSV does/can parse:

foo;bar
foo,bar
'foo;bar'
with " char
'with _' char'
'with __ char'

While this would solve my problem, forcing quotes around the last line isn't really necessary (e.g., PostgreSQL's COPY would produce with _ char instead; it parses both equally), but I can't think of a sensible selective way to extend the QuoteStrategy to let users pick one or the other. Something like NON_NULL could work for my use-case (I need to differentiate empty strings and nulls), but that would force quotes around everything in this case (a non-issue for me).

QuoteStrategy.REQUIRED/EMPTY

``` foo;bar foo,bar 'foo;bar' with " char 'with _' char' with _ char ``` DBeaver (OpenCSV) parses the last line as `with char` (a showstopper for me; reported as https://github.com/dbeaver/dbeaver/issues/22455). Apache Commons CSV parses it as `with _ char`, just as PostgreSQL does.

QuoteStrategy.ALWAYS/NON_EMPTY/NON_NULL

``` 'foo';'bar' 'foo,bar' 'foo;bar' 'with " char 'with _' char' 'with __ char' ```

What do you think?

osiegmar commented 8 months ago

Thanks for your thoughts.

Out of curiosity I did some analysis and testing today, but haven't decided whether or not to add this feature.

Apart from that, it's hard to believe that DBeaver has a (pretty configurable) CSV import, that actually is unable to handle proper CSV (as defined in RFC 4180). As they apparently use OpenCSV under the hood[^historical], it shouldn't be a problem.

This code uses OpenCSV:

var data = "\"foo \"\"is\"\" bar\",\"bar \\\"is\\\" foo\"";

var builder = new CSVReaderBuilder(new StringReader(data));

var parser = new CSVParserBuilder()
    .withQuoteChar('"')
    .withEscapeChar('\\')
    .build();

var csv = builder.withCSVParser(parser).build();

System.out.println(Arrays.toString(csv.readNext()));

...and the output is:

[foo "is" bar, bar "is" foo]

Just for the record: While this might be your desired behaviour, it's definitely broken according the RFC and one of the reasons for so many unexpected results as shown in https://github.com/osiegmar/JavaCsvComparison.

[^historical]: for historical reasons, of course 😉

HauserV commented 7 months ago

The issue with having two escape characters (let's say the double quote and the backslash) is that one must escape both while writing.

Suppose a pathological string foo "is\" bar. Both OpenCSV and FastCSV write this as "foo ""is\"" bar", just as the RFC says. While FastCSV can easily read this back, OpenCSV cannot recover the original string because it reads both "" and \" as the double quote.

If implemented, FastCSV writer configured to use the backslash as the escape character would produce "foo \"is\\\" bar" (just as PostgreSQL does) which can be easily parsed by OpenCSV.

osiegmar commented 7 months ago

In order to read RFC 4180 conforming data, OpenCSV provides a special RFC4180Parser, since 2017 / version 3.9:

var data = "\"foo \"\"is\\\"\" bar\"";

var builder = new CSVReaderBuilder(new StringReader(data));
var parser = new RFC4180ParserBuilder().build();
var csv = builder.withCSVParser(parser).build();

System.out.println(Arrays.toString(csv.readNext()));

...which gives the expected output of [foo "is\" bar].

If DBeaver does not provide a way to read CSV data that is perfectly valid according to RFC 4180, you may want to open another issue there. Adding a feature to FastCSV that produces "garbled" CSV files solely to enable DBeaver's import capability feels inappropriate. PostgreSQL also perfectly supports importing/exporting RFC conforming data.

...but I still perform testing/evaluation.

osiegmar commented 7 months ago

After spending a decent amount of time digging into this quite interesting feature, I decided against releasing an implementation.

The reasons are:

  1. Lack of standardization or at least consensus.

    • When writing:
      • When to quote? Per RFC, a field that does not contain a field separator, a quote character, or a linebreak does not need to be quoted. Is this list to be extended by the escape character?
      • When to escape? Only in quoted fields or also in unquoted fields?
      • What to escape? Only the quote character or also the escape character? Technically, only the quote character would be necessary.
    • When reading (the file might be from a different source):
      • What is the meaning of an escape character within an unquoted field?
      • Is the escape character also used to escape the escape character itself?
      • How to handle unescaped escape characters (e.g. a \ b)?
      • What characters are escaped (does \bar become bar, \bar or backspace character \b followed by ar)?
      • How to handle the occurrence of doubled double-quotes when a different escape character is used ("mixed mode")?

    While your example has shown a reasonable way to handle it, it differs from how PostgreSQL, MySQL and others are handling it (examples see below). While they also differ from each other.

  2. Contradicts the goals of FastCSV.
    • Performance: Depending on the exact implementation, it's tough to keep the performance high.
    • Compliant: It's a non-compliant feature.
    • Robustness: While technical robustness is not the problem, the feature would be a source of confusion, resulting in never-ending, unresolvable issues. The perceived robustness would suffer.
    • Maintainability: To stay high performant, the complexity raises, the maintainability decreases.

Answering the above (probably incomplete) list of questions may sound simple, but will vary between implementers and users.

The concept of a non-RFC style escape mechanism seems to be a feature that is fairly specific to the implementation. It does not seem to solve interoperability issues, but rather creates new ones. The tickets you've referenced in https://github.com/dbeaver/dbeaver/issues/22483 are a good example of this.

I've also checked some other CSV implementations and found varying behavior, which proves the point that there is no consensus on this topic.

Handling in PostgresSQL

The sole existence of an escape character in a field does not cause it to be escaped:

foo \ bar stays foo \ bar

while

foo "\ bar becomes "foo \"\\ bar"

Also, according to the documentation:

Because backslash is not a special character in the CSV format, ., the end-of-data marker, could also appear as a data value. To avoid any misinterpretation, a . data value appearing as a lone entry on a line is automatically quoted on output, and on input, if quoted, is not interpreted as the end-of-data marker. If you are loading a file created by another application that has a single unquoted column and might have a value of ., you might need to quote that value in the input file.

Handling in MySQL

According to the documentation:

For input, if the FIELDS ESCAPED BY character is not empty, occurrences of that character are stripped and the following character is taken literally as part of a field value. Some two-character sequences that are exceptions, where the first character is the escape character. These sequences are shown in the following table (using \ for the escape character).

The table then shows for example \b that is interpreted as a backspace character.