osiegmar / FastCSV

CSV library for Java that is fast, RFC-compliant and dependency-free.
https://fastcsv.org/
MIT License

[Feature Request] Add support for custom {Record, Line} delimiter #108

Closed: IIvm closed this issue 7 months ago

IIvm commented 7 months ago

Is your feature request related to a problem? Please describe. I am attempting to migrate our internal CSV parser to FastCSV. However, FastCSV does not support custom line delimiters. Here are some use cases affected by this limitation:

Describe the solution you'd like Add support for custom {Record, Line} delimiters.

Describe alternatives you've considered I haven't found any good alternatives in the Java world yet. Apache Commons CSV doesn't support it either, and it's much slower than FastCSV.

RFC 4180 compliance Would this feature comply with RFC 4180? Yes, it only adds support for different line delimiters without breaking the original RFC.

If this idea is acceptable, I would like to submit a pull request.

osiegmar commented 7 months ago

I couldn't find the normative specification of the TPC-H data format. According to the dbgen tool, these are ASCII files containing records that, by default, are separated by a pipe character (|) and terminated by a line-feed character (\n). Several examples are shown in the answers directory.

While this format is not CSV, its similarity should be sufficient for FastCSV to easily read and write such files by configuring the field separator to | (CsvReader.builder().fieldSeparator('|') / CsvWriter.builder().fieldSeparator('|')). If, for any reason, you need to add a field separator at the end of each line when writing such files, simply add one more null field to the record. When reading such files, you could ignore the last field in each record.
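For illustration, reading and writing such pipe-separated files could look like the following sketch. It assumes the FastCSV 2.x method names (build(...), getFields(), writeRow(...)), which differ in other versions, and the file names are just placeholders for dbgen output:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

import de.siegmar.fastcsv.reader.CsvReader;
import de.siegmar.fastcsv.reader.CsvRow;
import de.siegmar.fastcsv.writer.CsvWriter;

public class TpchPipeExample {

    public static void main(final String[] args) throws IOException {
        final Path in = Path.of("nation.tbl");       // hypothetical dbgen output
        final Path out = Path.of("nation-copy.tbl");

        try (CsvReader csv = CsvReader.builder()
                .fieldSeparator('|')
                .build(Files.newBufferedReader(in));
             CsvWriter writer = CsvWriter.builder()
                .fieldSeparator('|')
                .build(Files.newBufferedWriter(out))) {

            for (final CsvRow row : csv) {
                // dbgen terminates each record with a trailing '|', which FastCSV
                // parses as one extra empty field; drop it when reading ...
                final List<String> fields = new ArrayList<>(row.getFields());
                fields.remove(fields.size() - 1);

                // ... and append an empty field when writing to restore it.
                fields.add("");
                writer.writeRow(fields);
            }
        }
    }
}
```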

You may encounter problems when the data itself contains the field separator |, newline characters, or quotation marks. But the examples I have seen seem not to use these characters.

IIvm commented 7 months ago

Thanks for your feedback! But I think there are still some use cases that need a custom record delimiter; for example, Snowflake and MySQL both support self-defined record delimiters.

ref: https://docs.snowflake.com/en/sql-reference/sql/create-file-format

RECORD_DELIMITER = 'character' | NONE
Use: Data loading, data unloading, and external tables

Definition: One or more singlebyte or multibyte characters that separate records in an input file (data loading) or unloaded file (data unloading). Accepts common escape sequences or the following singlebyte or multibyte characters:

Singlebyte characters: Octal values (prefixed by \\) or hex values (prefixed by 0x or \x). For example, for records delimited by the circumflex accent (^) character, specify the octal (\\136) or hex (0x5e) value.

Multibyte characters: Hex values (prefixed by \x). For example, for records delimited by the cent (¢) character, specify the hex (\xC2\xA2) value.

The delimiter for RECORD_DELIMITER or FIELD_DELIMITER cannot be a substring of the delimiter for the other file format option (e.g. FIELD_DELIMITER = 'aa' RECORD_DELIMITER = 'aabb').

Is there any way I can implement this feature with FastCSV without the self-defined record delimiter support?

osiegmar commented 7 months ago

Is there any way I can implement this feature with FastCSV without the self-defined record delimiter support?

To make use of custom line/record delimiters with FastCSV, you may create an implementation of java.io.Reader that replaces the record delimiter with the standard line-feed character. Then, pass this customized Reader to the CsvReader. Similarly, achieve the same for the CsvWriter by implementing a custom java.io.Writer that replaces the line-feed character with the record delimiter.
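For example, a minimal sketch of such delimiter-translating wrappers could look like this. The class names RecordDelimiterReader and RecordDelimiterWriter are made up here for illustration, and a single-character record delimiter is assumed:

```java
import java.io.FilterReader;
import java.io.FilterWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

// Translates a custom single-character record delimiter (e.g. '^') to '\n'
// while reading, so FastCSV sees standard line feeds.
final class RecordDelimiterReader extends FilterReader {

    private final char delimiter;

    RecordDelimiterReader(final Reader in, final char delimiter) {
        super(in);
        this.delimiter = delimiter;
    }

    @Override
    public int read() throws IOException {
        final int c = super.read();
        return c == delimiter ? '\n' : c;
    }

    @Override
    public int read(final char[] cbuf, final int off, final int len) throws IOException {
        final int n = super.read(cbuf, off, len);
        for (int i = off; i < off + Math.max(n, 0); i++) {
            if (cbuf[i] == delimiter) {
                cbuf[i] = '\n';
            }
        }
        return n;
    }
}

// Counterpart for writing: replaces the '\n' emitted by FastCSV with the
// custom record delimiter.
final class RecordDelimiterWriter extends FilterWriter {

    private final char delimiter;

    RecordDelimiterWriter(final Writer out, final char delimiter) {
        super(out);
        this.delimiter = delimiter;
    }

    @Override
    public void write(final int c) throws IOException {
        super.write(c == '\n' ? delimiter : c);
    }

    @Override
    public void write(final char[] cbuf, final int off, final int len) throws IOException {
        for (int i = off; i < off + len; i++) {
            write(cbuf[i]);
        }
    }

    @Override
    public void write(final String str, final int off, final int len) throws IOException {
        for (int i = off; i < off + len; i++) {
            write(str.charAt(i));
        }
    }
}
```

Both wrappers can then be passed to CsvReader / CsvWriter like any other Reader or Writer. Note that this naive translation also rewrites line feeds inside quoted fields, so it only works when the field data itself contains no newline characters.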

But I think there are still some use cases that need a custom record delimiter; for example, Snowflake and MySQL both support self-defined record delimiters.

The mere presence of this feature in other implementations does not justify its inclusion in FastCSV. Could you share a concrete use case where this feature would be required in the context of CSV (which is not the case for TPC-H data)? Preferably something with a normative specification.

Currently, I don't see how this feature aligns with the goals of FastCSV.