Add support for embedded line breaks in CSV

trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

https://trino.io

Apache License 2.0

10.27k stars 2.95k forks source link

Add support for embedded line breaks in CSV #8350

Open bo5o opened 3 years ago

bo5o commented 3 years ago

Using the Hive connector, I am trying to read a CSV which contains cells that have embedded new lines.

The RFC has it covered (page 2):

Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example:

"aaa","b CRLF bb","ccc" CRLF zzz,yyy,xxx

Here is an example CSV

"aaa","b
bb","ccc"
zzz,yyy,xxx

which I try to query from a table

CREATE TABLE example (
    "col1" VARCHAR, 
    "col2" VARCHAR, 
    "col3" VARCHAR
) with (
    format = 'CSV',
    external_location = '/path/to/csv',
    csv_separator = ',',
    csv_quote='"'
)

select * from example

that returns

col1 | col2 | col3
aaa  |      |      
zzz  | yyy  | xxx

realknorke commented 1 year ago

I stumbled upon this issue today. I had to transform the data from CSV (with newlines in values/columns) to parquet in order for Trino to read it…

hashhar commented 1 year ago

Trino uses the OpenCSVSerde from Hive to read CSV tables and that serde has a number of limitations - documented https://docs.aws.amazon.com/athena/latest/ug/csv-serde.html

This would need to be fixed in the serde.

realknorke commented 1 year ago

We're using another CSV Implementation because OpenCSV is extremely slow. But the CSV implementation is not the problem here. I suppose the problem is the RecordReader which is for TextInputFormat just line-based. That means that the RecordReader is searching for a delimiter and then (after that) is parsing a record using a CSVParser.

If my (quick and shallow) code analysis is correct then for CSV values with newlines in it to be parsable Trino needs a completely new RecordReader/TextInputFormat which is CSV-aware.

Overall it shows that CSV is all but a simple format.

hashhar commented 1 year ago

Interesting, thanks for digging into the code. But then you loose the splittable nature of current CSV reading mecahanism and you'll be limited to single reader per CSV file instead of having multiple splits read in parallel.

Tradeoffs on both sides it seems.

findepi commented 1 year ago

The RFC has it covered (page 2):

Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example: "aaa","b CRLF bb","ccc" CRLF zzz,yyy,xxx

Thanks for the RFC pointer. Note that for Hive connector behavior, the Hive itself is the reference implementation.

Does Apache Hive support CSV files with embedded line breaks? If not, we shouldn't add such a change to Trino.

realknorke commented 1 year ago

@findepi its a bad idea to re-implement wrong behaviour just to be compatible with legacy systems. That's what Microsoft did wrong for years. You cannot succeed to Hive if you're doing the same mistakes. Just my 2¢. ;)

Regarding newlines in CSV values in Hive: It seems Hive cannot handle that. BUT You can define (write) your own INPUTFORMAT and add this class as a table property. By doing so it is possible to generate correct (like RFC4180) results from wellformed CSV input. Is it possible to do something like this in Trino? Is it necessary to create a completely new FORMAT (table property)?

findepi commented 1 year ago

its a bad idea to re-implement wrong behaviour just to be compatible with legacy systems.

that's what Hive connector is.

I agree this isn't awesome path, so I do recommend you try out Iceberg and Delta connectors as well

bass2008 commented 3 months ago

I encountered with this issue. Do you plan to fix it ?