Open bo5o opened 3 years ago
I stumbled upon this issue today. I had to transform the data from CSV (with newlines in values/columns) to Parquet in order for Trino to read it…
Trino uses the OpenCSVSerde from Hive to read CSV tables, and that serde has a number of limitations, documented at https://docs.aws.amazon.com/athena/latest/ug/csv-serde.html
This would need to be fixed in the serde.
We're using another CSV implementation because OpenCSV is extremely slow, but the CSV implementation is not the problem here. I suppose the problem is the RecordReader, which for TextInputFormat is purely line-based. That means the RecordReader first searches for a line delimiter and only afterwards parses the resulting record using a CSVParser.
If my (quick and shallow) code analysis is correct, then for CSV values containing newlines to be parsable, Trino needs a completely new, CSV-aware RecordReader/TextInputFormat.
Overall, it shows that CSV is anything but a simple format.
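To illustrate the point about line-based record reading, here is a minimal Python sketch. It is not Trino or Hive code: the function names and the sample data are hypothetical, and Python's standard `csv` module stands in for the CSV parser. It only shows that cutting records at newlines before parsing produces a different result than letting a CSV-aware parser find the record boundaries.

```python
import csv
import io

# Hypothetical sample: the second field of the first record contains a line break.
DATA = '"aaa","b\nbb","ccc"\nzzz,yyy,xxx\n'


def line_based_records(text):
    """Mimic a line-oriented reader: split at newlines first, then parse each line as CSV."""
    records = []
    for line in text.splitlines():
        # Each physical line is handed to the CSV parser in isolation,
        # so the parser never sees that the quoted field continues on the next line.
        records.append(next(csv.reader([line])))
    return records


def csv_aware_records(text):
    """Let the CSV parser decide where records end, so quoted newlines stay inside the field."""
    return list(csv.reader(io.StringIO(text, newline="")))


if __name__ == "__main__":
    print(len(line_based_records(DATA)))  # three physical lines, so three "records"
    print(csv_aware_records(DATA))        # two records, one with an embedded newline
```

The line-based variant reports three records for two logical rows, which is the kind of mismatch described above.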
Interesting, thanks for digging into the code. But then you lose the splittable nature of the current CSV reading mechanism, and you'll be limited to a single reader per CSV file instead of having multiple splits read in parallel.
Tradeoffs on both sides it seems.
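The splittability concern can be sketched in Python as well. This is an illustration under simplifying assumptions, not how Trino computes splits: the helper names and sample data are hypothetical, and quote handling is reduced to toggling on every double quote. A reader that starts at an arbitrary offset and skips to the next newline cannot tell whether that newline sits inside a quoted field, whereas a scan from the start of the file can.

```python
# Hypothetical sample: the first record contains a quoted line break.
DATA = '"aaa","b\nbb","ccc"\nzzz,yyy,xxx\n'


def naive_split_start(data, offset):
    """Line-based split logic: start reading just past the first newline at or after offset."""
    newline = data.find("\n", offset)
    return len(data) if newline == -1 else newline + 1


def true_record_starts(data):
    """Scan from the beginning, tracking quote state, and collect real record start positions."""
    starts = [0]
    in_quotes = False
    for i, ch in enumerate(data):
        if ch == '"':
            # An escaped quote ("") toggles twice, leaving the state unchanged.
            in_quotes = not in_quotes
        elif ch == "\n" and not in_quotes and i + 1 < len(data):
            starts.append(i + 1)
    return starts


if __name__ == "__main__":
    guess = naive_split_start(DATA, 3)          # split begins in the middle of record 1
    print(guess, guess in true_record_starts(DATA))
```

Here a split starting at offset 3 resumes at position 9, which is in the middle of the quoted field rather than at a record boundary. Finding the correct boundary required scanning from the beginning of the file, which is what makes parallel splits hard.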
The RFC has it covered (page 2):
- Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. For example:
  "aaa","b CRLF
  bb","ccc" CRLF
  zzz,yyy,xxx
Thanks for the RFC pointer. Note that for Hive connector behavior, Hive itself is the reference implementation.
Does Apache Hive support CSV files with embedded line breaks? If not, we shouldn't add such a change to Trino.
@findepi it's a bad idea to re-implement wrong behaviour just to be compatible with legacy systems. That's what Microsoft did wrong for years. You cannot succeed Hive if you're making the same mistakes. Just my 2¢. ;)
Regarding newlines in CSV values in Hive:
It seems Hive cannot handle that. But you can define (write) your own INPUTFORMAT and add this class as a table property. By doing so, it is possible to generate correct (RFC 4180-compliant) results from well-formed CSV input.
Is it possible to do something like this in Trino? Is it necessary to create a completely new FORMAT (table property)?
"It's a bad idea to re-implement wrong behaviour just to be compatible with legacy systems."
That's what the Hive connector is.
I agree this isn't an awesome path, so I do recommend you try out the Iceberg and Delta connectors as well.
I encountered this issue as well. Do you plan to fix it?
Using the Hive connector, I am trying to read a CSV which contains cells that have embedded new lines.
Here is an example CSV
which I try to query from a table
that returns