substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
https://substrait.io
Apache License 2.0
1.18k stars 154 forks source link

feat: add CSV (text) file support #646

Closed EpsilonPrime closed 1 month ago

EpsilonPrime commented 4 months ago

Gluten's version of text file support relies on a named schema which shouldn't be necessary given that by its given nature a text file is comprised of strings. This version doesn't attempt to define a schema or ways to mangle text into that schema.

EpsilonPrime commented 4 months ago

This seems like a reasonably small set of the most common options. I think any arguing about CSV options quickly becomes subjective but I'll pitch the following:

Common options not included here:

  • Some ability to set the column types (are you proposing that casts be used for this purpose?)
  • Compression (many CSV files are compressed with some kind of compression scheme)
  • Newline character (\r, \n, \r\n, though I think this can often be inferred pretty accurately)
  • Skip (some number of lines at the top of the file to skip)

Options included here that don't feel very common:

  • max_block_size strikes me as something only needed in more exotic files

Setting the schema here doesn't seem appropriate. All of the other formats have a schema defined internally. In absence of anything internal definition, the schema for CSVs is actually text. That said there is a base schema in the ReadRel relation itself that one could go try to cast the data to if one wanted to force the schema of the relation for ease of use. I do prefer casting, however.

I've added a comment about compression. All of the compression formats are easily determined by the MAGIC file header so specifying the compression externally seems extraneous.

The header option removes the first row to get the names (which we don't use). I could replace that with a number of lines to skip instead.

I agree that newlines seem to be fairly standard these days and probably don't need to be included as an option.

jacques-n commented 2 months ago

I agree with @westonpace that schema/parsing desires should be included in configuration. As a concrete example, pushing down filters requires data type to be done with CSVs and is pretty common to avoid things like building up large arrow datasets and than deleting most of the records.

I also agree on skip lines. Super common and frequently useful.

Compression also feels like it should be an option but I'm torn on what that should be and would prefer to solve once we have more real use cases.

EpsilonPrime commented 1 month ago

I also agree on skip lines. Super common and frequently useful.

Added skip_lines which will skip lines. The treat_first_row_as_header option is defined to work after that but we could conceivably just not have a treat_first_row_as_header option since we don't use the names in Substrait.

Edit: Decided to remove treat_first_row_as_header option after all.

EpsilonPrime commented 1 month ago

The only topic left that hasn't reached consensus is the type handling. Do we want to assume the return type is all strings (nullable or not) as currently written or should we expect that the reader uses the ReadRel's schema?

jacques-n commented 1 month ago

The only topic left that hasn't reached consensus is the type handling. Do we want to assume the return type is all strings (nullable or not) as currently written or should we expect that the reader uses the ReadRel's schema?

My suggestion is we merge this and that schema/type handling (beyond strings) can be an enhancement. For this, I suggest all strings are nullable.

EpsilonPrime commented 1 month ago

Anyone else interested in weighing in? I believe this PR is ready to merge.