substrait-io / substrait

A cross platform way to express data transformation, relational algebra, standardized record expression and plans.
https://substrait.io
Apache License 2.0

Add CSV FileFormat in Substrait #174

Open sanjibansg opened 2 years ago

sanjibansg commented 2 years ago

Following up on https://github.com/substrait-io/substrait/issues/138, we can implement the CSV file format by defining the messages below. (Prototype code can be found here.)

message CSVConvertOptions {
  bool ignore_check_utf8 = 1;
  repeated string null_values = 2;
  repeated string true_values = 3;
  repeated string false_values = 4;
  bool strings_can_be_null = 5;
  bool quoted_strings_cannot_be_null = 6;
  bool auto_dict_encode = 7;
  int32 auto_dict_max_cardinality = 8;
  string decimal_point = 9;
  repeated string include_columns = 10;
  bool include_missing_columns = 11;
}

message CSVReadOptions {
  bool no_use_threads = 1;
  int32 block_size = 2;
  int32 skip_rows = 3;
  int32 skip_rows_after_names = 4;
  repeated string column_names = 5;
  bool autogenerate_column_names = 6;
}

message CSVParseOptions {
  string delimiter = 1;
  bool quoting = 2;
  string quote_char = 3;
  bool double_quote = 4;
  bool escaping = 5;
  string escape_char = 6;
  bool newlines_in_values = 7;
  bool ignore_empty_lines = 8;
}

message CSVOptions {
  CSVParseOptions parse_options = 1;
  CSVConvertOptions convert_options = 2;
  CSVReadOptions read_options = 3;
}

The file_type can then be defined with a oneof:

      oneof file_type {
        FileFormat format = 5;
        CSVOptions csv_options = 6;
      }

We can start with this and then develop a generic implementation using google.protobuf.Any, with separate .proto files defining each file format.

westonpace commented 2 years ago

Thanks @sanjibansg. I can add a bit of context. These are based on Arrow's CSV reader implementation. There is a similar "giant block of CSV options" in pandas. My big question (for the Substrait community) is whether something like this is in scope for Substrait and, if so, how it should be added.

jacques-n commented 2 years ago

I think it should be partially added to core Substrait. Some of these things seem very Arrow-specific, while others seem very generic (specific: use threads; generic: delimiter).

Let's start by focusing on adding the things that are common to most delimited text readers. Then we can potentially define some structured hints that may be useful but could be ignored. For example, use threads feels like a hint, not a semantic piece of information (implementations could ignore it and still produce logically equivalent results). Some of these options also don't really make sense here. For example, I don't know what column names would mean in the context of Substrait (and there are several properties focused on this).
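
To make the distinction between semantic options and hints concrete, here is a rough sketch using only Python's standard csv module. It is not part of the proposal or of any Substrait implementation: the names DelimitedTextOptions, ReaderHints, and read_rows are hypothetical, and the choice of which fields count as "core" is an assumption made for illustration.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DelimitedTextOptions:
    """Hypothetical core options: changing any of these can change the rows read."""
    delimiter: str = ","
    quote_char: str = '"'
    double_quote: bool = True
    escape_char: Optional[str] = None
    skip_rows: int = 0


@dataclass(frozen=True)
class ReaderHints:
    """Hypothetical hints: a reader may ignore these without changing the result."""
    use_threads: bool = True
    block_size: int = 1 << 20


def read_rows(
    text: str,
    options: DelimitedTextOptions,
    hints: Optional[ReaderHints] = None,
) -> List[List[str]]:
    # The hints argument is accepted but deliberately unused: this
    # single-threaded reader ignores it and returns the same rows either way.
    reader = csv.reader(
        io.StringIO(text),
        delimiter=options.delimiter,
        quotechar=options.quote_char,
        doublequote=options.double_quote,
        escapechar=options.escape_char,
    )
    rows = list(reader)
    # Drop the requested number of leading rows after parsing.
    return rows[options.skip_rows:]


sample = 'a;b\n1;"x;y"\n'
semicolon = DelimitedTextOptions(delimiter=";")

with_hints = read_rows(sample, semicolon, ReaderHints(use_threads=False))
without_hints = read_rows(sample, semicolon)
print(with_hints == without_hints)  # hints do not affect the logical result
print(read_rows(sample, DelimitedTextOptions(delimiter=",")) == without_hints)
```

In this sketch, changing the delimiter changes the rows that come back, while passing or omitting the hints does not, which is the property that would separate the two groups of fields.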