[BUG] Parquet column selection by name with schemas including list<struct<X, Y>> does not work.

rapidsai / cudf

cuDF - GPU DataFrame Library

https://docs.rapids.ai/api/cudf/stable/

Apache License 2.0

[BUG] Parquet column selection by name with schemas including list<struct<X, Y>> does not work. #14539

Open nvdbaranec opened 9 months ago

nvdbaranec commented 9 months ago

If you have a schema that contains a list-of-struct, selecting a subset of the inner columns doesn't work. For example, take a column of type list<struct<int, float>>. If the schema for this column was

A           (list)
   B        (struct)
       C    (int)
       D    (float)

Attempting to select "A.B.C" would not work. I believe this is being caused by some schema preprocessing that we are doing that is injecting fake schema elements to ease schema interpretation. Essentially we see a schema that looks like this:

A            (list)
  list       (the fake element)
     B       (struct)
        C    (int)
        D    (float)

So "A.B.C" doesn't actually exist, only "A.list.B.C" and the code returns 0 columns.
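Wherever the intermediate element comes from, the mismatch is between the path a user writes and the path stored in the schema tree. A minimal sketch of one way to bridge that gap, using a hypothetical `SchemaNode` type and `resolve` helper (neither is cuDF code; the "wrapper" kind is an assumption made for illustration):

```python
from dataclasses import dataclass, field


@dataclass
class SchemaNode:
    name: str
    kind: str  # "list", "wrapper", "struct" or "leaf" (labels made up for this sketch)
    children: list = field(default_factory=list)


def resolve(root, dotted):
    """Map a user-facing dotted path to the physical path, or return None.

    Wrapper nodes are stepped through transparently unless the user
    names them explicitly.
    """
    names = dotted.split(".")
    if names[0] != root.name:
        return None
    node, physical = root, [root.name]
    for name in names[1:]:
        children = node.children
        # Skip a lone wrapper child when the requested name is not the wrapper itself.
        while (len(children) == 1 and children[0].kind == "wrapper"
               and children[0].name != name):
            physical.append(children[0].name)
            children = children[0].children
        match = next((c for c in children if c.name == name), None)
        if match is None:
            return None
        physical.append(match.name)
        node = match
    return ".".join(physical)


# The schema from the example above, including the intermediate "list" element.
schema = SchemaNode("A", "list", [
    SchemaNode("list", "wrapper", [
        SchemaNode("B", "struct", [
            SchemaNode("C", "leaf"),
            SchemaNode("D", "leaf"),
        ]),
    ]),
])

print(resolve(schema, "A.B.C"))       # short form
print(resolve(schema, "A.list.B.C"))  # explicit form
```

With this approach both spellings resolve to the same physical column, so a caller would not need to know whether the intermediate element is present.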

nvdbaranec commented 9 months ago

Actually, upon further review, this mystery "list" element is in the parquet file itself (it's one of the odd ways in which the spec allows you to specify list columns). A question here, though, is what a user would expect the correct way to be. For Pandas or Spark, would you expect to have to put "list" in there when selecting a subset of columns? @jlowe @shwina

hyperbolic2346 commented 9 months ago

The schema for this part of the file is

  optional group field_id=-1 func_params (List) {
    repeated group field_id=-1 list {
      optional group field_id=-1 item {
        optional int32 field_id=-1 order;
        optional int32 field_id=-1 size;
        optional binary field_id=-1 type (String);
      }
    }
  }

revans2 commented 9 months ago

Unfortunately, unless you can normalize the schema, it is not clear, because there are multiple ways to encode a list in the schema and the standard naming is not "required".

https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists

Ideally the repeated group is called "list" but

https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules

gives a lot of other options
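To make the normalization question concrete, here is a rough sketch of how a reader might decide whether the repeated field under a LIST-annotated group is the element itself or only a wrapper around it. It follows my reading of the four backward-compatibility rules in the linked document; the function name and parameters are hypothetical, and the spec remains the authority (newer revisions may add cases not covered here).

```python
def repeated_field_is_element(list_name, repeated_name, repeated_is_group, num_fields):
    """Return True if the repeated field is itself the list element
    (legacy two-level encodings), False if it only wraps the element
    (standard three-level encoding).

    list_name:         name of the LIST-annotated outer group
    repeated_name:     name of the repeated field inside it
    repeated_is_group: whether the repeated field is a group
    num_fields:        number of fields in the repeated group (ignored for primitives)
    """
    if not repeated_is_group:
        return True  # rule 1: a repeated primitive is the element
    if num_fields > 1:
        return True  # rule 2: a repeated group with several fields is the element
    if repeated_name == "array" or repeated_name == list_name + "_tuple":
        return True  # rule 3: legacy names mark the group as the element
    return False     # rule 4: otherwise the single child field is the element


# The func_params schema quoted earlier: repeated group "list" with one field "item".
print(repeated_field_is_element("func_params", "list", True, 1))
```

Under this reading, the repeated group in the func_params schema is only a wrapper, and the element is its single child, item.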

nvdbaranec commented 9 months ago

Right, that's the question: are these details something you'd expect the end user to know or care about, or would they just expect "A.B.C"? Maybe this is a what-would-Pandas-do question.

etseidl commented 9 months ago

@nvdbaranec It's been a few years, but I believe the way to query in the above situation is to use explode to convert the list to separate rows. If there were another column at the top of the hierarchy ('X'), then the value for 'X' would be repeated for each new row that the list 'A' was exploded into. Here's a pyspark query I did years ago against the data @hyperbolic2346 quoted above:

df.createOrReplaceTempView("asm")

sql = """
select func_name, func_addr_start, blk_addr_start, blk_id, flatten(sources.asm) as asm from (
  select func_name, func_addr_start, bb.blk_addr_start, bb.blk_id, filter(bb.sources,x->x.asm_scrub_type = 'no_scrub') as sources
    from (select func_name, func_addr_start, explode(basic_blocks) as bb from asm))
where func_name='introduce'
"""
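For readers unfamiliar with explode, the row-repetition behavior described above can be sketched in plain Python. This is only an illustration of the semantics on a list of dicts, not Spark's implementation; the `explode` helper and the sample rows are made up for the example.

```python
def explode(rows, column):
    """Turn each element of the list in `column` into its own row,
    repeating the other column values. Rows whose list is empty or
    None produce no output rows."""
    out = []
    for row in rows:
        for element in row[column] or []:
            new_row = dict(row)        # copy the sibling columns, e.g. 'X'
            new_row[column] = element  # replace the list with one element
            out.append(new_row)
    return out


rows = [
    {"X": 1, "A": [{"C": 10, "D": 0.5}, {"C": 20, "D": 1.5}]},
    {"X": 2, "A": []},
]

exploded = explode(rows, "A")
for r in exploded:
    # After exploding, the inner struct fields are reachable without any list level.
    print(r["X"], r["A"]["C"])
```

Once the list has been exploded, each row holds a single struct, so the inner fields can be selected directly and the name of the intermediate list element no longer matters.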