Open nvdbaranec opened 9 months ago
Actually, upon further review, this mystery "list" element is in the parquet file itself (it's one of the odd ways in which the spec allows you to specify list columns). The question here, though, is what a user would expect the correct way to be. For Pandas or Spark, would you expect to have to put "list" in there when selecting a subset of columns? @jlowe @shwina
The schema for this part of the file is:
optional group field_id=-1 func_params (List) {
  repeated group field_id=-1 list {
    optional group field_id=-1 item {
      optional int32 field_id=-1 order;
      optional int32 field_id=-1 size;
      optional binary field_id=-1 type (String);
    }
  }
}
Unfortunately, unless you can normalize the schema, it is not clear, because there are multiple ways to encode a list in the schema and the standard layout is not "required":
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists
Ideally the repeated group is called "list", but
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules
gives a lot of other options.
Right, that's the question: are these details something you'd expect the end user to know or care about, or would they just expect "A.B.C"? Maybe this is a what-would-Pandas-do question.
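To make the question concrete, here is a rough sketch of what accepting both spellings could look like. It is not how cudf, Pandas, or Spark resolve column paths; `Node`, `find`, and `resolve` are hypothetical names invented for illustration, and the only rule applied is "if a name is not found under a LIST-annotated group, look through its single repeated child". The full backward-compatibility rules in the spec are not handled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for one Parquet schema element."""
    name: str
    repetition: str = "optional"   # "required", "optional", or "repeated"
    is_list: bool = False          # True when annotated with the LIST logical type
    children: List["Node"] = field(default_factory=list)


def find(nodes: List[Node], name: str) -> Optional[Node]:
    """Return the node with the given name, or None if there is none."""
    return next((n for n in nodes if n.name == name), None)


def resolve(top_level: List[Node], dotted: str) -> Optional[List[str]]:
    """Map a user-facing dotted path to the physical path in the file.

    The repeated wrapper directly under a LIST group may be omitted by the
    user; it is re-inserted here. Returns None if the path does not exist.
    """
    physical: List[str] = []
    level = top_level
    parent_is_list = False
    for part in dotted.split("."):
        node = find(level, part)
        if (node is None and parent_is_list
                and len(level) == 1 and level[0].repetition == "repeated"):
            # Look through the wrapper (often, but not always, named "list").
            wrapper = level[0]
            physical.append(wrapper.name)
            node = find(wrapper.children, part)
        if node is None:
            return None
        physical.append(node.name)
        level = node.children
        parent_is_list = node.is_list
    return physical


# The func_params schema quoted in this thread.
schema = [
    Node("func_params", is_list=True, children=[
        Node("list", repetition="repeated", children=[
            Node("item", children=[Node("order"), Node("size"), Node("type")]),
        ]),
    ]),
]

short_form = resolve(schema, "func_params.item.order")
long_form = resolve(schema, "func_params.list.item.order")
missing = resolve(schema, "func_params.order")
```

With this approach the user can write either `func_params.item.order` or `func_params.list.item.order` and get the same physical path, while a path that names no real field still resolves to nothing.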
@nvdbaranec It's been a few years, but I believe the way to query in the above situation is to use explode to convert the list to separate rows. If there were another column at the top of the hierarchy ('X'), then the value for 'X' would be repeated for each new row that the list 'A' was exploded into. Here's a pyspark query I did years ago against the data @hyperbolic2346 quoted above:
# Assumes `df` is a DataFrame loaded from the Parquet file and `spark` is the active SparkSession.
df.createOrReplaceTempView("asm")
sql = """
select func_name, func_addr_start, blk_addr_start, blk_id, flatten(sources.asm) as asm from (
    select func_name, func_addr_start, bb.blk_addr_start, bb.blk_id,
           filter(bb.sources, x -> x.asm_scrub_type = 'no_scrub') as sources
    from (select func_name, func_addr_start, explode(basic_blocks) as bb from asm))
where func_name = 'introduce'
"""
result = spark.sql(sql)
If you have a schema that contains a list-of-struct, selecting a subset of the inner columns doesn't work. Example:
list<struct<int, float>>
Suppose "A" is such a list column whose struct element "B" has a field "C". Attempting to select "A.B.C" would not work. I believe this is caused by some schema preprocessing we do that injects fake schema elements to ease schema interpretation. Essentially, we see a schema with an extra "list" element between "A" and "B".
So "A.B.C" doesn't actually exist, only "A.list.B.C", and the code returns 0 columns.
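The failure mode can be reproduced with a toy model of exact path matching. This is only a sketch of the symptom, not the reader's actual selection code; the nested-dict schema, `leaf_paths`, and `select` are hypothetical, and the names A, B, C, and "list" are the ones used above.

```python
def leaf_paths(schema, prefix=()):
    """Return the dotted path of every leaf column in a nested-dict schema."""
    paths = []
    for name, children in schema.items():
        path = prefix + (name,)
        if children:
            paths.extend(leaf_paths(children, path))
        else:
            paths.append(".".join(path))
    return paths


def select(schema, requested):
    """Keep the leaves that equal `requested` or sit underneath it."""
    return [p for p in leaf_paths(schema)
            if p == requested or p.startswith(requested + ".")]


# The schema as the reader sees it: "list" sits between A and B.
schema = {"A": {"list": {"B": {"C": {}}}}}

without_list = select(schema, "A.B.C")
with_list = select(schema, "A.list.B.C")
```

Here `without_list` comes back empty while `with_list` contains the one leaf, which is the "0 columns" behaviour reported.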