hannes / miniparquet

Library to read a subset of Parquet files

Plan to support recursive data structures? #6

Open · MichaelChirico opened this issue 5 years ago

MichaelChirico commented 5 years ago

A lot of my common use cases store map & array data types. It would be great if miniparquet could read such Parquet files.
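
Roughly what I'm hoping to be able to do (a sketch only — read_parquet here is a hypothetical entry point, not necessarily an existing miniparquet function):

library(miniparquet)
# hypothetical: read a file whose schema includes map/array columns
df = read_parquet('with_nested_columns.parquet')
str(df$mp)  # e.g. a list-column, one entry per row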

Is this out of scope?

hannes commented 5 years ago

Are they stored as nested tables or more complex values? Also, can you provide some sample files please?

MichaelChirico commented 5 years ago

I'm not sure how to answer the storage question, but the Hive types are array and/or map. Those types are potentially recursive (and hence arbitrarily complex), but I've only ever used one level of nesting (e.g. array(int) or map(int, varchar)).
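
For concreteness, the kind of DDL I mean (a sketch, assuming a Hive-enabled Spark session; the table and column names are made up):

sql("
create table nested_example (
  arr array<int>,        -- one-level array, i.e. array(int)
  mp  map<int, string>   -- one-level map, i.e. map(int, varchar)
)
")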

Will try to create something & pass it along. Any preferred medium?

hannes commented 5 years ago

Any medium, e.g. WeTransfer?

MichaelChirico commented 5 years ago

Yes, or Dropbox; I could try a gist...

MichaelChirico commented 5 years ago

parquet_test.tar.gz

Seems I can upload a tar.gz here! I ran the following in SparkR; attached is the compressed output:

# spark start boilerplate
library(SparkR)
library(magrittr)  # for %>%
sparkR.session()

# rename columns: '.' in names requires backtick-quoting in Spark SQL,
# so Sepal.Length becomes Sepal_Length, etc.
iris = iris
names(iris) = gsub('.', '_', names(iris), fixed = TRUE)
irisSDF = createDataFrame(iris)
irisSDF %>% createOrReplaceTempView('iris')

# one column per scalar type, plus a map and an array column
sql("
select 1 as int, 'a' as str, 1.1 as dbl,
       timestamp('2019-09-20T12:34:56Z') as ts,
       true as bool, date('2019-09-21') as dt,
       map(Species, Sepal_Length) as mp,
       array(Sepal_Width) as arr
from iris
") %>% write.parquet('/path/to/output')
hannes commented 5 years ago

Thanks, will see what I can do.