ironSource / parquetjs

fully asynchronous, pure JavaScript implementation of the Parquet file format
MIT License

Statistics #53

Open ZJONSSON opened 6 years ago

ZJONSSON commented 6 years ago

Subsequent to https://github.com/ironSource/parquetjs/pull/52

Calculate statistics for each page and each column, including: max_value, min_value, null_count, distinct_count. For any columns that are sorted, the statistics at either the column level or the page level allow skipping over sections that are not of interest.
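
To make the idea concrete, here is a minimal sketch of per-page statistics and of how they could be used to skip pages. It is not taken from the PR: `computeStatistics` and `pageMayContain` are hypothetical helper names, pages are modeled as plain arrays of already-decoded values, and values are assumed to be comparable with `<` and `>`.

```javascript
// Hypothetical helper, not part of the parquetjs API.
// Computes the four statistics named above for one page of decoded values.
function computeStatistics(values) {
  let min;
  let max;
  let nullCount = 0;
  const distinct = new Set();

  for (const v of values) {
    // Nulls are counted but excluded from min/max and distinct_count.
    if (v === null || v === undefined) {
      nullCount++;
      continue;
    }
    distinct.add(v);
    if (min === undefined || v < min) min = v;
    if (max === undefined || v > max) max = v;
  }

  return {
    min_value: min,
    max_value: max,
    null_count: nullCount,
    distinct_count: distinct.size,
  };
}

// Hypothetical helper: a page can be skipped when the target value
// falls outside its [min_value, max_value] range.
function pageMayContain(stats, target) {
  if (stats.min_value === undefined) return false; // page holds only nulls
  return target >= stats.min_value && target <= stats.max_value;
}

// Three pages of a sorted column; only the second can contain the value 7.
const pages = [[1, 2, 2, null], [5, 7, 9], [12, null, 15]];
const pageStats = pages.map(computeStatistics);
const candidates = pageStats
  .map((stats, index) => (pageMayContain(stats, 7) ? index : -1))
  .filter((index) => index >= 0);

console.log(candidates); // [ 1 ]
```

The same range check would apply at the column-chunk level, using statistics aggregated over all pages of the chunk.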

ZJONSSON commented 6 years ago

Improved tests required: they should capture statistics that differ across pages and row_groups, and should include null_values and unique_value counts.

ZJONSSON commented 6 years ago

Not ready to merge: max_value and min_value have to be encoded with the column encoding.
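
As an illustration of what storing encoded min/max values could look like, the sketch below serializes a single value to bytes. It assumes a PLAIN, little-endian encoding and raw UTF-8 bytes for strings; `encodeStatValue` is a hypothetical helper, not the function used in the PR, and only a few primitive types are covered.

```javascript
// Hypothetical helper, not part of the parquetjs API.
// Assumption: values are serialized PLAIN-style (fixed-width little-endian
// for numbers, raw bytes with no length prefix for byte arrays).
function encodeStatValue(type, value) {
  switch (type) {
    case 'INT32': {
      const buf = Buffer.alloc(4);
      buf.writeInt32LE(value, 0);
      return buf;
    }
    case 'INT64': {
      const buf = Buffer.alloc(8);
      buf.writeBigInt64LE(BigInt(value), 0);
      return buf;
    }
    case 'DOUBLE': {
      const buf = Buffer.alloc(8);
      buf.writeDoubleLE(value, 0);
      return buf;
    }
    case 'BYTE_ARRAY':
      return Buffer.from(value, 'utf8');
    default:
      throw new Error(`unsupported type: ${type}`);
  }
}

// 258 is 0x00000102, so the little-endian bytes are 02 01 00 00.
console.log(encodeStatValue('INT32', 258).toString('hex')); // 02010000
```

A reader would apply the matching decode step before comparing a query value against the stored min_value and max_value.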

hadrienk commented 5 years ago

Hi,

I see this PR has been pending for almost a year now. Do you need any help? I can test locally or contribute if there's more to do.

dobesv commented 4 years ago

Is there anything I could do to help with this PR?