man-group / ArcticDB

ArcticDB is a high performance, serverless DataFrame database built for the Python Data Science ecosystem.
http://arcticdb.io

Allow QueryBuilder Aggregates to be Applied to Whole Columns #1811

Open DrNickClarke opened 2 months ago

DrNickClarke commented 2 months ago

A simple example would be to get the max value in a column without reading all the data.

Missing data (NaNs) should be ignored.

The current workaround is to create a synthetic column with a fixed value, group by that column, and apply the aggregator.

This works well, but the intent is not clear from the syntax.

An example of the workaround is:

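import numpy as np
import pandas as pd
import arcticdb as adb

# `lib` is assumed to be an already-opened ArcticDB library
np.random.seed(13)
qb_whole_col_df = pd.DataFrame(data={'val': np.random.uniform(0., 100., 25)})
qb_whole_col_sym = 'qb_whole_col_sym'
lib.write(qb_whole_col_sym, qb_whole_col_df)
# Add a constant column 'zero' so every row lands in one group, then aggregate that group
q_wc = adb.QueryBuilder()
q_wc = q_wc.apply('zero', q_wc['val']*0).groupby('zero').agg({'val': 'max'})
lib.read(qb_whole_col_sym, query_builder=q_wc).data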
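
To sanity-check what the workaround should return, the same idea can be mirrored in plain pandas. This is only a sketch of the requested semantics (whole-column max, NaNs ignored), not ArcticDB behaviour: the two injected NaNs and the names `expected_max`, `grouped` and `workaround_max` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Same synthetic data as the workaround above, with two NaNs injected
# (the NaN positions are an assumption, chosen only for illustration).
np.random.seed(13)
df = pd.DataFrame(data={'val': np.random.uniform(0., 100., 25)})
df.loc[[3, 7], 'val'] = np.nan

# Requested behaviour: aggregate the whole column, ignoring missing data.
# pandas skips NaN by default in Series.max().
expected_max = df['val'].max()

# The workaround's shape in pandas: a constant key column puts every row
# in a single group, so the group aggregate equals the column aggregate.
# A literal 0 is used as the key here instead of val * 0, because
# NaN * 0 is NaN and pandas drops NaN group keys by default.
grouped = df.assign(zero=0).groupby('zero').agg({'val': 'max'})
workaround_max = grouped['val'].iloc[0]

print(expected_max == workaround_max)
```

The single-group result has one row, which is the shape a whole-column aggregate would be expected to produce.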

In the future, we will make this possible with cleaner syntax.