Question regarding reading partial data

hakonmh / featherstore

High performance datastore built upon Apache Arrow & Feather

MIT License


Question regarding reading partial data #3

Status: Closed (frankvgompel closed this 2 years ago)

frankvgompel commented 2 years ago

Hi Håkon,

Could you give an example of the filter-predicates for reading partial data? For instance I am trying to return column 12 based on a filter in column 11.

In arrow dataset this would probably be something like:

dataset.to_table(columns=['12 - Snomed code 1'], filter=ds.field('11 - FHIR code') == 'Procedure').to_pandas()

so I unsuccessfully tried variations of:

store.read_polars('mps_table', cols=[('12 - Snomed code 1')], rows=[('11 - FHIR code'== 'Procedure')])

hakonmh commented 2 years ago

Hi,

FeatherStore doesn't currently support row-based predicate filtering on columns other than the index. Secondary indices are on the list of features I want to add in the future.

Meanwhile, you can filter rows based on values in a specific non-index column like this:

df = store.read_pandas('mps_table', cols=['11 - FHIR code'])
index = df[df == 'Procedure'].index
store.read_polars('mps_table', cols=['12 - Snomed code 1'], rows=index)

The first query reads only the column '11 - FHIR code' into memory, while the second query reads only the column '12 - Snomed code 1' and skips partitions that contain none of the rows specified in index.

frankvgompel commented 2 years ago

Thanks! I used a workaround since I couldn't read directly into pandas, probably because the Feather files were written from polars. I got this error:

[...] File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/dtypes/common.py", line 1777, in pandas_dtype
    npdtype = np.dtype(dtype)
TypeError: data type 'large_string' not understood

Also had to tweak the index line.

df = store.read_polars('mps_table', cols=['11 - FHIR code'])
df2 = df.to_pandas()
index = df2[df2['11 - FHIR code'] == 'Procedure'].index
store.read_polars('mps_table', cols=['12 - Snomed code 1'], rows=index)
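
For anyone wondering why the index line needs the column selected explicitly, here is a small pandas-only sketch with hypothetical sample data. Masking a whole DataFrame with `df == value` keeps every row and only blanks out the non-matching cells, so `.index` still holds all labels. Masking with a boolean Series built from a single column drops the non-matching rows.

```python
import pandas as pd

# Hypothetical sample data with a non-default index.
df = pd.DataFrame(
    {"11 - FHIR code": ["Procedure", "Observation", "Procedure"]},
    index=[10, 11, 12],
)

# Masking with a boolean DataFrame keeps every row (non-matches become NaN),
# so the index still contains all labels.
all_labels = df[df == "Procedure"].index

# Masking with a boolean Series from one column drops the non-matching rows.
matching_labels = df[df["11 - FHIR code"] == "Procedure"].index

print(list(all_labels))       # [10, 11, 12]
print(list(matching_labels))  # [10, 12]
```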

If you find the time, a bit more explanation in the docs about predicate filtering would be helpful. Looking forward to your future features.