Closed frankvgompel closed 2 years ago
Hi,
Featherstore doesn't currently support row-based predicate filtering on columns other than the index. Secondary indices are on the list of features I want to add in the future.
Meanwhile, you can filter rows based on values in a specific non-index column like this:
df = store.read_pandas('mps_table', cols=['11 - FHIR code'])
index = df[df == 'Procedure'].index
store.read_polars('mps_table', cols=['12 - Snomed code 1'], rows=index)
The first query will only read the column '11 - FHIR code' into memory, while the second query will only read the column '12 - Snomed code 1' and skip reading partitions that don't contain rows specified in index.
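To illustrate what "skip reading partitions" means here, the following is a minimal sketch in plain Python. It is not Featherstore's actual implementation: the `Partition` class and the `partitions_to_read` helper are hypothetical names, and the sketch assumes each partition records the smallest and largest index value it holds.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass
class Partition:
    """Hypothetical stand-in for one stored partition of a table."""
    name: str
    min_index: int  # smallest index value stored in the partition
    max_index: int  # largest index value stored in the partition


def partitions_to_read(partitions, wanted_rows):
    """Return the partitions whose index range contains at least one wanted row."""
    wanted = sorted(wanted_rows)
    selected = []
    for part in partitions:
        # Position of the first wanted row that is >= the partition's lower bound.
        pos = bisect_left(wanted, part.min_index)
        # The partition is needed only if that row is also within the upper bound.
        if pos < len(wanted) and wanted[pos] <= part.max_index:
            selected.append(part)
    return selected


partitions = [
    Partition("part_0", 0, 99),
    Partition("part_1", 100, 199),
    Partition("part_2", 200, 299),
]
# Row labels found by the first query (illustrative values only).
index = [5, 42, 250]
needed = partitions_to_read(partitions, index)
print([p.name for p in needed])  # part_1 is skipped: no wanted row falls in 100..199
```

The fewer partitions the matching rows are spread across, the more a lookup like this can save; if matches are scattered over every partition, nothing is skipped.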
Thanks! I used a workaround since I couldn't read directly into pandas, probably because the Feather files were written from Polars. I got this error:
[...] File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/dtypes/common.py", line 1777, in pandas_dtype
npdtype = np.dtype(dtype)
TypeError: data type 'large_string' not understood
Also had to tweak the index line.
df = store.read_polars('mps_table', cols=['11 - FHIR code'])
df2 = df.to_pandas()
index = df2[df2['11 - FHIR code'] == 'Procedure'].index
store.read_polars('mps_table', cols=['12 - Snomed code 1'], rows=index)
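For anyone wondering why the index line needed the tweak: in pandas, indexing a DataFrame with a boolean DataFrame masks individual cells instead of dropping rows, so every row label survives. A small sketch with made-up data (not the actual table) shows the difference:

```python
import pandas as pd

# Made-up sample data; the real table's contents are not shown in this thread.
df = pd.DataFrame({"11 - FHIR code": ["Procedure", "Condition", "Procedure"]})

# Boolean DataFrame mask: non-matching cells become NaN, but all rows are kept.
all_rows = df[df == "Procedure"].index

# Boolean Series mask on one column: only the matching rows are kept.
matching_rows = df[df["11 - FHIR code"] == "Procedure"].index

print(list(all_rows))       # [0, 1, 2]
print(list(matching_rows))  # [0, 2]
```

Passing the first index as rows would therefore request every row, which is why the per-column comparison is the one to use.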
If you find the time, a bit more explanation in the docs about predicate filtering would be helpful. Looking forward to your future features.
Hi Håkon,
Could you give an example of filter predicates for reading partial data? For instance, I am trying to return column 12 based on a filter on column 11.
With an Arrow dataset, this would probably be something like:
dataset.to_table(columns=['12 - Snomed code 1'], filter=ds.field('11 - FHIR code') == 'Procedure').to_pandas()
so I unsuccessfully tried variations of:
store.read_polars('mps_table', cols=[('12 - Snomed code 1')], rows=[('11 - FHIR code'== 'Procedure')])