phofl closed 1 week ago
thanks @phofl
I don't think we can do this when selecting multiple column names: if the column names are boolean, trying to select columns by name will filter rows by a mask instead
But we can do it for:
Duplicated columns don't work in Dask fwiw, which makes the boolean case basically non-existent?
there don't need to be duplicates, you could have two columns called [True, False], no?
We would remove that selection because its length equals the number of columns, which makes it a no-op
There is technically [True] and [False] though
this is what i mean:
In [12]: def func(df, column_names):
    ...:     return df[column_names]
    ...:

In [13]: func(pd.DataFrame({True: [1,2], False: [4,5]}), [True, False])
Out[13]:
   True  False
0     1      4

In [14]: func(pd.DataFrame({'a': [1,2], 'b': [4,5]}), ['a', 'b'])
Out[14]:
   a  b
0  1  4
1  2  5
having said that, the dtype of df.columns is always known (and inexpensive) in Dask, right?
So, could we just do:
if df.columns.dtype.kind != 'b':
    return df[column_names]
else:
    return df.loc[:, column_names]
?
(we could probably make such a utility and reuse it for pandas backend too)
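For what it's worth, here is a rough sketch of what such a utility could look like on the pandas side. `select_columns` is a made-up name, not an existing pandas, Dask, or Narwhals function. One deliberate difference from the snippet above: as far as I can tell, `.loc` also reads a list of booleans as a mask, so the boolean branch here resolves the labels to positions with `Index.get_indexer` and goes through `.iloc` instead. That assumes the column labels are unique.

```python
import pandas as pd


def select_columns(df, column_names):
    """Select columns by name without a list of bools being read as a mask.

    Hypothetical helper for illustration only; assumes unique column labels.
    """
    if df.columns.dtype.kind != "b":
        # Non-boolean labels: a list passed to getitem is read as column names.
        return df[column_names]
    # Boolean labels: look up each label's position, then select by position
    # so the list of bools is never interpreted as a mask.
    positions = df.columns.get_indexer(column_names)
    if (positions == -1).any():
        missing = [n for n, p in zip(column_names, positions) if p == -1]
        raise KeyError(f"columns not found: {missing}")
    return df.iloc[:, positions]


bool_df = pd.DataFrame({True: [1, 2], False: [4, 5]})
str_df = pd.DataFrame({"a": [1, 2], "b": [4, 5]})

bool_result = select_columns(bool_df, [True, False])  # keeps both rows
str_result = select_columns(str_df, ["a", "b"])
```

The dtype check only looks at `df.columns`, so it stays cheap in the sense discussed above; the positional branch only runs in the rare boolean-label case.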
Good point
Yeah that would work. Boolean columns in Dask don't really work (I tried to select something just now and it keeps raising 😅), so you also don't need to worry much about boolean columns
But yes, accessing the dtype of the columns is cheap
Describe the bug
loc isn't very well supported in Dask, since all the special implementations that pandas has don't work in a distributed environment. I would generally recommend just using getitem if you want to either select columns or apply a filter. Currently, using loc blocks column projections for some reason, and we don't have many tests for loc-specific cases.
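To make the recommendation concrete, here is a small pandas-only sketch showing that the two jobs usually given to loc (column selection and row filtering) each have a plain getitem spelling. The frame and variable names are invented for illustration, and it assumes the same getitem spellings are what you would write against a Dask DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Column selection: getitem with a list of names vs. the loc spelling.
cols_getitem = df[["a"]]
cols_loc = df.loc[:, ["a"]]

# Row filter: getitem with a boolean Series vs. the loc spelling.
rows_getitem = df[df["a"] > 1]
rows_loc = df.loc[df["a"] > 1]

# In pandas both spellings give the same frames.
same_columns = cols_getitem.equals(cols_loc)
same_rows = rows_getitem.equals(rows_loc)
```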
Steps or code to reproduce the bug
Run q1 for Dask, but call .optimize(fuse=False).pprint() instead of .compute()
Expected results
The ReadParquetFSSpec expression should be restricted to a subset of the columns
Actual results
Please run narwhals.show_version() and enter the output below.
Relevant log output
No response