Closed kmdalton closed 2 years ago
In my mind, this behavior was defined by pandas, so I hesitate to overload the method to change it. For example, this method also doesn't differentiate between np.int64 and pd.Int64Dtype (the nullable pandas int64 implementation):
In [17]: df = pd.DataFrame(np.arange(12).reshape(3, 4),
...: columns=['A', 'B', 'C', 'D'])
...: df["A"] = df["A"].astype(pd.Int64Dtype())
In [18]: df.dtypes
Out[18]:
A Int64 <----- Note capitalization
B int64
C int64
D int64
dtype: object
In [19]: df.select_dtypes(pd.Int64Dtype())
Out[19]:
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
I do agree that their documentation doesn't make it clear that this operates to return columns based on the underlying numpy dtype, though. It is always possible to get this behavior with a list comprehension, so if we do want this sort of method, I would rather implement it as a custom DataSet method rather than overload the DataFrame one. Something like this, but with added support for passing the dtype as a str or an object:
def select_mtzdtype(self, dtype):
    # Keep only the columns whose dtype is an instance of the given dtype class
    return self[[k for k in self if isinstance(self.dtypes[k], dtype)]]
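As a rough, pandas-only sketch of the "str or object" support mentioned above: the helper name select_exact_dtype is hypothetical and is not part of pandas or rs. It assumes that pandas.api.types.is_dtype_equal is an acceptable way to compare a column dtype against a string or dtype instance, and it uses pd.Int64Dtype as a stand-in for a custom MTZDtype.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import ExtensionDtype
from pandas.api.types import is_dtype_equal


def select_exact_dtype(df, dtype):
    """Return the columns of df whose dtype matches `dtype` exactly.

    Hypothetical helper. `dtype` may be an extension dtype class
    (e.g. pd.Int64Dtype), a dtype instance, a numpy type, or a string
    name such as "Int64".
    """
    if isinstance(dtype, type) and issubclass(dtype, ExtensionDtype):
        # Extension dtype class: keep columns whose dtype is an instance of it.
        keep = [k for k, dt in df.dtypes.items() if isinstance(dt, dtype)]
    else:
        # String, numpy type, or dtype instance: let pandas compare them.
        keep = [k for k, dt in df.dtypes.items() if is_dtype_equal(dt, dtype)]
    return df[keep]


# Same frame as in the session above, with an explicit int64 base dtype.
df = pd.DataFrame(
    np.arange(12, dtype=np.int64).reshape(3, 4),
    columns=["A", "B", "C", "D"],
)
df["A"] = df["A"].astype(pd.Int64Dtype())

print(list(select_exact_dtype(df, "Int64").columns))
print(list(select_exact_dtype(df, pd.Int64Dtype).columns))
print(list(select_exact_dtype(df, np.int64).columns))
```

Unlike select_dtypes, this compares each column's dtype object directly, so the nullable Int64 column and the plain int64 columns end up in separate selections.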
I'm totally happy with your proposed solution.
pd.DataFrame has a method select_dtypes which returns columns matching a particular numpy dtype. In the context of rs it'd be natural for this to support differentiating custom MTZDtype's. However, this is not the case right now.
Given an example mtz file and its column dtypes, rs.DataSet.select_dtypes appears to fall back to the numpy dtype. For instance, when I call mtz.select_dtypes("G"), I expect rs to return a DataSet or view containing only the "F(+)" and "F(-)" columns. Instead, I get all the columns backed by np.float32, which is all columns in this case.
Making this behave as expected either requires a change to the underlying pandas method or overloading the method in rs. From this perspective, it might be better to raise this issue with the pandas devs. Not sure.