Closed kmdalton closed 2 years ago
This seems to be a known issue/decision with the handling of ExtensionDtypes
. Here's a related issue in the pandas repo: https://github.com/pandas-dev/pandas/issues/22791.
This does not seem to be an issue in our specific implementation of the various MTZDtypes
because it also occurs for the pd.Int32Dtype
(which is also implemented as an ExtensionDtype
):
import reciprocalspaceship as rs
import numpy as np
import pandas as pd
ds = rs.DataSet({
"X" : np.arange(100),
"Y" : np.arange(100)
},
dtype=pd.Int32Dtype()
)
print(f"ds.dtypes:\n{ds.dtypes}\n")
print(f"ds[['X', 'Y']].to_numpy().dtype:\n {ds[['X', 'Y']].to_numpy().dtype}")
Output:
ds.dtypes:
X Int32
Y Int32
dtype: object
ds[['X', 'Y']].to_numpy().dtype:
object
Tracing the method calls, it seems that this eventually comes to here: https://github.com/pandas-dev/pandas/blob/122d50246bcffcf8c3f252146340ac02676a5bf6/pandas/core/dtypes/cast.py#L1501
It may be possible to implement our own logic for how various dtypes
get combined by implementing our own MTZDtype._get_common_dtype(types)
method.
My proposed logic for handling this would be the following:
if
all dtypes are isinstance(dtype, MTZInt32Dtype)
and have no NaNs, use np.int32
elif
all dtypes are isinstance(dtype, MTZDtype)
, use np.float32
else
use objectSince MTZ files are all backed by float32, it seems that we should be able to safely fall back to such a case when all dtypes are MTZDtypes
also related: https://github.com/pandas-dev/pandas/issues/22224
Sometimes when converting multiple columns to
numpy
,reciprocalspaceship
converts the columns toobject dtype
.A simple example to reproduce:
outputs: