rs-station / reciprocalspaceship

Tools for exploring reciprocal space
https://rs-station.github.io/reciprocalspaceship/
MIT License
29 stars 13 forks source link

Invoking to_numpy on 2D array returns object dtype #33

Closed kmdalton closed 2 years ago

kmdalton commented 3 years ago

Sometimes when converting multiple columns to numpy, reciprocalspaceship converts the columns to object dtype.

A simple example to reproduce:

import reciprocalspaceship as rs
import numpy as np

ds = rs.DataSet({
    "X" : np.random.random(100),
    "Y" : np.random.random(100),
}).infer_mtz_dtypes()

print(f"ds.dtypes:\n{ds.dtypes}\n")
print(f"ds['X'].to_numpy().dtype:\n  {ds['X'].to_numpy().dtype}\n")
print(f"ds['Y'].to_numpy().dtype:\n  {ds['Y'].to_numpy().dtype}\n")
print(f"ds[['X', 'Y']].to_numpy().dtype:\n  {ds[['X', 'Y']].to_numpy().dtype}")

outputs:

ds.dtypes:
X    MTZReal
Y    MTZReal
dtype: object

ds['X'].to_numpy().dtype:
  float32

ds['Y'].to_numpy().dtype:
  float32

ds[['X', 'Y']].to_numpy().dtype:
  object
JBGreisman commented 3 years ago

This seems to be a known issue/decision with the handling of ExtensionDtypes. Here's a related issue in the pandas repo: https://github.com/pandas-dev/pandas/issues/22791.

This does not seem to be an issue in our specific implementation of the various MTZDtypes because it also occurs for the pd.Int32Dtype (which is also implemented as an ExtensionDtype):

import reciprocalspaceship as rs
import numpy as np
import pandas as pd

ds = rs.DataSet({
    "X" : np.arange(100),
    "Y" : np.arange(100)
    },
    dtype=pd.Int32Dtype()
)

print(f"ds.dtypes:\n{ds.dtypes}\n")
print(f"ds[['X', 'Y']].to_numpy().dtype:\n  {ds[['X', 'Y']].to_numpy().dtype}")

Output:

ds.dtypes:
X    Int32
Y    Int32
dtype: object

ds[['X', 'Y']].to_numpy().dtype:
  object
JBGreisman commented 3 years ago

Tracing the method calls, it seems that this eventually comes to here: https://github.com/pandas-dev/pandas/blob/122d50246bcffcf8c3f252146340ac02676a5bf6/pandas/core/dtypes/cast.py#L1501

It may be possible to implement our own logic for how various dtypes get combined by implementing our own MTZDtype._get_common_dtype(types) method.

My proposed logic for handling this would be the following:

Since MTZ files are all backed by float32, it seems that we should be able to safely fall back to such a case when all dtypes are MTZDtypes

JBGreisman commented 3 years ago

also related: https://github.com/pandas-dev/pandas/issues/22224