rs-station / reciprocalspaceship

Tools for exploring reciprocal space
https://rs-station.github.io/reciprocalspaceship/
MIT License

Integer-backed columns with NaNs cannot be converted to float dtype #144

Closed JBGreisman closed 2 years ago

JBGreisman commented 2 years ago

Minimal example:

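import numpy as np
import reciprocalspaceship as rs

ds = rs.DataSet({"int_col": [0, 1, 2, 3]}, dtype="MTZInt")
ds.loc[0, "int_col"] = np.nan  # introduce a missing value into the integer column
print(ds["int_col"].to_numpy(float))  # raises the TypeError shown below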

Traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/Documents/Hekstra_Lab/github/reciprocalspaceship/reciprocalspaceship/commandline/mtzdump.py in <module>
----> 1 ds["int_col"].to_numpy(float)

~/miniconda3/envs/rs/lib/python3.8/site-packages/pandas/core/base.py in to_numpy(self, dtype, copy, na_value, **kwargs)
    511         if is_extension_array_dtype(self.dtype):
    512             # error: Too many arguments for "to_numpy" of "ExtensionArray"
--> 513             return self.array.to_numpy(  # type: ignore[call-arg]
    514                 dtype, copy=copy, na_value=na_value, **kwargs
    515             )

~/Documents/Hekstra_Lab/github/reciprocalspaceship/reciprocalspaceship/dtypes/base.py in to_numpy(self, dtype, copy, na_value)
    122         if self._hasna:
    123             data = self._data.astype(dtype, copy=copy)
--> 124             data[self._mask] = na_value
    125         else:
    126             data = self._data.astype(dtype, copy=copy)

TypeError: float() argument must be a string or a number, not 'NAType'

This error comes from our overloading of the pandas to_numpy() method: when na_value is left at its default, the masked entries are filled with pd.NA, which cannot be written into a float array. It can be fixed by changing the default na_value to np.nan. This is a safe assumption here, because all MTZIntegerArray-backed datatypes have to be compatible with float32 dtypes by construction.
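For reference, the failure can be reproduced without reciprocalspaceship. The sketch below mirrors the two lines flagged in the traceback (cast the data, then fill the masked entries); `masked_to_numpy` is a hypothetical helper written for illustration, not the library's actual method, and it only assumes that the fill value defaults to np.nan as proposed above.

```python
import numpy as np
import pandas as pd


def masked_to_numpy(data, mask, dtype=float, na_value=np.nan):
    """Hypothetical stand-in for the masked-array conversion in the traceback."""
    out = data.astype(dtype, copy=True)  # cast the integer data to the target dtype
    out[mask] = na_value                 # fill masked (missing) entries
    return out


data = np.array([0, 1, 2, 3], dtype=np.int32)
mask = np.array([True, False, False, False])  # first entry is missing

# Filling a float array with pd.NA fails, as in the traceback.
try:
    masked_to_numpy(data, mask, na_value=pd.NA)
    raised = False
except TypeError:
    raised = True

# Filling with np.nan works for any float dtype.
result = masked_to_numpy(data, mask)
```

With np.nan as the fill value, the masked entry becomes NaN and the remaining integers are cast to float unchanged.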

I have a local fix implemented, and will make a PR shortly -- just posting this issue to log the error.

JBGreisman commented 2 years ago

This issue ends up affecting the round-tripping of MTZ files to/from gemmi if they contain NaNs in integer fields. This is very common for phenix output, which often contains NaN entries in the R-free-flags for reflections that were filled (not observed).