intake / akimbo

For when your data won't fit in your dataframe
https://akimbo.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Possible fix for merge-to-awkward-record-array when numpy object array present in the dataframe #19

Closed douglasdavis closed 2 years ago

douglasdavis commented 2 years ago

Going from awkward==1.10.1 to https://github.com/scikit-hep/awkward/commit/f3a94128d472f71bc3ba45c709ebc45e26c93bd6 (current HEAD of main) we can no longer do the following:

In [1]: import pandas as pd

In [2]: import awkward._v2 as ak

In [3]: df = pd.DataFrame({"a": ["one", "two"]})

In [4]: ak.Array({"a": df.a.values})
Out[4]: <Array [{a: 'one'}, {a: 'two'}] type='2 * {a: string}'>

In [5]: import awkward as ak1

In [6]: ak1.__version__
Out[6]: '1.10.1'
Doing the same on the latest commit raises a `TypeError` (long traceback):
```python
In [1]: import awkward as ak

In [2]: ak.__version__
Out[2]: '2.0.0rc1'

In [3]: import pandas as pd

In [4]: df = pd.DataFrame({"a": ["one", "two"]})

In [5]: ak.Array({"a": df.a.values})
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [5], line 1
----> 1 ak.Array({"a": df.a.values})

File ~/.pyenv/versions/3.10.7/envs/dev/lib/python3.10/site-packages/awkward/highlevel.py:218, in Array.__init__(self, data, behavior, with_name, check_valid, backend)
    216 for k, v in data.items():
    217     fields.append(k)
--> 218     contents.append(Array(v).layout)
    219     if length is None:
    220         length = len(contents[-1])

File ~/.pyenv/versions/3.10.7/envs/dev/lib/python3.10/site-packages/awkward/highlevel.py:201, in Array.__init__(self, data, behavior, with_name, check_valid, backend)
    198     behavior = ak._util.behavior_of(data, behavior=behavior)
    200 elif numpy.is_own_array(data):
--> 201     layout = ak.operations.from_numpy(data, highlevel=False)
    203 elif ak.nplikes.Cupy.is_own_array(data):
    204     layout = ak.operations.from_cupy(data, highlevel=False)

File ~/.pyenv/versions/3.10.7/envs/dev/lib/python3.10/site-packages/awkward/operations/ak_from_numpy.py:50, in from_numpy(array, regulararray, recordarray, highlevel, behavior)
      9 """
     10 Args:
     11     array (np.ndarray): The NumPy array to convert into an Awkward Array.
   (...)
     38 See also #ak.to_numpy and #ak.from_cupy.
     39 """
     40 with ak._errors.OperationErrorContext(
     41     "ak.from_numpy",
     42     dict(
   (...)
     48     ),
     49 ):
---> 50     return ak._util.from_arraylib(
     51         array, regulararray, recordarray, highlevel, behavior
     52     )

File ~/.pyenv/versions/3.10.7/envs/dev/lib/python3.10/site-packages/awkward/_util.py:709, in from_arraylib(array, regulararray, recordarray, highlevel, behavior)
    706     mask = None
    708 if not recordarray or array.dtype.names is None:
--> 709     layout = recurse(array, mask)
    711 else:
    712     contents = []

File ~/.pyenv/versions/3.10.7/envs/dev/lib/python3.10/site-packages/awkward/_util.py:670, in from_arraylib.<locals>.recurse(array, mask)
    665     data = ak.contents.RegularArray(
    666         data, array.shape[i], array.shape[i - 1]
    667     )
    669 else:
--> 670     data = ak.contents.NumpyArray(array)
    672 if mask is None:
    673     return data

File ~/.pyenv/versions/3.10.7/envs/dev/lib/python3.10/site-packages/awkward/contents/numpyarray.py:49, in NumpyArray.__init__(self, data, identifier, parameters, nplike)
     46 self._data = nplike.asarray(data)
     48 if not isinstance(nplike, ak.nplikes.Jax):
---> 49     ak.types.numpytype.dtype_to_primitive(self._data.dtype)
     50 if len(self._data.shape) == 0:
     51     raise ak._errors.wrap_error(
     52         TypeError(
     53             "{} 'data' must be an array, not a scalar: {}".format(
   (...)
     56         )
     57     )

File ~/.pyenv/versions/3.10.7/envs/dev/lib/python3.10/site-packages/awkward/types/numpytype.py:47, in dtype_to_primitive(dtype)
     45 out = _dtype_to_primitive_dict.get(dtype)
     46 if out is None:
---> 47     raise ak._errors.wrap_error(
     48         TypeError(
     49             "unsupported dtype: {}. Must be one of\n\n    {}\n\nor a "
     50             "datetime64/timedelta64 with units (e.g. 'datetime64[15us]')".format(
     51                 repr(dtype), ", ".join(_primitive_to_dtype_dict)
     52             )
     53         )
     54     )
     55 return out

TypeError: while calling (from /Users/ddavis/.pyenv/versions/3.10.7/envs/dev/lib/python3.10/site-packages/awkward/highlevel.py, line 201)

    ak.from_numpy(
        array = numpy.ndarray(['one' 'two'])
        regulararray = False
        recordarray = True
        highlevel = False
        behavior = None
    )

Error details: unsupported dtype: dtype('O'). Must be one of

    bool, int8, uint8, int16, uint16, int32, uint32, int64, uint64, float32, float64, complex64, complex128, datetime64, timedelta64, float16

or a datetime64/timedelta64 with units (e.g. 'datetime64[15us]')
```

This PR just wraps the object arrays in a Python `iter` call. It works, but I'm wondering if there is a better solution. cc @martindurant @jpivarski
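
For illustration, the workaround described above might look roughly like the sketch below. It is not the actual diff in this PR: `wrap_object_arrays` is a hypothetical helper name, and the example only assumes NumPy so that it runs on its own.

```python
import numpy as np


def wrap_object_arrays(columns):
    """Return a copy of `columns` with object-dtype arrays wrapped in iter().

    Hypothetical helper for illustration; not part of akimbo or awkward.
    """
    wrapped = {}
    for name, values in columns.items():
        if isinstance(values, np.ndarray) and values.dtype == np.dtype("O"):
            # Object arrays have no primitive dtype, so hand over a plain
            # Python iterator and let the consumer inspect each element.
            wrapped[name] = iter(values)
        else:
            # Arrays with a primitive dtype are passed through untouched.
            wrapped[name] = values
    return wrapped


columns = {
    "a": np.array(["one", "two"], dtype=object),  # like df.a.values above
    "b": np.array([1, 2]),
}
prepared = wrap_object_arrays(columns)
```

The idea is that `prepared` would then be passed to `ak.Array(...)` in place of the raw dict, so that the object column is no longer sent to `ak.from_numpy`. Note that an iterator can only be consumed once, so the wrapped dict should be built right before the conversion.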