scikit-hep / awkward

Manipulate JSON-like data with NumPy-like idioms.
https://awkward-array.org
BSD 3-Clause "New" or "Revised" License

`ak.records_to_regular` to convert `[{"x": 1, "y": 2}, {"x": 3, "y": 4}]` into `[[1, 2], [3, 4]]` #3257

Open jpivarski opened 3 weeks ago

jpivarski commented 3 weeks ago

Description of new feature

Awkward Array's idiomatic form for data points with named features is to use RecordArray, which keeps each record field in a separate array (useful for loading or working with a subset of columns).

Machine learning libraries like to see a feature set (an input vector into a neural network) as a regular dimension, either RegularArray or NumpyArray with inner_shape != () (which become the same thing after conversion out of Awkward). Unlike in a RecordArray, the different features of the same vector are contiguous in memory.

Also unlike a RecordArray, the elements of a feature vector have no names. I do not know if there's a way to preserve these feature names, in PyTorch for instance, but it would be nice to do so in a conversion from Awkward Arrays into PyTorch Tensors.

For an array in which the records are one level deep,

>>> array = ak.Array([[{"pt": 0.0, "eta": 1.1}, {"pt": 2.2, "eta": 3.3}], [], [{"pt": 4.4, "eta": 5.5}]])

can be implemented as

>>> ak.unflatten(ak.concatenate(ak.unzip(array), axis=1), 2, axis=1)
<Array [[[0, 2.2], [1.1, 3.3]], ..., [[4.4, ...]]] type='3 * var * 2 * float64'>

but we're interested in a function that can be applied regardless of how deep the first level of records is. It would be written with recursively_apply. At some level of recursively_apply, you'd have passed through the list-type node and would be seeing the RecordArray directly:

>>> array = ak.Array([{"pt": 0.0, "eta": 1.1}, {"pt": 2.2, "eta": 3.3}, {"pt": 4.4, "eta": 5.5}])

and then you'd want to do something like

>>> ak.concatenate([x[:, np.newaxis] for x in ak.unzip(array)], axis=1)
<Array [[0, 1.1], [2.2, 3.3], [4.4, 5.5]] type='3 * 2 * float64'>

(This preserves the length, 3, so it's good for recursively_apply.)

This function would be useful for Awkward → ML conversions regardless of whether the data are ragged or not.

If RecordArrays are nested within each other, this function can be applied multiple times, turning one level of record-type into a dimension on each application.

jpivarski commented 2 weeks ago

Cc: @livaage, @GageDeZoort, @maxymnaumchyk, @ianna