scikit-hep / awkward

Manipulate JSON-like data with NumPy-like idioms.
https://awkward-array.org
BSD 3-Clause "New" or "Revised" License

ak.Record should have the same Pandas-style constructor that ak.Array has #1978

Closed jpivarski closed 1 year ago

jpivarski commented 1 year ago

From @jpata on https://gitter.im/Scikit-HEP/awkward-array

Sorry to bother you: I'm trying to figure out whether (and how) it's possible to create an empty Record with a specific data type, so that I can save it in Parquet.

Basically, I have something like this, which works:

import awkward
import numpy as np

j1 = awkward.from_numpy(np.ones(1, np.int32))
r = awkward.Record({"d": j1})
awkward.to_parquet(r, "test.parquet")

but sometimes, depending on the data, the array j1 is empty, in which case the export to Parquet fails:

j1 = awkward.from_numpy(np.empty(0, np.int32))
r = awkward.Record({"d": j1})
awkward.to_parquet(r, "test.parquet")
# fails with "NullType Arrow field must be nullable"

what's the right way to fix this?

Maybe this is a bit clearer:

# works
j1 = awkward.from_iter([[1], [2]])
awkward.to_parquet({"d": j1}, "test.parquet")

# how to do this, specifying the data type as above?
j1 = awkward.from_iter([[], []])
awkward.to_parquet({"d": j1}, "test.parquet")

My response

@jpata You've found some quirks in how ak.Records get constructed that should get fixed before the API gets frozen today or tomorrow (in the 2.0.0 release). So, good timing!

What's weird about these records is their data type. You want it to be an integer type with zero entries, but it comes out as an unknown type. The reason is that the ak.Record(dict(...)) constructor iterates over the data in the dict: it sees the values as generic Python objects to be interpreted with ak.from_iter, and with generic Python objects, if a list is empty, the type of the data in that list is unknown.

>>> j1 = ak.from_numpy(np.empty(0, np.int32))
>>> j1
<Array [] type='0 * int32'>
>>> ak.Record({"d": j1})
<Record {d: []} type='{d: var * unknown}'>

By contrast, the ak.Array constructor recognizes "dict of arrays" as a special case, in which the arrays are taken to be columns. We call this the "Pandas-style constructor" because it's what you'd expect when constructing a Pandas DataFrame. Arbitrary data in an ak.Array constructor (neither an array nor a dict of arrays, but some other Python objects, including lists) invokes ak.from_iter.

>>> ak.Array({"d": j1})
<Array [] type='0 * {d: int32}'>

So you could get an ak.Record with a field that is a length-zero list of integers like this:

>>> ak.Array({"d": j1[np.newaxis]})[0]
<Record {d: []} type='{d: 0 * int32}'>

But we should add a special case to the ak.Record constructor to match the special case in the ak.Array constructor, so that you can do this with ak.Record({"d": j1}). The case for doing this in ak.Record is even stronger than in ak.Array: the Pandas-style ak.Array constructor takes data in struct-of-arrays (SoA) form and makes it (virtually) array-of-structs (AoS), a change in structure, but there would be no difference for an equivalent ak.Record constructor (there's no "A" here).

The next step, actually writing this to Parquet, works:

>>> ak.to_parquet(ak.Array({"d": j1[np.newaxis]})[0], "/tmp/test.parquet")
<pyarrow._parquet.FileMetaData object at 0x7fa2b81772c0>
  created_by: parquet-cpp-arrow version 9.0.0
  num_columns: 1
  num_rows: 1
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 0

but the subsequent step, reading it back with ak.from_parquet, doesn't because of a pyarrow.lib.ArrowInvalid error. It might be a missing case in pyarrow: Parquet files with only one record in them are weird. That's another thing that I'll look into, though it might land in version 2.0.1 or 2.0.2. (It's not an API-changing thing.)

jpivarski commented 1 year ago

I'm stealing this back because I had already started and it will be very quick.