sorry to bother, I'm trying to figure out if/how it's possible to create an empty Record with a specific datatype, to be able to save in parquet.
basically, I have something like this, which works
import awkward
import numpy as np
j1 = awkward.from_numpy(np.ones(1, np.int32))
r = awkward.Record({"d": j1})
awkward.to_parquet(r, "test.parquet")
but sometimes, depending on the data, the array j1 is empty, in which case, the export to parquet fails
j1 = awkward.from_numpy(np.empty(0, np.int32))
r = awkward.Record({"d": j1})
awkward.to_parquet(r, "test.parquet")
# fails with "NullType Arrow field must be nullable"
what's the right way to fix this?
maybe this is a bit clearer
# how to do this, specifying the datatype as above?
j1 = awkward.from_iter([[], []])
awkward.to_parquet({"d": j1}, "test.parquet")
My response
@jpata You've found some quirks in how ak.Records get constructed that should get fixed before the API gets frozen today or tomorrow (in the 2.0.0 release). So, good timing!
What's weird about these records is their data type. You want it to be integer type with zero entries, but it's an unknown type. The reason for that is that the ak.Record(dict(...)) constructor is iterating over the data in the dict because it sees it as generic Python objects to be interpreted with ak.from_iter. With generic Python objects, if a list is empty, the type of the data in that list is unknown.
By contrast, the ak.Array constructor recognizes "dict of arrays" as a special case, in which the arrays are taken to be columns. We call this the "Pandas-style constructor" because it's what you'd expect when constructing a Pandas DataFrame. Arbitrary data in an ak.Array constructor (neither an array nor a dict of arrays, but some other Python objects, including lists) invokes ak.from_iter.
But we should add a special case to the ak.Record constructor to match the special case in the ak.Array constructor, so that ak.Record({"d": j1}) gives you an ak.Record with a field that is a length-zero list of integers. The case for doing this for ak.Record is even stronger than the case for doing it with ak.Array, since the Pandas-style ak.Array constructor takes data in struct-of-arrays (SOA) form and makes it (virtually) array-of-structs (AOS), a change in structure, but there would be no difference for an equivalent ak.Record constructor (there's no "A" here).
The next step, actually writing this to Parquet, works, but the subsequent step, reading it back with ak.from_parquet, doesn't because of a pyarrow.lib.ArrowInvalid error. It might be a missing case in pyarrow: Parquet files with only one record in them are weird. That's another thing that I'll look into, though it might land in version 2.0.1 or 2.0.2. (It's not an API-changing thing.)
From @jpata on https://gitter.im/Scikit-HEP/awkward-array