scikit-hep / awkward-0.x

Manipulate arrays of complex data structures as easily as Numpy.
BSD 3-Clause "New" or "Revised" License
215 stars 39 forks source link

Add LargeListArray, LargeBinaryArray, LargeStringArray to pyarrow serialization options. #222

Closed lgray closed 4 years ago

lgray commented 4 years ago

There's now a 64-bit indexed ListArray, StringArray, and BinaryArray types in pyarrow.

I've put in a defaulted-false argument in toarrow() where the user can drive the index type used when serializing jagged arrays.

lgray commented 4 years ago

There's something still taking time within assigning starts and stops.

lgray commented 4 years ago

Ah, CI is failing due to needing to have pyarrow deeper into the mix... I'm fine with waiting for awkward1 for this one if it's too much of a hassle.

jpivarski commented 4 years ago

I got your comments after writing mine. Let me know if you decide to cancel this PR.

nsmith- commented 4 years ago

maybe take the offsets passthrough from mine, where I handle the soft type check?

lgray commented 4 years ago

I'll rebase this on master when #221 gets merged.

lgray commented 4 years ago

Ah, there appears to be a large string/unicode/bytes type as well. I will get to those too.

lgray commented 4 years ago

@jpivarski I think I can separate this PR entirely from Nick's. I'll fix that up, is there anything else you want before merge?

jpivarski commented 4 years ago

I'll merge when you tell me whether you're going to make the suggested change or not (it's optional) and when you tell me that you've coordinated the if-statements in jagged.py with @nsmith-.

jpivarski commented 4 years ago

This no longer makes changes to jagged.py, so there's no issue with merging. It doesn't replace the user-configurable use_large_index with a direct detection based on the starts, stops integer size. Do you want me to merge it anyway? (I.e. are you punting on that?)

lgray commented 4 years ago

@jpivarski I've implemented your requests locally but it cropped up some problems with parquet and arrow. The former a limitation, the latter may be a bug.

lgray commented 4 years ago

Yep there is no validation for LargeListArray in arrow yet: https://github.com/apache/arrow/blob/e902b24e9de79f18d542e6d29a55ced26b2dc696/cpp/src/arrow/array/validate.cc#L78

But there is for binary array? It's just a completeness issue. Anyway, will comment that out until a fix is made.... I might contribute it.

lgray commented 4 years ago

Ah, no that was false, there was a forced conversion to 32bit offsets for strings and binary. Removed that and now everything works!

lgray commented 4 years ago

@jpivarski I've changed the code such that we never presently serialize into 64-bit offset types, but we may deserialize them when they are encountered.

Once things are a bit more mature on the arrow side we can uncomment the serialization parts.

jpivarski commented 4 years ago

Nice, thanks! I'll merge this as soon as the tests pass.