lgray closed 4 years ago
There's something still taking time within assigning starts and stops.
Ah, CI is failing due to needing to have pyarrow deeper into the mix... I'm fine with waiting for awkward1 for this one if it's too much of a hassle.
I got your comments after writing mine. Let me know if you decide to cancel this PR.
Maybe take the offsets passthrough from mine, where I handle the soft type check?
I'll rebase this on master when #221 gets merged.
Ah, there appears to be a large string/unicode/bytes type as well. I will get to those too.
@jpivarski I think I can separate this PR entirely from Nick's. I'll fix that up, is there anything else you want before merge?
I'll merge when you tell me whether you're going to make the suggested change or not (it's optional) and when you tell me that you've coordinated the if-statements in jagged.py with @nsmith-.
This no longer makes changes to jagged.py, so there's no issue with merging. It doesn't replace the user-configurable use_large_index with a direct detection based on the starts, stops integer size. Do you want me to merge it anyway? (I.e. are you punting on that?)
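For reference, a minimal sketch of what direct detection from the starts, stops integer size could look like, assuming both are NumPy integer arrays. The helper name needs_large_index is hypothetical and not taken from this PR:

```python
import numpy as np

def needs_large_index(starts, stops):
    """Return True if either index array uses integers wider than 32 bits."""
    # Hypothetical helper: it inspects only the dtype width, not the values.
    return any(np.asarray(a).dtype.itemsize > 4 for a in (starts, stops))

starts = np.array([0, 2, 5], dtype=np.int64)
stops = np.array([2, 5, 6], dtype=np.int64)

print(needs_large_index(starts, stops))  # True
print(needs_large_index(starts.astype(np.int32), stops.astype(np.int32)))  # False
```

Checking the dtype rather than the largest value keeps the choice stable for a given array type, at the cost of picking 64-bit offsets even when the values would fit in 32 bits.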
@jpivarski I've implemented your requests locally, but that turned up some problems with parquet and arrow. The former is a limitation; the latter may be a bug.
Yep there is no validation for LargeListArray in arrow yet: https://github.com/apache/arrow/blob/e902b24e9de79f18d542e6d29a55ced26b2dc696/cpp/src/arrow/array/validate.cc#L78
But there is for binary array? It's just a completeness issue. Anyway, will comment that out until a fix is made.... I might contribute it.
Ah, no, that was false: there was a forced conversion to 32-bit offsets for strings and binary. Removed that and now everything works!
@jpivarski I've changed the code such that we never presently serialize into 64-bit offset types, but we may deserialize them when they are encountered.
Once things are a bit more mature on the arrow side we can uncomment the serialization parts.
Nice, thanks! I'll merge this as soon as the tests pass.
There are now 64-bit indexed ListArray, StringArray, and BinaryArray types in pyarrow.
I've put in a defaulted-false argument in toarrow() where the user can drive the index type used when serializing jagged arrays.