Closed honno closed 1 year ago
I didn't realise CI has 3.6 and 3.7 jobs. I see https://github.com/vaexio/vaex/pull/1802, so assume these versions want to be generally maintained for? I have used f-stirngs in this PR, although note the df protocol implementation and tests already used them.
Forgot to update some refs of size
(i.e. use size()
instead), so made the necessary updates and wrote a test. Found a pre-existing bug with Column.get_chunks()
that now has some xfailed cases... probably something to fix in a future PR I imagine.
Otherwise I think I've addressed your comments now :)
As an aside, might I ask if you still need to maintain for out-dated Python versions? From a selfish perspective, removing those jobs would bring CI to green :sweat_smile:
Yeah, Python 3.6 is still being used by clients of a client, so I'm still stuck with that :)
Note that 3.6 has f-strings, just not that =
syntax in the f-string.
pyarrow.lib.ArrowInvalid: Dictionary array invalid: Invalid: First or last binary offset out of bounds
This seems like a legit issue :/
pyarrow.lib.ArrowInvalid: Dictionary array invalid: Invalid: First or last binary offset out of bounds
This seems like a legit issue :/
I wasn't even scrolling down the failing jobs heh, didn't notice all win+mac jobs were failing. ~Seems to be a pre-existing problem, looking at the log for an older CI run, so I'd suggest we leave it for a future PR. Ideally we'd xfail those tests for those platforms—is there any platform mechanism for that already exists in vaex that I should use, or shall I try come up with something?~ Seems like an introduced issue in how I conform to the spec changes. Given the OS requirements and my ignorance on vaex/pyarrow here, I'm not particularly motivated to debug this haha, so platform-specific xfails still feel appropriate (seeing this PR is still a plus in that it conforms to the spec and can actually mostly work for a lot of interchange, if just for linux).
Ok, sorry this takes so long, but apart from being very busy, this was difficult to debug.
The issue seems to be that convert_categorical_column
calls convert_string_column
, calling buffer_to_ndarray
which passes the pointer to numpy, but but the nobody has a reference to the _VaexBuffer
anymore, which releases the reference to the arrow array, which then release the underlying memory.
In this specific case, we get the Arrow buffers via bitmap_buffer, offsets, string_bytes = self._col.evaluate().buffers()
which returns an arrow array Vaex' doesn't hold the reference to any more (as will be the case for virtual column as well).
So the question is, using the .ptr
mechanism, who's responsible for holding a reference to the underlying memory?
Looking at the diff again, this seems like it was an underlying issue that's only come up because this PR handles string labels correctly for categoricals? Mind I'm pretty ignorant when it comes to memory management (seems like a nightmare to debug heh).
So if this isn't an introduced bug and what was working still works, I'd recommend to just xfail the failing test cases if you don't have the bandwidth to fix this right now. I did introduce tests covering string behaviour generally, which should help prevent regressions when attempting to fix this.
Yeah, feel free to xfail that test for now, on all platform
Yeah, feel free to xfail that test for now, on all platform
Done!
The interchange protocol had some spec changes recently (https://github.com/data-apis/dataframe-api/pull/74), so this PR namely updates vaex to conform with them.
Column.size()
pandas.interchange
)Column.describe_categorical
categories
as opposed tomapping
, which is aColumn
containing the labelsColumn.describe_categorical
behaviourconvert_categorical_column
There's new behaviour with datatimes (NaTs are now sentinel values), but they first require vaex to support interchanging datetimes generally, so would be better served in a future PR.
cc @maartenbreddels