vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License
8.25k stars 590 forks source link

Interchange protocol fixes and updates #2150

Closed honno closed 1 year ago

honno commented 2 years ago

The interchange protocol had some spec changes recently (https://github.com/data-apis/dataframe-api/pull/74), so this PR namely updates vaex to conform with them.

There's new behaviour with datatimes (NaTs are now sentinel values), but they first require vaex to support interchanging datetimes generally, so would be better served in a future PR.

cc @maartenbreddels

honno commented 2 years ago

I didn't realise CI has 3.6 and 3.7 jobs. I see https://github.com/vaexio/vaex/pull/1802, so assume these versions want to be generally maintained for? I have used f-stirngs in this PR, although note the df protocol implementation and tests already used them.

honno commented 2 years ago

Forgot to update some refs of size (i.e. use size() instead), so made the necessary updates and wrote a test. Found a pre-existing bug with Column.get_chunks() that now has some xfailed cases... probably something to fix in a future PR I imagine.

Otherwise I think I've addressed your comments now :)

As an aside, might I ask if you still need to maintain for out-dated Python versions? From a selfish perspective, removing those jobs would bring CI to green :sweat_smile:

maartenbreddels commented 2 years ago

Yeah, Python 3.6 is still being used by clients of a client, so I'm still stuck with that :) Note that 3.6 has f-strings, just not that = syntax in the f-string.

maartenbreddels commented 2 years ago

pyarrow.lib.ArrowInvalid: Dictionary array invalid: Invalid: First or last binary offset out of bounds

This seems like a legit issue :/

honno commented 2 years ago

pyarrow.lib.ArrowInvalid: Dictionary array invalid: Invalid: First or last binary offset out of bounds

This seems like a legit issue :/

I wasn't even scrolling down the failing jobs heh, didn't notice all win+mac jobs were failing. ~Seems to be a pre-existing problem, looking at the log for an older CI run, so I'd suggest we leave it for a future PR. Ideally we'd xfail those tests for those platforms—is there any platform mechanism for that already exists in vaex that I should use, or shall I try come up with something?~ Seems like an introduced issue in how I conform to the spec changes. Given the OS requirements and my ignorance on vaex/pyarrow here, I'm not particularly motivated to debug this haha, so platform-specific xfails still feel appropriate (seeing this PR is still a plus in that it conforms to the spec and can actually mostly work for a lot of interchange, if just for linux).

maartenbreddels commented 1 year ago

Ok, sorry this takes so long, but apart from being very busy, this was difficult to debug.

The issue seems to be that convert_categorical_column calls convert_string_column, calling buffer_to_ndarray which passes the pointer to numpy, but but the nobody has a reference to the _VaexBuffer anymore, which releases the reference to the arrow array, which then release the underlying memory.

In this specific case, we get the Arrow buffers via bitmap_buffer, offsets, string_bytes = self._col.evaluate().buffers() which returns an arrow array Vaex' doesn't hold the reference to any more (as will be the case for virtual column as well).

So the question is, using the .ptr mechanism, who's responsible for holding a reference to the underlying memory?

honno commented 1 year ago

Looking at the diff again, this seems like it was an underlying issue that's only come up because this PR handles string labels correctly for categoricals? Mind I'm pretty ignorant when it comes to memory management (seems like a nightmare to debug heh).

So if this isn't an introduced bug and what was working still works, I'd recommend to just xfail the failing test cases if you don't have the bandwidth to fix this right now. I did introduce tests covering string behaviour generally, which should help prevent regressions when attempting to fix this.

maartenbreddels commented 1 year ago

Yeah, feel free to xfail that test for now, on all platform

honno commented 1 year ago

Yeah, feel free to xfail that test for now, on all platform

Done!