vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License
8.25k stars, 590 forks

[BUG-REPORT] Expression.astype("uint") works for numpy but not arrow #2191

Open NickCrews opened 2 years ago

NickCrews commented 2 years ago

See added xfailing test: https://github.com/vaexio/vaex/pull/2190/commits/538a5a68db8242995ae27ef96f2cb7ae6e585e2e

JovanVeljanoski commented 1 year ago

Actually, I do not think this is a bug.

Looking at the Arrow documentation, there is no such thing as uint.

So in your test, if you use .astype("uint64"), for instance, things will work.

I guess we could make uint an alias for uint64 to account for this. What do you think, @maartenbreddels @NickCrews?
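A minimal sketch of what such an alias could look like. This is not vaex's actual implementation: `normalize_dtype_name` and `_WIDTH_ALIASES` are hypothetical names, and the only mapping taken from this thread is uint to uint64. The idea is to resolve the width-less name once, before either backend sees it.

```python
# Hypothetical alias table; only "uint" -> "uint64" comes from this thread.
_WIDTH_ALIASES = {
    "uint": "uint64",
}


def normalize_dtype_name(name: str) -> str:
    """Map a width-less dtype name to an explicit one; pass others through."""
    return _WIDTH_ALIASES.get(name, name)


# The normalized name would then be handed to the numpy or arrow cast.
print(normalize_dtype_name("uint"))
print(normalize_dtype_name("uint64"))
```

Names that are already explicit pass through unchanged, so existing calls such as .astype("uint64") would not be affected.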

NickCrews commented 1 year ago

Hmm, that explains why it doesn't work.

If we were starting from scratch, I might actually lean the opposite way: Make uint fail for BOTH numpy and arrow, and force users to be explicit with asking for uint64. But that would break people, so probably we can't change to that behavior now.
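For illustration, the stricter behavior could be sketched as below. `require_explicit_width` and the `_AMBIGUOUS` set are hypothetical and not part of vaex; which names count as ambiguous beyond uint is an assumption.

```python
# Hypothetical set of width-less names; only "uint" is discussed in this thread.
_AMBIGUOUS = {"uint", "int", "float"}


def require_explicit_width(name: str) -> str:
    """Reject width-less dtype names regardless of the backend in use."""
    if name in _AMBIGUOUS:
        raise ValueError(
            f"ambiguous dtype {name!r}: specify a width, e.g. {name}64"
        )
    return name
```

With this check in front of the cast, .astype("uint") would fail the same way for numpy-backed and arrow-backed columns, while .astype("uint64") would go through.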

If vaex is trying to be a higher level abstraction that hides the differences between numpy and arrow (I think this would be a great goal, but IDK how attainable it actually is) then I would like the alias proposal. However, if there are other cases where I DO need to know which backend holds my data (e.g. https://github.com/vaexio/vaex/pull/2192), then I would prefer if vaex explicitly left things as is and didn't try to do something clever. So IDK, I think it depends on the larger goals.

I'm fine closing this as "not a bug" and just being more explicit in the docstring for astype().

JovanVeljanoski commented 1 year ago

I think we generally agree.

I think the main idea (as much as we can make it) is that an average user should not care or even know whether the data lives in arrow or numpy underneath it all, as long as it is handled via vaex. When you get it out of vaex (with .values or .to_numpy(), for example), that's a different story.

And we do want the most obvious things to work out of the box with safe, general assumptions. I still think that many users are not so knowledgeable about (py)arrow yet, so it is nice to have some higher abstraction.

I am curious to hear @maartenbreddels' opinion on this, so let's keep this open for now, and thanks for reporting!