vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[FEATURE-REQUEST] Make `ColumnVirtualConstant` and `ColumnVirtualRange` have same API as `Expression` #2295

Open NickCrews opened 1 year ago

NickCrews commented 1 year ago

Description: It would be nice if I could treat the results of vrange() and vconstant() as though they were Expressions, e.g. be able to call .astype() or .isna() on them.

I think this would make sense if we interpret ColumnVirtualConstant and ColumnVirtualRange as "implementations of" Expression, i.e. they have an "is a" relationship with it. Am I missing something here? Do they actually serve distinct roles?

Additional context: It's not a huge deal to work around this by assigning the columns to a DataFrame, at which point they are converted to Expressions:

df["x"] = vconstant(...)
df["x"].isna()

but it would be nice to be able to do this directly.

Implementation: I'm not exactly sure what to do. The simple way would be to reuse the conversion shown above, where assigning the column to a DataFrame yields an Expression. However, that isn't well optimized, since it can be deduced before materialization that vconstant(1, length=1_000_000_000).isna() should result in vconstant(False, length=1_000_000_000). Perhaps the best approach is to use the simple way by default, but leave the door open for custom overrides in specific cases if someone wants them.
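To make the "simple way by default, overrides where it pays off" idea concrete, here is a rough sketch in plain Python. None of these classes are vaex's: ToyColumn, ToyConstantColumn, ToyRangeColumn and _is_missing are hypothetical names used only for illustration, and the default path simply loops over the elements instead of going through a DataFrame.

```python
import math


def _is_missing(value):
    # Hypothetical helper: treat None and float NaN as missing.
    return value is None or (isinstance(value, float) and math.isnan(value))


class ToyColumn:
    """Hypothetical stand-in for a lazy column (NOT a vaex class)."""

    def __len__(self):
        raise NotImplementedError

    def __getitem__(self, index):
        raise NotImplementedError

    def isna(self):
        # Default ("simple") path: materialize by visiting every element.
        return [_is_missing(self[i]) for i in range(len(self))]


class ToyConstantColumn(ToyColumn):
    def __init__(self, value, length):
        self.value = value
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        return self.value

    def isna(self):
        # Override: a single check decides the answer for every row,
        # so the result is again a constant column and nothing is materialized.
        return ToyConstantColumn(_is_missing(self.value), self.length)


class ToyRangeColumn(ToyColumn):
    def __init__(self, start, stop):
        self.start = start
        self.stop = stop

    def __len__(self):
        return max(0, self.stop - self.start)

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        return self.start + index

    # No isna override here: the default element-wise path is inherited.
```

With this layout, ToyConstantColumn(1, 1_000_000_000).isna() returns another constant column without touching a billion rows, while ToyRangeColumn falls back to the generic path until someone bothers to write an override for it.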

maartenbreddels commented 1 year ago

A Column object plays more the role of an array: it basically has the part of the NumPy and Arrow array API that vaex requires (.dtype, __getitem__, __len__). Arrays also only get the Expression API after we add them to a dataframe, so I think a Column object not having the Expression API is consistent with NumPy and Arrow arrays not having it.

I do agree it would be nice if .isna() could be overridden by the column objects. We could support this at the function.py level; for instance, .isna() already needs to be aware of NumPy and Arrow. Or we could have a special method on the column object that can override .isna(), similar to NEP 13.
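A rough sketch of what the NEP 13-style option could look like, written in plain Python so it does not depend on vaex internals. Everything here is an assumption for illustration: the hook name __toy_function__, the isna function and both column classes are hypothetical, and vaex's actual function.py may be organized quite differently. The idea borrowed from NEP 13 is that the generic function first asks the object whether it wants to handle the call, and falls back to the normal path when the object returns NotImplemented.

```python
import math


def _is_missing(value):
    # Hypothetical helper: treat None and float NaN as missing.
    return value is None or (isinstance(value, float) and math.isnan(value))


def isna(column):
    """Generic isna that lets the column take over the computation."""
    hook = getattr(column, "__toy_function__", None)  # hypothetical hook name
    if hook is not None:
        result = hook(isna, (column,))
        if result is not NotImplemented:
            return result
    # Fallback: element-wise evaluation over anything iterable.
    return [_is_missing(v) for v in column]


class ConstantColumn:
    """Hypothetical constant column that overrides isna via the hook."""

    def __init__(self, value, length):
        self.value = value
        self.length = length

    def __len__(self):
        return self.length

    def __iter__(self):
        for _ in range(self.length):
            yield self.value

    def __toy_function__(self, func, args):
        if func is isna:
            # Answer without iterating: the result is constant as well.
            return ConstantColumn(_is_missing(self.value), self.length)
        return NotImplemented


class OptOutColumn:
    """Hypothetical column that declines every override request."""

    def __init__(self, values):
        self.values = list(values)

    def __iter__(self):
        return iter(self.values)

    def __toy_function__(self, func, args):
        return NotImplemented
```

Calling isna(ConstantColumn(None, 10**9)) returns a constant column straight from the hook, while plain lists and OptOutColumn go through the element-wise fallback. Compared with special-casing column types inside each function, this keeps the shortcut logic next to the column class that knows about it.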