vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License
8.22k stars 590 forks source link

[BUG-REPORT] vconstant(val, dtype) is bad at converting #2286

Closed NickCrews closed 1 year ago

NickCrews commented 1 year ago

Description

import vaex
import numpy as np
import pandas as pd
df = vaex.from_arrays(x=[0,1,2])
# If I uncomment out any of these lines, I get various errors
# I want "5"
# df["str_int"] = vaex.vconstant(5, len(df), dtype=str)  # pyarrow.lib.ArrowTypeError: Expected bytes, got a 'int' object
# I want <missing>
# df["str_np_nan"] = vaex.vconstant(np.nan, len(df), dtype=str)  # pyarrow.lib.ArrowTypeError: Expected bytes, got a 'float' object
# I want <missing>
# df["str_pd_NA"] = vaex.vconstant(pd.NA, len(df), dtype=str)  # TypeError: Could not calculate fingerprint for (<NA>, 3, <U0, 1024)/{}
# I want <missing>
# df["int_pd_NA"] = vaex.vconstant(pd.NA, len(df), dtype="int32")  # TypeError: Could not calculate fingerprint for (<NA>, 3, int32, 1024)/{}
# I want <missing>
# df["int_np.nan"] = vaex.vconstant(np.nan, len(df), dtype="int32")  # pyarrow.lib.ArrowInvalid: Could not convert nan with type float: tried to convert to int32
df

Is this a bug? These all seem like very basic things someone would want to do.

What my motivating use case is "str_np_nan" or "str_pd_NA", I want a vconstant of string dtype filled with <missing>. Is there a workaround? The rest of the examples are exploring to figure out the scope of the problem.

Software information

JovanVeljanoski commented 1 year ago

Hi,

thanks for the report.

For the 1st case, it is equivalent of doing

import pyarrow as pa

x = pa.array([5] * 3, type=pa.string())   # And this is kind of how this is implemented under the hood, i.e. what is evaluated when required.

I guess we can check under the hood if things kind of make sense, and cast the value of vconstant to the appropriate type in cases such as this. Dunno if this should be called a bug or or a user error. Maybe we can account for it in this case. Maybe you'd like to open a PR on this?

For the other cases: pandas types probably will never be supported - they are pandas specific.

For the numpy nan case: i would call this a missing feature. There is currently no good way (at least that I know of) that allows you to introduce missing/masked/NA/nan values. It has been requested in the past for example , given a condition to mask or cast to nan a certain value (say within the where method). I would really like to have this feature, but i don't know how to do it :). Perhaps @maartenbreddels has an idea.

NickCrews commented 1 year ago

Thanks for the explanation.

Perhaps the implementation could be changed to

import pyarrow as pa
import pyarrow.compute as pc

val = 5
length = 3
dtype = pa.string()

MISSING_VALS = {
    pd.NA,
    np.nan,
    None,
}

if val in MISSING_VALS:
    val = None

# pyarrow will be smart enough to cast to whatever type is required
x = pa.array([val] * length)
# Again, let pyarrow do the casting logic.
# IDK if we need to convert "str" -> pa.string(), probably we would,
# but it appears that this conversion already occurs given your example.
pc.cast(x, dtype)

I can try a PR if that looks like the right idea. This algorithm will do extra work if it begins as an int32 say and then has to be cast to a int64, perhaps could just use this new algorithm as a fallback.

OK, skipping pandas types in fine. I just want missing vals. Specifying with np.nan or None both work for me.

Using .where() to set nulls is great, but even more basic is just df["x"] = None/np.nan to set the whole column to missing. I really think this is a missing basic feature. I can help here if I get some pointers.

JovanVeljanoski commented 1 year ago

Hi,

Yeah i like your idea. I did an implementation following your example: not sure if it will be accepted - @maartenbreddels will review. But the thing work!

There might be an edge case for arrow types which do not have python equivalent (if there are such types?).

One other comment: in vaex we try to make a distinction between missing values and not a number aka nan. Missing values are represented in pure python as None and in numpy / arrow as masked arrays. np.nan is a special float value. I know the latter has traditionally been used to indicate missing data, but we see it differently (more like invalid data rather than missing or absent - meaning like a measurement was made but something has gone wrong vs measurement was not taken at all. Sometimes this matters!).

In cases where you do not care about this distinction vaex implements N/A (e.g. countna, dropna, etc..) which is the union of the two cases above.

Also please feel free to comment/review the PR!

Note

df['aa] = vaex.vconstant(np.nan, len(df), dtype) will work when dtype is float, and also for strings, but then the value will be cast to a string so you will get a "nan" "word". I hope this makes sense (t it kind of late here..).

NickCrews commented 1 year ago

The PR looks good, except it doesn't pass a test, so not sure what will need to get tweaked to fix that.

IDK about edge cases for types that might not work. We'll find out if people file bugs, but until then at least we can be happy we're moving in the right direction.

re: different NA types, yes I appreciate that you thought of this and that you are treating them differently. I'm being sloppy with my vocab, I care about strings so I mean missing.

re: your example, it's not obvious to me what vaex.vconstant(np.nan, len(df), str) should return, either "nan" as it currently does or . Both seem like reasonable options. As long as vaex.vconstant(None, len(df), str) results in then I'm happy to just use that unambiguous method.

JovanVeljanoski commented 1 year ago

Thanks! You are right, we can do it in steps, as needed.

As for the last point, that is what happens when you do pc.cast(np.nan, pa.string()). But that makes sense to me, one is explicitly casting a float to a string, it would be the same if you do it with pc.cast(5.3, pa.string()), it will be turned into "5.3".

Cheers!

NickCrews commented 1 year ago

hmmm, you're right when I think about it more, np.nan getting turned to "nan" makes sense.

Looks like the PR got merged and if I update to vaex-4.16.0 then things work! Thank you for another quick fix!