vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] bad error message during ordinal encode #2180

Open Ben-Epstein opened 2 years ago

Ben-Epstein commented 2 years ago

import vaex

df = vaex.datasets.titanic()
values = list(str(df.sex.unique()))
print(values)
df = df.ordinal_encode("sex", values=values + [str(i) for i in range(100)], lazy=True)
df["sex"] = df.sex.fillmissing(-1).astype("int32")
df

JovanVeljanoski commented 2 years ago

Hey Ben,

I can't reproduce this (using master).

@maartenbreddels thinks it might have something to do with the cache. If you are using it, can you try clearing it?

Ben-Epstein commented 1 year ago

Hi @JovanVeljanoski thanks for taking a look

I am still getting this, with 4.12.0 {'vaex-core': '4.12.0', 'vaex-hdf5': '0.12.3'}

import vaex
!rm -rf ~/.vaex

with vaex.cache.off():
    df = vaex.datasets.titanic()
    values = list(str(df.sex.unique()))
    print(values)
    df = df.ordinal_encode("sex", values=values + [str(i) for i in range(100)], lazy=True)
    df["sex"] = df.sex.fillmissing(-1).astype("int32")
    df

Ben-Epstein commented 1 year ago

@JovanVeljanoski your code differs from mine in one place, and that difference is why mine breaks and yours doesn't.

You have values = df.sex.unique() where I have values = list(str(df.sex.unique()))

My line breaks each letter into its own value in the list

I am creating way more values than you, for the purposes of breaking the function
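
To make that difference concrete, here is a plain-Python sketch (no vaex needed) of what list(str(...)) does. The list ["female", "male"] is only an assumed stand-in for the output of df.sex.unique(); the real order and container type may differ.

```python
# Stand-in for df.sex.unique(); assumed for illustration only.
unique_values = ["female", "male"]

# str() returns the repr of the whole list as one string;
# list() then splits that string into single characters.
chars = list(str(unique_values))

print(chars[:3])                     # the first few single-character "values"
print(len(chars), len(set(chars)))   # total entries vs. distinct entries
print("female" in chars)             # is any real category left in the list?
```

The resulting list contains repeated characters and none of the actual categories, which is the kind of input that triggers the unhelpful error.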

JovanVeljanoski commented 1 year ago

Hi @Ben-Epstein !

I looked into this a bit just now and there are more things going on. Let me explain.

The purpose of the values argument in ordinal_encode is purely performance: when you already know all (or most?) of the unique values in your column, you can pass them in and skip the step that computes them. From my point of view, you are abusing this functionality by passing random data to it, unrelated to the column (well, the dtype is the same, but other than that...). So there are 2 issues with your example:

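the values you pass contain repeated entries

none of the values you pass match any sample in the data

This is why my example above used the output of the unique method.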

Sorry for the overloaded use of the word "values" here, but I hope you can follow. To summarize (following your example):

values = list(str(df.sex.unique())) # will not work, repeated values

values = list(set(values))  # will not work, not a single value matches samples in data

values += ['female']  # will work, at least one value matches samples in data

values += ['female']  # will not work, repeated values

Ben-Epstein commented 1 year ago

Hey @JovanVeljanoski thanks for taking a look!

This makes sense to me, and I definitely recognize that I'm abusing the values key. The point of the example was to showcase that vaex doesn't provide a useful error message when this mistake occurs.

This happened to me in production because I had a bug in my code (I was passing in bad values), but I couldn't figure out what was wrong because of the error message.

Seems like, based on your helpful message above, there is a clear set of requirements. Maybe those can be added to the exception upon an invalid input?
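
As a rough sketch of what such a check could look like, assuming the two requirements from the summary above (values must be unique, and at least one must match the data): the helper below is hypothetical, is not part of vaex, and takes a plain list of column samples rather than a real DataFrame column.

```python
from collections import Counter


def check_ordinal_values(values, column_samples, column_name="column"):
    """Hypothetical pre-check (not a vaex API): raise a descriptive error for bad `values`."""
    values = list(values)

    # Requirement 1 (from the thread): no repeated values.
    repeated = [v for v, count in Counter(values).items() if count > 1]
    if repeated:
        raise ValueError(
            f"values for {column_name!r} must be unique; repeated entries: {repeated}"
        )

    # Requirement 2 (from the thread): at least one value matches the data.
    if not set(values) & set(column_samples):
        raise ValueError(
            f"none of the values passed for {column_name!r} occur in the data"
        )
    return values


samples = ["female", "male"]  # assumed stand-in for the column's contents

try:
    check_ordinal_values(list(str(samples)), samples, "sex")
except ValueError as err:
    message = str(err)
    print(message)
```

With an error like this raised up front, the bad input in my production code would have been obvious right away.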