Open Ben-Epstein opened 2 years ago
Hey Ben,
I can't reproduce this (using master).
@maartenbreddels thinks it might have something to do with the cache. If you are using it, could you try clearing it?
Hi @JovanVeljanoski thanks for taking a look
I am still getting this with 4.12.0: {'vaex-core': '4.12.0', 'vaex-hdf5': '0.12.3'}
import vaex
!rm -rf ~/.vaex

with vaex.cache.off():
    df = vaex.datasets.titanic()
    values = list(str(df.sex.unique()))
    print(values)
    df = df.ordinal_encode("sex", values=values + [str(i) for i in range(100)], lazy=True)
    df["sex"] = df.sex.fillmissing(-1).astype("int32")
df
@JovanVeljanoski your code differs from mine in one place, and that difference is what breaks mine but not yours.
You have
values = df.sex.unique()
where I have
values = list(str(df.sex.unique()))
My line breaks each letter into its own value in the list.
I am creating way more values than you, for the purpose of breaking the function.
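To make the difference concrete, here is a small sketch in plain Python (no vaex needed). The list `unique_values` is only an assumed stand-in for whatever `df.sex.unique()` returns; the point is what `list(str(...))` does to it.

```python
# Assumed stand-in for the output of df.sex.unique(); the real call
# returns the distinct values found in the column.
unique_values = ["male", "female"]

# str() renders the whole list as ONE string, brackets and quotes included.
as_text = str(unique_values)

# list() then splits that string into single characters.
values = list(as_text)

print(values)
# Far more entries than distinct entries, so the list contains repeats.
print(len(values), len(set(values)))
# None of the original whole values survive as an entry.
print("female" in values)
```

So the resulting list both repeats entries and contains nothing that matches the data in the column.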
Hi @Ben-Epstein !
I looked into this a bit just now and there are more things going on. Let me explain.
The values key in ordinal_encode exists purely for performance reasons, for cases when you already know all (or most?) unique values in your column, so you can skip that step. From my point of view, you are abusing this functionality by passing random data to it, unrelated to the column (well, the dtype is the same, but other than that...). So there are 2 issues with your example:
The entries passed to values always need to be unique; this is why my example above used the output of the unique method.
Sorry for the overloaded use of the word "values" here, but I hope you can follow. To summarize (following your example):
values = list(str(df.sex.unique())) # will not work, repeated values
values = list(set(values)) # will not work, not a single value matches samples in data
values += ['female'] # will work, at least one value matches samples in data
values += ['female'] # will not work, repeated values
Hey @JovanVeljanoski thanks for taking a look!
This makes sense to me, and I definitely recognize that I'm abusing the values key. The point of the example was to showcase that vaex doesn't provide a useful error message when this mistake occurs.
This happened to me in production because I had a bug in my code (I was passing in bad values), but I couldn't figure out what was wrong because of the error message.
Seems like, based on your helpful message above, there is a clear set of requirements. Maybe those can be added to the exception upon an invalid input?
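As a rough sketch of what such a check could look like: the helper below is hypothetical and not part of vaex (the name `check_ordinal_values` and its arguments are made up for illustration). It encodes the two requirements as summarized in this thread, unique entries and at least one entry matching the data. Whether these are exactly the conditions vaex relies on internally is an assumption.

```python
def check_ordinal_values(values, column_values):
    """Hypothetical pre-check for the `values` argument (not a vaex API).

    values        -- the candidate list you intend to pass as `values`
    column_values -- the distinct values actually present in the column
    """
    values = list(values)

    # Requirement 1: no repeated entries.
    seen = set()
    duplicates = []
    for v in values:
        if v in seen and v not in duplicates:
            duplicates.append(v)
        seen.add(v)
    if duplicates:
        raise ValueError(
            f"`values` must be unique; repeated entries: {duplicates!r}"
        )

    # Requirement 2: at least one entry occurs in the data.
    if not seen & set(column_values):
        raise ValueError(
            "none of the entries in `values` occur in the column; "
            "at least one must match the data"
        )
    return values


# Passes: unique, and 'female' matches the data.
check_ordinal_values(["female", "x"], ["male", "female"])
```

Calling it with repeated entries, or with entries that never appear in the column, raises a ValueError naming the violated requirement instead of failing later with an unrelated message.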
Software information
Vaex version (import vaex; vaex.__version__): vaex-core 4.10.0, vaex-hdf5 0.12.3