vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License
8.31k stars 591 forks

[BUG-REPORT] Encoders (Label, Frequency, etc) failing to encode data using the state_set method #1429

Closed BadriPrudhvi closed 3 years ago

BadriPrudhvi commented 3 years ago

Description

I am having issues using the state_set method to encode the data.

Code in Training Notebook

import vaex
from vaex.ml import LabelEncoder

df_train = vaex.open('./data/titanic_train.csv')
label_encoder = LabelEncoder(features=['sex'])
df_train = label_encoder.fit_transform(df_train)

model_path = './output/titanic_encoder.json'
df_train.state_write(model_path)

Code in Inference Notebook

import vaex
import json

test = vaex.open('./data/titanic_test.csv')

model_path = './output/titanic_encoder.json'

with open(model_path) as f:
    model = json.load(f)

test.state_set(model)

test

Error

ERROR:MainThread:vaex:error evaluating: label_encoded_sex at rows 0-5
Traceback (most recent call last):
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/scopes.py", line 106, in evaluate
    result = self[expression]
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/scopes.py", line 166, in __getitem__
    raise KeyError("Unknown variables or column: %r" % (variable,))
KeyError: "Unknown variables or column: '_map(sex, map_key_set, map_choices, axis=None)'"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/dataframe.py", line 2010, in data_type
    data = self.evaluate(expression, 0, 1, filtered=False, array_type=array_type, parallel=False)
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/dataframe.py", line 2851, in evaluate
    return self._evaluate_implementation(expression, i1=i1, i2=i2, out=out, selection=selection, filtered=filtered, array_type=array_type, parallel=parallel, chunk_size=chunk_size)
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/dataframe.py", line 6099, in _evaluate_implementation
    value = scope.evaluate(expression)
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/scopes.py", line 106, in evaluate
    result = self[expression]
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/scopes.py", line 156, in __getitem__
    values = self.evaluate(expression)
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/scopes.py", line 112, in evaluate
    result = eval(expression, expression_namespace, self)
  File "<string>", line 1, in <module>
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/arrow/numpy_dispatch.py", line 136, in wrapper
    result = f(*args, **kwargs)
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/functions.py", line 2484, in _map
    indices = value_to_index.map_ordinal(ar) + 1
AttributeError: 'dict' object has no attribute 'map_ordinal'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/scopes.py", line 106, in evaluate
    result = self[expression]
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/scopes.py", line 166, in __getitem__
    raise KeyError("Unknown variables or column: %r" % (variable,))
KeyError: "Unknown variables or column: '_map(sex, map_key_set, map_choices, axis=None)'"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/dataframe.py", line 3783, in table_part
    values = dict(zip(column_names, df.evaluate(column_names)))
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/dataframe.py", line 2851, in evaluate
    return self._evaluate_implementation(expression, i1=i1, i2=i2, out=out, selection=selection, filtered=filtered, array_type=array_type, parallel=parallel, chunk_size=chunk_size)
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/dataframe.py", line 6011, in _evaluate_implementation
    dtypes[expression] = dtype = df.data_type(expression).internal
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/dataframe.py", line 2012, in data_type
    data = self.evaluate(expression, 0, 1, filtered=True, array_type=array_type, parallel=False)
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/dataframe.py", line 2851, in evaluate
    return self._evaluate_implementation(expression, i1=i1, i2=i2, out=out, selection=selection, filtered=filtered, array_type=array_type, parallel=parallel, chunk_size=chunk_size)
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/dataframe.py", line 6099, in _evaluate_implementation
    value = scope.evaluate(expression)
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/scopes.py", line 106, in evaluate
    result = self[expression]
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/scopes.py", line 156, in __getitem__
    values = self.evaluate(expression)
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/scopes.py", line 112, in evaluate
    result = eval(expression, expression_namespace, self)
  File "<string>", line 1, in <module>
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/arrow/numpy_dispatch.py", line 136, in wrapper
    result = f(*args, **kwargs)
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/functions.py", line 2484, in _map
    indices = value_to_index.map_ordinal(ar) + 1
AttributeError: 'dict' object has no attribute 'map_ordinal'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/scopes.py", line 106, in evaluate
    result = self[expression]
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/scopes.py", line 166, in __getitem__
    raise KeyError("Unknown variables or column: %r" % (variable,))
KeyError: "Unknown variables or column: '_map(sex, map_key_set, map_choices, axis=None)'"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/dataframe.py", line 2010, in data_type
    data = self.evaluate(expression, 0, 1, filtered=False, array_type=array_type, parallel=False)
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/dataframe.py", line 2851, in evaluate
    return self._evaluate_implementation(expression, i1=i1, i2=i2, out=out, selection=selection, filtered=filtered, array_type=array_type, parallel=parallel, chunk_size=chunk_size)
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/dataframe.py", line 6099, in _evaluate_implementation
    value = scope.evaluate(expression)
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/scopes.py", line 106, in evaluate
    result = self[expression]
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/scopes.py", line 156, in __getitem__
    values = self.evaluate(expression)
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/scopes.py", line 112, in evaluate
    result = eval(expression, expression_namespace, self)
  File "<string>", line 1, in <module>
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/arrow/numpy_dispatch.py", line 136, in wrapper
    result = f(*args, **kwargs)
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/functions.py", line 2484, in _map
    indices = value_to_index.map_ordinal(ar) + 1
AttributeError: 'dict' object has no attribute 'map_ordinal'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/scopes.py", line 106, in evaluate
    result = self[expression]
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/scopes.py", line 166, in __getitem__
    raise KeyError("Unknown variables or column: %r" % (variable,))
KeyError: "Unknown variables or column: '_map(sex, map_key_set, map_choices, axis=None)'"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/dataframe.py", line 3788, in table_part
    values[name] = df.evaluate(name)
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/dataframe.py", line 2851, in evaluate
    return self._evaluate_implementation(expression, i1=i1, i2=i2, out=out, selection=selection, filtered=filtered, array_type=array_type, parallel=parallel, chunk_size=chunk_size)
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/dataframe.py", line 6011, in _evaluate_implementation
    dtypes[expression] = dtype = df.data_type(expression).internal
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/dataframe.py", line 2012, in data_type
    data = self.evaluate(expression, 0, 1, filtered=True, array_type=array_type, parallel=False)
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/dataframe.py", line 2851, in evaluate
    return self._evaluate_implementation(expression, i1=i1, i2=i2, out=out, selection=selection, filtered=filtered, array_type=array_type, parallel=parallel, chunk_size=chunk_size)
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/dataframe.py", line 6099, in _evaluate_implementation
    value = scope.evaluate(expression)
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/scopes.py", line 106, in evaluate
    result = self[expression]
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/scopes.py", line 156, in __getitem__
    values = self.evaluate(expression)
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/scopes.py", line 112, in evaluate
    result = eval(expression, expression_namespace, self)
  File "<string>", line 1, in <module>
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/arrow/numpy_dispatch.py", line 136, in wrapper
    result = f(*args, **kwargs)
  File "/home/prudhvi/.local/lib/python3.7/site-packages/vaex/functions.py", line 2484, in _map
    indices = value_to_index.map_ordinal(ar) + 1
AttributeError: 'dict' object has no attribute 'map_ordinal'

Software information

JovanVeljanoski commented 3 years ago

Hi @BadriPrudhvi

Thank you for raising this. However, I am not sure this is a bug. In your inference notebook / script, can you try the following:


import vaex
import json

test = vaex.open('./data/titanic_test.csv')

model_path = './output/titanic_encoder.json'
# Use this instead of reading the json yourself.
test.state_load(model_path)
test

The reason is that vaex does some encoding and decoding of its own when it writes and reads a state, so if you read the JSON yourself, as you did, some of it may not get decoded. You can look in the source if you are curious about the details.

In any case, using df.state_load(...) should work when the state has been written to disk.
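As a rough illustration of the point about encoding and decoding, here is a toy sketch using only the Python standard library. It is not vaex's actual implementation: the `OrdinalMap` class, the `encode`/`decode` helpers and the `"__type__"` tag are all hypothetical, and only the names `map_key_set` and `map_ordinal` are borrowed from the traceback. It shows how an object written through a custom encoder comes back as a plain dict when the file is read with plain `json`, and only regains its methods when the matching decoder is applied.

```python
import json


class OrdinalMap:
    """Toy stand-in (hypothetical) for a library object that offers map_ordinal()."""

    def __init__(self, keys):
        self.keys = list(keys)

    def map_ordinal(self, values):
        # Position of each value in self.keys, or -1 when the value is unknown.
        return [self.keys.index(v) if v in self.keys else -1 for v in values]


def encode(obj):
    # Called by json.dumps for objects it cannot serialize natively.
    if isinstance(obj, OrdinalMap):
        return {"__type__": "OrdinalMap", "keys": obj.keys}
    raise TypeError(f"cannot encode {type(obj).__name__}")


def decode(d):
    # Called by json.loads for every JSON object; rebuilds the tagged ones.
    if d.get("__type__") == "OrdinalMap":
        return OrdinalMap(d["keys"])
    return d


state = {"variables": {"map_key_set": OrdinalMap(["female", "male"])}}
text = json.dumps(state, default=encode)

raw = json.loads(text)                           # plain json: no decoding step
restored = json.loads(text, object_hook=decode)  # matching decoder applied

print(type(raw["variables"]["map_key_set"]).__name__)
print(restored["variables"]["map_key_set"].map_ordinal(["male", "female", "other"]))
```

In this toy, calling `map_ordinal` on the value inside `raw` would raise an `AttributeError` of the same kind as the one in the report, because that value is a plain dict.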

Please let us know if this helps.

JovanVeljanoski commented 3 years ago

Another way to convince yourself of what I said above is to compare the output of df.state_get() in your training notebook with the contents of the JSON that you read from disk yourself.
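One possible way to make that comparison less tedious is a small helper that walks two nested structures and reports where their types differ. This is only a sketch: `diff_types` is a hypothetical helper, not part of vaex, and the two dictionaries below are made-up placeholders rather than real vaex state.

```python
def diff_types(a, b, path="state"):
    """Yield (path, type_a, type_b) wherever two nested structures differ in type."""
    if type(a) is not type(b):
        yield path, type(a).__name__, type(b).__name__
    elif isinstance(a, dict):
        # Only keys present on both sides are compared.
        for key in sorted(set(a) & set(b), key=str):
            yield from diff_types(a[key], b[key], f"{path}[{key!r}]")
    elif isinstance(a, (list, tuple)):
        for i, (x, y) in enumerate(zip(a, b)):
            yield from diff_types(x, y, f"{path}[{i}]")


# Placeholder data (made up for illustration, not actual vaex output).
in_memory = {
    "variables": {
        "map_key_set": frozenset({"female", "male"}),
        "map_choices": [0, 1],
    }
}
from_disk = {
    "variables": {
        "map_key_set": {"type": "set"},
        "map_choices": [0, 1],
    }
}

differences = list(diff_types(in_memory, from_disk))
for entry in differences:
    print(entry)
```

In the training notebook the same helper could be called as `list(diff_types(df_train.state_get(), model))`, where `model` is the result of `json.load`, to see which entries, if any, come back in a different form.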

JovanVeljanoski commented 3 years ago

Closing this as stale. Please re-open if needed.