vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[FEATURE-REQUEST] Reduce number of functions in state_get if functions are the same #1539

Open fsiola opened 3 years ago

fsiola commented 3 years ago

Description

When the same function is applied to multiple columns, each application is stored in the state as a separate function entry, even though the pickled payloads are identical. This could probably be optimised by storing the function only once.

Example

import vaex
import pandas as pd

def simple_func(x):
    return x+1

pandas_df = pd.DataFrame({
    "col1": [1,2,3,4],
    "col2": [1,2,3,4],
})

vaex_df = vaex.from_pandas(pandas_df)
vaex_df['col1_1'] = vaex_df['col1'].apply(simple_func)
vaex_df['col2_1'] = vaex_df['col2'].apply(simple_func)
vaex_df.state_get()

Output

{'virtual_columns': {'col1_1': 'lambda_function(col1)',
  'col2_1': 'lambda_function_1(col2)'},
 'column_names': ['col1', 'col2', 'col1_1', 'col2_1'],
 'renamed_columns': [],
 'variables': {},
 'functions': {'lambda_function': {'cls': 'vaex.expression.FunctionToScalar',
   'state': {'pickled': 'gASV8QEAAAAAAACMF2Nsb3VkcGlja2xlLmNsb3VkcGlja2xllIwNX2J1aWx0aW5fdHlwZZSTlIwK\nTGFtYmRhVHlwZZSFlFKUKGgCjAhDb2RlVHlwZZSFlFKUKEsBSwBLAUsCS0NDCHwAZAEXAFMAlE5L\nAYaUKYwBeJSFlIxOL3Zhci9mb2xkZXJzL2diL2t0MjgybjFkMGZsOV9saDlxY244eTc5dzAwMDBn\ncC9UL2lweWtlcm5lbF85NTMyNC8xOTA0NTI5MjYyLnB5lIwLc2ltcGxlX2Z1bmOUSwFDAgABlCkp\ndJRSlH2UKIwLX19wYWNrYWdlX1+UTowIX19uYW1lX1+UjAhfX21haW5fX5R1Tk5OdJRSlIwcY2xv\ndWRwaWNrbGUuY2xvdWRwaWNrbGVfZmFzdJSMEl9mdW5jdGlvbl9zZXRzdGF0ZZSTlGgXfZR9lCho\nFGgOjAxfX3F1YWxuYW1lX1+UaA6MD19fYW5ub3RhdGlvbnNfX5R9lIwOX19rd2RlZmF1bHRzX1+U\nTowMX19kZWZhdWx0c19flE6MCl9fbW9kdWxlX1+UaBWMB19fZG9jX1+UTowLX19jbG9zdXJlX1+U\nTowXX2Nsb3VkcGlja2xlX3N1Ym1vZHVsZXOUXZSMC19fZ2xvYmFsc19flH2UdYaUhlIwLg==\n'}},
  'lambda_function_1': {'cls': 'vaex.expression.FunctionToScalar',
   'state': {'pickled': 'gASV8QEAAAAAAACMF2Nsb3VkcGlja2xlLmNsb3VkcGlja2xllIwNX2J1aWx0aW5fdHlwZZSTlIwK\nTGFtYmRhVHlwZZSFlFKUKGgCjAhDb2RlVHlwZZSFlFKUKEsBSwBLAUsCS0NDCHwAZAEXAFMAlE5L\nAYaUKYwBeJSFlIxOL3Zhci9mb2xkZXJzL2diL2t0MjgybjFkMGZsOV9saDlxY244eTc5dzAwMDBn\ncC9UL2lweWtlcm5lbF85NTMyNC8xOTA0NTI5MjYyLnB5lIwLc2ltcGxlX2Z1bmOUSwFDAgABlCkp\ndJRSlH2UKIwLX19wYWNrYWdlX1+UTowIX19uYW1lX1+UjAhfX21haW5fX5R1Tk5OdJRSlIwcY2xv\ndWRwaWNrbGUuY2xvdWRwaWNrbGVfZmFzdJSMEl9mdW5jdGlvbl9zZXRzdGF0ZZSTlGgXfZR9lCho\nFGgOjAxfX3F1YWxuYW1lX1+UaA6MD19fYW5ub3RhdGlvbnNfX5R9lIwOX19rd2RlZmF1bHRzX1+U\nTowMX19kZWZhdWx0c19flE6MCl9fbW9kdWxlX1+UaBWMB19fZG9jX1+UTowLX19jbG9zdXJlX1+U\nTowXX2Nsb3VkcGlja2xlX3N1Ym1vZHVsZXOUXZSMC19fZ2xvYmFsc19flH2UdYaUhlIwLg==\n'}}},
 'selections': {'__filter__': None},
 'ucds': {},
 'units': {},
 'descriptions': {},
 'description': None,
 'active_range': [0, 4]}

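In the example above, lambda_function_1 is redundant: its pickled payload is identical to that of lambda_function, so col2_1 could simply reference lambda_function.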
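To illustrate the request, here is a minimal sketch of how the duplicates could be collapsed after the fact, working only on the plain dictionary returned by state_get(). The helper name dedupe_functions is hypothetical and not part of vaex, and the sketch assumes function names appear only in 'virtual_columns' expressions as call-like tokens (selections or other places that might reference a function are not handled).

```python
import copy
import json
import re


def dedupe_functions(state):
    """Return a copy of a state dict with identical function entries merged.

    Hypothetical helper, not a vaex API. Two entries count as identical when
    their whole spec (class name plus serialized state) is equal.
    """
    state = copy.deepcopy(state)
    canonical = {}  # serialized spec -> first function name seen
    renames = {}    # duplicate name -> canonical name

    for name, spec in list(state["functions"].items()):
        key = json.dumps(spec, sort_keys=True)
        if key in canonical:
            renames[name] = canonical[key]
            del state["functions"][name]
        else:
            canonical[key] = name

    if renames:
        # Match a duplicate name only when it is used as a call, e.g. "name(".
        names = "|".join(re.escape(n) for n in renames)
        pattern = re.compile(r"\b(" + names + r")\b(?=\s*\()")
        for column, expression in list(state["virtual_columns"].items()):
            state["virtual_columns"][column] = pattern.sub(
                lambda m: renames[m.group(1)], expression
            )
    return state


# Shortened stand-in for the state shown above (payload abbreviated on purpose).
example_state = {
    "virtual_columns": {
        "col1_1": "lambda_function(col1)",
        "col2_1": "lambda_function_1(col2)",
    },
    "functions": {
        "lambda_function": {
            "cls": "vaex.expression.FunctionToScalar",
            "state": {"pickled": "AAAA"},
        },
        "lambda_function_1": {
            "cls": "vaex.expression.FunctionToScalar",
            "state": {"pickled": "AAAA"},
        },
    },
}

deduped = dedupe_functions(example_state)
```

With the stand-in state, deduped keeps a single 'lambda_function' entry and both virtual columns call it. Comparing serialized payloads only catches exact duplicates, which is the case in this issue because the same function object is applied twice.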

maartenbreddels commented 3 years ago

Hi,

Good idea. We would need to see whether we can actually test for function equality; I have never looked into that.

Regards,

Maarten
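As a starting point for the function-equality question raised above, one option is to compare the pieces of a Python function that determine its behaviour instead of comparing function objects directly. The sketch below is an assumption about how such a test might look, not something vaex implements; same_function is a hypothetical name, and the check is deliberately conservative (it ignores globals and falls back to identity for callables without a __code__ attribute).

```python
def same_function(f, g):
    """Heuristic check that two Python functions behave identically.

    Hypothetical helper: compares bytecode, constants, referenced names,
    defaults and closure contents. It does not compare global variables.
    """
    if f is g:
        return True
    code_f = getattr(f, "__code__", None)
    code_g = getattr(g, "__code__", None)
    if code_f is None or code_g is None:
        # Builtins and other callables: only identity is trusted here.
        return False

    def closure_values(func):
        cells = func.__closure__ or ()
        return tuple(cell.cell_contents for cell in cells)

    return (
        code_f.co_code == code_g.co_code
        and code_f.co_consts == code_g.co_consts
        and code_f.co_names == code_g.co_names
        and code_f.co_varnames == code_g.co_varnames
        and f.__defaults__ == g.__defaults__
        and f.__kwdefaults__ == g.__kwdefaults__
        and closure_values(f) == closure_values(g)
    )
```

For the case in this issue, where the very same function object is applied to two columns, the identity check on the first line is already sufficient; the remaining comparisons only matter if separately defined but equivalent functions should also be merged.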