Open fsiola opened 3 years ago
Description
When the same function is applied to multiple columns, multiple copies of that function, each with an identical pickled payload, end up in the state. This could probably be optimised by storing only one copy of the function.
Example
import vaex
import pandas as pd

def simple_func(x):
    return x+1

pandas_df = pd.DataFrame({
    "col1": [1,2,3,4],
    "col2": [1,2,3,4],
})
vaex_df = vaex.from_pandas(pandas_df)
vaex_df['col1_1'] = vaex_df['col1'].apply(simple_func)
vaex_df['col2_1'] = vaex_df['col2'].apply(simple_func)
vaex_df.state_get()
{'virtual_columns': {'col1_1': 'lambda_function(col1)', 'col2_1': 'lambda_function_1(col2)'}, 'column_names': ['col1', 'col2', 'col1_1', 'col2_1'], 'renamed_columns': [], 'variables': {}, 'functions': {'lambda_function': {'cls': 'vaex.expression.FunctionToScalar', 'state': {'pickled': 'gASV8QEAAAAAAACMF2Nsb3VkcGlja2xlLmNsb3VkcGlja2xllIwNX2J1aWx0aW5fdHlwZZSTlIwK\nTGFtYmRhVHlwZZSFlFKUKGgCjAhDb2RlVHlwZZSFlFKUKEsBSwBLAUsCS0NDCHwAZAEXAFMAlE5L\nAYaUKYwBeJSFlIxOL3Zhci9mb2xkZXJzL2diL2t0MjgybjFkMGZsOV9saDlxY244eTc5dzAwMDBn\ncC9UL2lweWtlcm5lbF85NTMyNC8xOTA0NTI5MjYyLnB5lIwLc2ltcGxlX2Z1bmOUSwFDAgABlCkp\ndJRSlH2UKIwLX19wYWNrYWdlX1+UTowIX19uYW1lX1+UjAhfX21haW5fX5R1Tk5OdJRSlIwcY2xv\ndWRwaWNrbGUuY2xvdWRwaWNrbGVfZmFzdJSMEl9mdW5jdGlvbl9zZXRzdGF0ZZSTlGgXfZR9lCho\nFGgOjAxfX3F1YWxuYW1lX1+UaA6MD19fYW5ub3RhdGlvbnNfX5R9lIwOX19rd2RlZmF1bHRzX1+U\nTowMX19kZWZhdWx0c19flE6MCl9fbW9kdWxlX1+UaBWMB19fZG9jX1+UTowLX19jbG9zdXJlX1+U\nTowXX2Nsb3VkcGlja2xlX3N1Ym1vZHVsZXOUXZSMC19fZ2xvYmFsc19flH2UdYaUhlIwLg==\n'}}, 'lambda_function_1': {'cls': 'vaex.expression.FunctionToScalar', 'state': {'pickled': 'gASV8QEAAAAAAACMF2Nsb3VkcGlja2xlLmNsb3VkcGlja2xllIwNX2J1aWx0aW5fdHlwZZSTlIwK\nTGFtYmRhVHlwZZSFlFKUKGgCjAhDb2RlVHlwZZSFlFKUKEsBSwBLAUsCS0NDCHwAZAEXAFMAlE5L\nAYaUKYwBeJSFlIxOL3Zhci9mb2xkZXJzL2diL2t0MjgybjFkMGZsOV9saDlxY244eTc5dzAwMDBn\ncC9UL2lweWtlcm5lbF85NTMyNC8xOTA0NTI5MjYyLnB5lIwLc2ltcGxlX2Z1bmOUSwFDAgABlCkp\ndJRSlH2UKIwLX19wYWNrYWdlX1+UTowIX19uYW1lX1+UjAhfX21haW5fX5R1Tk5OdJRSlIwcY2xv\ndWRwaWNrbGUuY2xvdWRwaWNrbGVfZmFzdJSMEl9mdW5jdGlvbl9zZXRzdGF0ZZSTlGgXfZR9lCho\nFGgOjAxfX3F1YWxuYW1lX1+UaA6MD19fYW5ub3RhdGlvbnNfX5R9lIwOX19rd2RlZmF1bHRzX1+U\nTowMX19kZWZhdWx0c19flE6MCl9fbW9kdWxlX1+UaBWMB19fZG9jX1+UTowLX19jbG9zdXJlX1+U\nTowXX2Nsb3VkcGlja2xlX3N1Ym1vZHVsZXOUXZSMC19fZ2xvYmFsc19flH2UdYaUhlIwLg==\n'}}}, 'selections': {'__filter__': None}, 'ucds': {}, 'units': {}, 'descriptions': {}, 'description': None, 'active_range': [0, 4]}
In the example above, there is no need for lambda_function_1: its pickled payload is identical to that of lambda_function.
Hi,
Good idea. We need to see whether we can actually test for function equality; I have never looked into that.
Regards,
Maarten
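One way to sidestep a general function-equality test might be to compare the serialized entries that already sit in the state, since the two payloads in the example are byte-identical. The sketch below is a post-processing step on a dict shaped like the state_get() output above. dedupe_state_functions is a hypothetical helper, not part of vaex, and the "PAYLOAD" strings are placeholders standing in for the real base64 payloads. It assumes function calls appear in virtual column expressions as name(...), as they do in the example.

```python
import copy
import json
import re


def dedupe_state_functions(state):
    """Return a copy of `state` where functions with identical serialized
    entries are stored once. Hypothetical helper, not a vaex API."""
    state = copy.deepcopy(state)
    first_seen = {}  # serialized entry -> first function name that used it
    renames = {}     # duplicate function name -> name that is kept
    kept = {}
    for name, entry in state.get("functions", {}).items():
        # Entries are plain dicts of strings, so a sorted JSON dump is a stable key.
        key = json.dumps(entry, sort_keys=True)
        if key in first_seen:
            renames[name] = first_seen[key]
        else:
            first_seen[key] = name
            kept[name] = entry
    state["functions"] = kept

    # Point virtual columns at the kept name. The pattern matches the exact
    # name followed by "(", so lambda_function_1 never matches lambda_function_10.
    rewritten = {}
    for column, expression in state.get("virtual_columns", {}).items():
        for old, new in renames.items():
            expression = re.sub(r"\b" + re.escape(old) + r"\(", new + "(", expression)
        rewritten[column] = expression
    state["virtual_columns"] = rewritten
    return state


# Trimmed-down state with placeholder payloads (not real pickles).
entry = {"cls": "vaex.expression.FunctionToScalar", "state": {"pickled": "PAYLOAD"}}
state = {
    "virtual_columns": {
        "col1_1": "lambda_function(col1)",
        "col2_1": "lambda_function_1(col2)",
    },
    "functions": {
        "lambda_function": dict(entry),
        "lambda_function_1": dict(entry),
    },
}
compact = dedupe_state_functions(state)
print(compact["virtual_columns"])
# {'col1_1': 'lambda_function(col1)', 'col2_1': 'lambda_function(col2)'}
print(list(compact["functions"]))
# ['lambda_function']
```

Identical payloads only catch the easy case: two functions that behave the same but serialize differently would still be stored twice. Whether a state rewritten this way round-trips through state_set() has not been checked here.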