Inefficient use of memory in code: dataframe copies

svdhoog commented 6 years ago

visualization/main.py, lines 189-208:

[*]        d = agent_dframes[param['agent']]  # comment: this can be replaced in line below to save memory, here now just for simplicity

        # check if table columns contain the given variables from config file
        for i, entry in enumerate(var_list):
            if not (entry in list(d)):
                erf("Table has columns {0} and var{1}='{2}' does not match.".format(list(d), i+1, entry))

        # stage-I filtering, all input vars are sliced with desired set & run values
[**]   filtered = d.iloc[(d.index.get_level_values('set').isin(param['set'])) & (d.index.get_level_values('run').isin(param['run'])) & (d.index.get_level_values('major').isin(param['major'])) & (d.index.get_level_values('minor').isin(param['minor']))][var_list].dropna().astype(float)

        df_main = pd.DataFrame()
        index1 = 0
        for dkey, dval in var_dic.items():
            df = filter_by_value(dkey, dval, filtered)  # stage-II filtering for selecting variables according to their values
            if df_main.empty:
                df_main = df
            else:
                df_main = pd.concat([df_main, df], axis=1)
[***]       del df

[*] line 189: This appears to copy the entire data frame into the variable d. Could this be resolved by substituting the right-hand side of d= directly into the lines below?

[**] line 197: This appears to create another data frame, filtered, which is used only once in the lines below, in line 202.

[***] Here df is deleted, after the filtered data frame it refers to was copied into df_main. Isn't this inefficient copying of data?
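
For reference, the stage-I filter at [**] can be reproduced on toy data. This is only a sketch: the index values and the price column are made up, only three of the four index levels are used to keep it short, and .loc with a boolean mask stands in for the .iloc call in main.py. It shows that the filter builds one boolean mask from the index levels and returns a new, smaller frame.

```python
import pandas as pd

# Toy frame with a (set, run, major) MultiIndex; all values are invented.
index = pd.MultiIndex.from_product(
    [[1, 2], [1, 2], [0, 20]], names=["set", "run", "major"]
)
d = pd.DataFrame({"price": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]}, index=index)
param = {"set": [1], "run": [1, 2], "major": [20]}

# One boolean array per index level, combined element-wise with &.
mask = (
    d.index.get_level_values("set").isin(param["set"])
    & d.index.get_level_values("run").isin(param["run"])
    & d.index.get_level_values("major").isin(param["major"])
)

# Select the matching rows and the requested columns; this allocates a new frame.
filtered = d.loc[mask, ["price"]].dropna().astype(float)
```

Here filtered holds only the rows with set 1 and major 20, so it is typically much smaller than d.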

svdhoog commented 6 years ago

Proposed change could be:

          # comment: d was replaced by the line below to save memory
[*]       # d = agent_dframes[param['agent']]  

        # check if table columns contain the given variables from config file
        for i, entry in enumerate(var_list):
            if not (entry in list(agent_dframes[param['agent']])):
                erf("Table has columns {0} and var{1}='{2}' does not match.".format(list(agent_dframes[param['agent']]), i+1, entry))

        # stage-I filtering, all input vars are sliced with desired set & run values
[**]   filtered = agent_dframes[param['agent']].iloc[(agent_dframes[param['agent']].index.get_level_values('set').isin(param['set'])) & (agent_dframes[param['agent']].index.get_level_values('run').isin(param['run'])) & (agent_dframes[param['agent']].index.get_level_values('major').isin(param['major'])) & (agent_dframes[param['agent']].index.get_level_values('minor').isin(param['minor']))][var_list].dropna().astype(float)

        df_main = pd.DataFrame()
        index1 = 0
        # stage-II filtering for selecting variables according to their values
        for dkey, dval in var_dic.items():
            df = filter_by_value(dkey, dval, filtered)  
            if df_main.empty:
                df_main = df
            else:
                df_main = pd.concat([df_main, df], axis=1)
[***]       del df

svdhoog commented 6 years ago

2nd case:

visualization/main.py, lines 161-163:

        d = pd.DataFrame()  # Main dataframe to hold all the dataframes of each instance (one agenttype)
        df_list = []
                  ... [constructing df_list]
[*]     d = pd.concat(df_list)  # Add each dataframe from panel into a main dataframe containing all sets and runs
[**]    del df_list
[***]   agent_dframes[agentname] = d  # this dict contains agent-type names as keys, and the corresponding dataframes as values

[*] Here df_list is concatenated into d.
[**] Then df_list is deleted.
[***] Now d gets copied into agent_dframes[agentname].

Can [***] not be made more efficient?

Proposed code change:

[***]   agent_dframes[agentname] = pd.concat(df_list) # like at [*] we concat df_list
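
A small check of where memory is actually allocated in this second case, as a sketch on toy data (the frames and the agent name "Firm" are made up). It suggests that pd.concat is the step that allocates a new frame, while storing d in the dict only stores a reference, so both variants should use about the same memory.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the per-instance frames built in main.py (values are made up).
df_list = [
    pd.DataFrame({"x": [1.0, 2.0]}),
    pd.DataFrame({"x": [3.0, 4.0]}),
]
agent_dframes = {}
agentname = "Firm"  # hypothetical agent-type name

d = pd.concat(df_list)          # allocates a new frame holding all rows
agent_dframes[agentname] = d    # stores a reference to that same frame

# The dict entry and d are the same object, not two copies.
stored_is_d = agent_dframes[agentname] is d

# The concatenated data lives in new memory, separate from the inputs.
shares_input_memory = np.shares_memory(
    d["x"].to_numpy(), df_list[0]["x"].to_numpy()
)
```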

svdhoog commented 3 years ago

Python does not copy the entire data frame in memory. The assignment below only binds the name d to the DataFrame object already stored in agent_dframes, so d is a reference to the same data:

d = agent_dframes[param['agent']]
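
This can be checked directly. A minimal sketch, assuming a toy dict in place of the real agent_dframes (the "Firm" key and the price column are made up): a change made through d is visible through the dict, whereas an explicit .copy() produces an independent frame.

```python
import pandas as pd

# Toy stand-in for the agent_dframes dict in main.py (contents are made up).
agent_dframes = {"Firm": pd.DataFrame({"price": [1.0, 2.0, 3.0]})}
param = {"agent": "Firm"}

d = agent_dframes[param["agent"]]  # binds a second name; no data is copied

same_object = d is agent_dframes[param["agent"]]

# A change made through d shows up when reading through the dict.
d.loc[0, "price"] = 10.0
seen_via_dict = agent_dframes["Firm"].loc[0, "price"]

# An explicit copy is a different object with its own data.
c = d.copy()
c.loc[1, "price"] = -1.0
original_untouched = d.loc[1, "price"] == 2.0
```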

The only inefficiency here is that we are creating a new DataFrame df containing the filtered data that then gets concatenated to df_main:

for dkey, dval in var_dic.items():
            df = filter_by_value(dkey, dval, filtered)  
            if df_main.empty:
                df_main = df
            else:
                df_main = pd.concat([df_main, df], axis=1)
[***]       del df

A more efficient implementation, removing the intermediate DataFrame df:

for dkey, dval in var_dic.items():
            if df_main.empty:
                df_main = filter_by_value(dkey, dval, filtered)
            else:
                df_main = pd.concat([df_main, filter_by_value(dkey, dval, filtered)], axis=1)
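
Since the name df is itself only a reference, the allocations in this loop come from pd.concat, which builds a new frame on every call. One possible alternative, sketched here under assumptions, is to collect the filtered pieces in a list and concatenate once. The filter_by_value below is a hypothetical stand-in, not the FLAViz function, and the data is made up.

```python
import pandas as pd

def filter_by_value(dkey, dval, filtered):
    """Hypothetical stand-in for FLAViz's filter_by_value: keep one column,
    restricted to the rows whose value is at least dval."""
    col = filtered[dkey]
    return col[col >= dval].to_frame()

# Toy inputs; values are invented for illustration.
filtered = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
var_dic = {"a": 2.0, "b": 0.0}

# Loop version from the thread: pd.concat is called inside the loop.
df_main = pd.DataFrame()
for dkey, dval in var_dic.items():
    if df_main.empty:
        df_main = filter_by_value(dkey, dval, filtered)
    else:
        df_main = pd.concat([df_main, filter_by_value(dkey, dval, filtered)], axis=1)

# Alternative: collect the pieces first, then concatenate a single time.
pieces = [filter_by_value(dkey, dval, filtered) for dkey, dval in var_dic.items()]
df_once = pd.concat(pieces, axis=1)

same_result = df_main.equals(df_once)
```

With many entries in var_dic, the single concat avoids rebuilding the growing df_main on each iteration.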

svdhoog / FLAViz

Inefficient use of memory in code: dataframe copies #19