plasma-umass / scalene

Scalene: a high-performance, high-precision CPU, GPU, and memory profiler for Python with AI-powered optimization proposals
Apache License 2.0
12.24k stars 399 forks source link

Missing memory profiling in a function (with contrast to `memory_profiler`) #167

Open ianozsvald opened 3 years ago

ianozsvald commented 3 years ago

memory_profiler and scalene given different depths of coverage of memory profiling on a simple Pandas example, Scalene gives less information, I think this is a bug.

The sample code is listed in full below. I include full outputs from the code for both profilers. Specifically look at get_mean_for_indicator_poor where memory_profiler identifies line 19 as costing 850MB whilst Scalene identifies nothing. In the get_mean_for_indicator_better function both profilers correctly identify 27 as being expensive.

@emeryberger you may recognise this code as being a variant of https://github.com/pandas-dev/pandas/issues/37139

Scalene output (using Scalene 1.2.4):

                                                                  Memory usage: ▁▁▁▁▄▄▇▇▇████████ (max:   2.07GB, growth rate:  80%)                                                                  
                                                                      dataframe_example.py: % of time = 100.00% out of   3.70s.                                                                       
        ╷        ╷       ╷    ╷       ╷      ╷              ╷       ╷                                                                                                                                 
   Line │Time %  │Time % │Sys │Mem %  │Net   │Memory usage  │Copy   │                                                                                                                                 
        │Python  │native │%   │Python │(MB)  │over time / % │(MB/s) │dataframe_example.py                                                                                                             
╺━━━━━━━┿━━━━━━━━┿━━━━━━━┿━━━━┿━━━━━━━┿━━━━━━┿━━━━━━━━━━━━━━┿━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸
      1 │        │       │    │       │      │              │       │import time                                                                                                                      
      2 │    3%  │   16% │    │    3% │   42 │▁▁▁▁▁▁  3%    │       │import numpy as np                                                                                                               
      3 │    5%  │       │    │    3% │   17 │▁▁▁▁▁  4%     │       │import pandas as pd                                                                                                              
      4 │        │       │    │       │      │              │       │                                                                                                                                 
      5 │        │       │    │       │      │              │       │# this assumes you have memory_profiler installed                                                                                
      6 │        │       │    │       │      │              │       │# if you want to use "@profile" on a function                                                                                    
      7 │        │       │    │       │      │              │       │# if not, we can ignore it with a pass-through decorator                                                                         
      8 │        │       │    │       │      │              │       │if 'profile' not in dir():                                                                                                       
      9 │        │       │    │       │      │              │       │    def profile(fn):                                                                                                             
     10 │        │       │    │       │      │              │       │        return fn                                                                                                                
     11 │        │       │    │       │      │              │       │                                                                                                                                 
     12 │        │       │    │       │      │              │       │SIZE = 10_000_000                                                                                                                
     13 │        │       │    │       │      │              │       │                                                                                                                                 
     14 │        │       │    │       │      │              │       │                                                                                                                                 
     15 │        │       │    │       │      │              │       │@profile                                                                                                                         
     16 │        │       │    │       │      │              │       │def get_mean_for_indicator_poor(df, indicator):                                                                                  
     17 │        │       │    │       │      │              │       │    # poor way to use a groupby here, causes big allocation                                                                      
     18 │        │       │    │       │      │              │       │    gpby = df.groupby('indicator')                                                                                               
     19 │        │   13% │    │       │      │              │       │    means = gpby.mean() # means by column                                                                                        
     20 │        │       │    │       │      │              │       │    means_for_ind = means.loc[indicator]                                                                                         
     21 │        │       │    │       │      │              │       │    total = means_for_ind.sum()                                                                                                  
     22 │        │       │    │       │      │              │       │    return total                                                                                                                 
     23 │        │       │    │       │      │              │       │                                                                                                                                 
     24 │        │       │    │       │      │              │       │@profile                                                                                                                         
     25 │        │       │    │       │      │              │       │def get_mean_for_indicator_better(df, indicator, rnd_cols):                                                                      
     26 │        │       │    │       │      │              │       │    # more memory efficient and faster way to solve this challenge                                                               
     27 │    3%  │   12% │    │       │  240 │▁▁▁▁▁▁▁▁▁ 11% │       │    df_sub = df.query('indicator==@indicator')[rnd_cols]                                                                         
     28 │        │       │    │       │      │              │       │    means_for_ind = df_sub.mean() # means by column                                                                              
     29 │        │       │    │       │      │              │       │    total = means_for_ind.sum() # sum of rows                                                                                    
     30 │        │       │    │       │      │              │       │    return total                                                                                                                 
     31 │        │       │    │       │      │              │       │                                                                                                                                 
     32 │        │       │    │       │      │              │       │                                                                                                                                 
     33 │        │       │    │       │      │              │       │@profile                                                                                                                         
     34 │        │       │    │       │      │              │       │def run():                                                                                                                       
     35 │        │   21% │    │       │  763 │▂ 35%         │       │    arr = np.random.random((SIZE, 10))                                                                                           
     36 │        │       │    │       │      │              │       │    print(f"{arr.shape} shape for our array")                                                                                    
     37 │        │       │    │       │      │              │       │    df = pd.DataFrame(arr)                                                                                                       
     38 │        │       │    │       │      │              │       │    rnd_cols = [f"c_{n}" for n in df.columns]                                                                                    
     39 │        │       │    │       │      │              │       │    df.columns = rnd_cols                                                                                                        
     40 │        │       │    │       │      │              │       │                                                                                                                                 
     41 │        │       │    │       │      │              │       │    # make a big dataframe with an indicator column and lots of random data                                                      
     42 │        │    5% │    │       │  151 │▁▁▁▁  7%      │       │    df2 = pd.DataFrame({'indicator' : np.random.randint(0, 10, SIZE)})                                                           
     43 │        │       │    │       │      │              │       │    # deliberately overwrite the first df                                                                                        
     44 │        │   18% │    │       │  839 │▂▂▂▃ 39%      │       │    df = pd.concat((df2, df), axis=1) # PART OF DEMO - unexpected copy=True forces an expensive copy                             
     45 │        │       │    │       │      │              │       │    print("Head of our df:")                                                                                                     
     46 │        │       │    │       │    0 │▁▁▁▁▁▁        │       │    print(df.head())                                                                                                             
     47 │        │       │    │       │      │              │       │                                                                                                                                 
     48 │        │       │    │       │      │              │       │    print("Print results to check that we get the result")                                                                       
     49 │        │       │    │       │      │              │       │    indicator = 2                                                                                                                
     50 │        │       │    │       │      │              │       │    print(f"Mean for indicator {indicator} on better implementation {get_mean_for_indicator_better(df, indicator, rnd_cols):0.5  
     51 │        │       │    │       │      │              │       │    print(f"Mean for indicator {indicator} on poor implementation: {get_mean_for_indicator_poor(df, indicator):0.5f}")           
     52 │        │       │    │       │      │              │       │                                                                                                                                 
     53 │        │       │    │       │      │              │       │                                                                                                                                 
     54 │        │       │    │       │      │              │       │if __name__ == "__main__":                                                                                                       
     55 │        │       │    │       │      │              │       │    run()                                                                                                                        
        │        │       │    │       │      │              │       │                                                                                                                                 
╶───────┼────────┼───────┼────┼───────┼──────┼──────────────┼───────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╴
        │        │       │    │       │      │              │       │function summary                                                                                                                 
   1386 │    8%  │   16% │    │    6% │   59 │▂▂▂▃▃▁  7%    │       │profile_code                                                                                                                     
     33 │    2%  │   44% │    │       │ 1754 │██████ 82%    │       │run                                                                                                                              
     24 │    3%  │   13% │    │       │  240 │█████████ 11% │       │get_mean_for_indicator_better                                                                                                    
        │        │   13% │    │       │      │              │       │get_mean_for_indicator_poor                                                                                                      
        ╵        ╵       ╵    ╵       ╵      ╵              ╵       ╵                                                                                                                                 
Top net memory consumption, by line:
(1)    44:   839 MB
(2)    35:   763 MB
(3)    27:   240 MB
(4)    42:   151 MB
(5)     2:    42 MB

memory_profiler output:

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    15 1834.512 MiB 1834.512 MiB           1   @profile
    16                                         def get_mean_for_indicator_poor(df, indicator):
    17                                             # poor way to use a groupby here, causes big allocation
    18 1834.512 MiB    0.000 MiB           1       gpby = df.groupby('indicator')
    19 2682.523 MiB  848.012 MiB           1       means = gpby.mean() # means by column
    20 2682.523 MiB    0.000 MiB           1       means_for_ind = means.loc[indicator]
    21 2682.523 MiB    0.000 MiB           1       total = means_for_ind.sum()
    22 2682.523 MiB    0.000 MiB           1       return total

Filename: dataframe_example.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    24 1757.125 MiB 1757.125 MiB           1   @profile
    25                                         def get_mean_for_indicator_better(df, indicator, rnd_cols):
    26                                             # more memory efficient and faster way to solve this challenge
    27 1918.480 MiB  161.355 MiB           1       df_sub = df.query('indicator==@indicator')[rnd_cols]
    28 1918.480 MiB    0.000 MiB           1       means_for_ind = df_sub.mean() # means by column
    29 1918.480 MiB    0.000 MiB           1       total = means_for_ind.sum() # sum of rows
    30 1918.480 MiB    0.000 MiB           1       return total

Filename: dataframe_example.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    33   77.156 MiB   77.156 MiB           1   @profile
    34                                         def run():
    35  840.086 MiB  762.930 MiB           1       arr = np.random.random((SIZE, 10))
    36  840.086 MiB    0.000 MiB           1       print(f"{arr.shape} shape for our array")
    37  840.086 MiB    0.000 MiB           1       df = pd.DataFrame(arr)
    38  840.086 MiB    0.000 MiB          13       rnd_cols = [f"c_{n}" for n in df.columns]
    39  840.086 MiB    0.000 MiB           1       df.columns = rnd_cols
    40                                         
    41                                             # make a big dataframe with an indicator column and lots of random data
    42  916.801 MiB   76.715 MiB           1       df2 = pd.DataFrame({'indicator' : np.random.randint(0, 10, SIZE)})
    43                                             # deliberately overwrite the first df
    44 1756.141 MiB  839.340 MiB           1       df = pd.concat((df2, df), axis=1) # PART OF DEMO - unexpected copy=True forces an expensive copy
    45 1756.141 MiB    0.000 MiB           1       print("Head of our df:")
    46 1757.125 MiB    0.984 MiB           1       print(df.head())
    47                                             
    48 1757.125 MiB    0.000 MiB           1       print("Print results to check that we get the result")
    49 1757.125 MiB    0.000 MiB           1       indicator = 2
    50 1834.512 MiB 1834.512 MiB           1       print(f"Mean for indicator {indicator} on better implementation {get_mean_for_indicator_better(df, indicator, rnd_cols):0.5f}")
    51 1843.402 MiB 1843.402 MiB           1       print(f"Mean for indicator {indicator} on poor implementation: {get_mean_for_indicator_poor(df, indicator):0.5f}")

I'm using Linux Mint 20.1 (Cinnamon):

$ inxi
CPU: Quad Core Intel Core i7-6700HQ (-MT MCP-) speed/min/max: 1412/800/2600 MHz Kernel: 5.4.0-66-generic x86_64 
Up: 19d 21h 34m Mem: 8970.6/31876.3 MiB (28.1%) Storage: 953.87 GiB (50.7% used) Procs: 320 Shell: bash 5.0.17 inxi: 3.0.38

Full source:

import time
import numpy as np
import pandas as pd

# this assumes you have memory_profiler installed
# if you want to use "@profile" on a function
# if not, we can ignore it with a pass-through decorator
if 'profile' not in dir():
    def profile(fn):
        return fn 

SIZE = 10_000_000

@profile
def get_mean_for_indicator_poor(df, indicator):
    # poor way to use a groupby here, causes big allocation
    gpby = df.groupby('indicator')
    means = gpby.mean() # means by column
    means_for_ind = means.loc[indicator]
    total = means_for_ind.sum()
    return total

@profile
def get_mean_for_indicator_better(df, indicator, rnd_cols):
    # more memory efficient and faster way to solve this challenge
    df_sub = df.query('indicator==@indicator')[rnd_cols]
    means_for_ind = df_sub.mean() # means by column
    total = means_for_ind.sum() # sum of rows
    return total

@profile
def run():
    arr = np.random.random((SIZE, 10))
    print(f"{arr.shape} shape for our array")
    df = pd.DataFrame(arr)
    rnd_cols = [f"c_{n}" for n in df.columns]
    df.columns = rnd_cols

    # make a big dataframe with an indicator column and lots of random data
    df2 = pd.DataFrame({'indicator' : np.random.randint(0, 10, SIZE)})
    # deliberately overwrite the first df
    df = pd.concat((df2, df), axis=1) # PART OF DEMO - unexpected copy=True forces an expensive copy
    print("Head of our df:")
    print(df.head())

    print("Print results to check that we get the result")
    indicator = 2
    print(f"Mean for indicator {indicator} on better implementation {get_mean_for_indicator_better(df, indicator, rnd_cols):0.5f}")
    print(f"Mean for indicator {indicator} on poor implementation: {get_mean_for_indicator_poor(df, indicator):0.5f}")

if __name__ == "__main__":
    run()
emeryberger commented 3 years ago

Can you test with the repo version? (Just verified that it works for me; we're planning on releasing this soon as 1.3.1.)

python3 -m pip install -U git+https://github.com/plasma-umass/scalene should do the trick.

emeryberger commented 3 years ago

...and now in version 1.3.1, on pip.

emeryberger commented 3 years ago

Since I believe this is fixed, closing. @ianozsvald please re-open if it's not fixed on your end!

ianozsvald commented 3 years ago

I've upgraded to 1.3.2. I note that on each run I get a different result now - perhaps this is a consequence of using a sampling profiler? Notably:


    15 │       │       │       │       │       │       │              │       │@profile                                                                                                                     
    16 │       │       │       │       │       │       │              │       │def get_mean_for_indicator_poor(df, indicator):                                                                              
    17 │       │       │       │       │       │       │              │       │    gpby = df.groupby('indicator')                                                                                           
    18 │    2% │   10% │       │       │       │829.6M │▃▃▃▃▃▃▃  29%  │       │    means = gpby.mean() # means by column                                                                                    
    19 │       │       │       │       │       │       │              │       │    means_for_ind = means.loc[indicator]                                                                                     
    20 │       │       │       │       │       │       │              │       │    total = means_for_ind.sum()                                                                                              
    21 │       │       │       │       │       │       │              │       │    return total                                                                                                             
    22 │       │       │       │       │       │       │              │       │                                                                                                                             
    23 │       │       │       │       │       │       │              │       │@profile                                                                                                                     
    24 │       │       │       │       │       │       │              │       │def get_mean_for_indicator_better(df, indicator, rnd_cols):                                                                  
    25 │    2% │    9% │       │       │       │416.7M │▁▁▁▁▁▁▁▁  13% │       │    df_sub = df.query('indicator==@indicator')[rnd_cols]                                                                     
    26 │       │       │       │       │       │       │              │       │    means_for_ind = df_sub.mean() # means by column                                                                          
    27 │       │       │       │       │       │       │              │       │    total = means_for_ind.sum() # sum of rows                                                                                
    28 │       │       │       │       │       │       │              │       │    return total  

and


    15 │       │       │       │       │       │       │              │       │@profile                                                                                                                     
    16 │       │       │       │       │       │       │              │       │def get_mean_for_indicator_poor(df, indicator):                                                                              
    17 │       │       │       │       │       │       │              │       │    gpby = df.groupby('indicator')                                                                                           
    18 │       │   11% │       │       │       │498.7M │▃▃▃▃▃▃▂▂  29% │       │    means = gpby.mean() # means by column                                                                                    
    19 │       │       │       │       │       │       │              │       │    means_for_ind = means.loc[indicator]                                                                                     
    20 │       │       │       │       │       │       │              │       │    total = means_for_ind.sum()                                                                                              
    21 │       │       │       │       │       │       │              │       │    return total                                                                                                             
    22 │       │       │       │       │       │       │              │       │                                                                                                                             
    23 │       │       │       │       │       │       │              │       │@profile                                                                                                                     
    24 │       │       │       │       │       │       │              │       │def get_mean_for_indicator_better(df, indicator, rnd_cols):                                                                  
    25 │    2% │    9% │       │       │   1%  │417.5M │▁▁▁▁  13%     │       │    df_sub = df.query('indicator==@indicator')[rnd_cols]                                                                     
    26 │       │       │       │       │       │       │              │       │    means_for_ind = df_sub.mean() # means by column                                                                          
    27 │       │       │       │       │       │       │              │       │    total = means_for_ind.sum() # sum of rows                                                                                
    28 │       │       │       │       │       │       │              │       │    return total  

I'm using the same code as before. get_mean_for_indicator_better seems stable, on 5 runs I get the same answer, get_mean_for_indicator_poor is more variable (I recorded 498M, 829M, 617M, 617M, 948M on 5 consecutive runs).

The Memory usage: .... (max: XXX output for the 5 runs was pretty similar at circa 2.95GB-3.02GB.

Could you comment on the reason for the variability? If it is due to sampling - is there a way to make the sampling occur more frequently? I'm asking partly for academic interest (when I'm teaching, as I have done using scalene recently) and partly because this sort of variability when diagnosing Pandas would hamper efforts of folk to try to figure out what the heck Pandas is doing :-)

emeryberger commented 3 years ago

@ianozsvald thanks for the report - we are looking into it!

emeryberger commented 3 years ago

Belatedly: I believe this has stabilized. @ianozsvald can you give it another try? Right now, there is no way to make sampling occur more frequently but this is something we can look into if the current status isn't quite there. Thanks!

emeryberger commented 3 years ago

Please install from the repo version for now: pip install -U git+https://github.com/plasma-umass/scalene - thanks!

ianozsvald commented 3 years ago

Sorry for the delay, I'm testing in prep for my next Higher Performance Python class (frustratingly on Windows for a pension fund, so I only get a slide demo of Scalene this time).

Using Scalene 1.3.6 the problematic inconsistency has gone away and this is a good example of 5 runs: image

memory_profiler still has different numbers:

    18 2683.125 MiB  847.988 MiB           1       means = gpby.mean() # means by column
...
    25 1918.949 MiB  162.168 MiB           1       df_sub = df.query('indicator==@indicator')[rnd_cols]

I've renamed the function names (variant 1 & 2 rather than better & poor) but otherwise that part hasn't changed.

What has changed is I now also delete a column, Scalene records nothing here (!) but memory_profiler records a 680MB cost (it looks as though the block manager is duplicating columns internally, freeing them later, dependant on how you delete columns and on the state of the block manager...non trivial stuff):

    # Scalene                                                                                                                   
    51 │       │       │       │       │       │       │              │       │    # now start some clean-up pending some further (not yet written) work                                                                                     
    52 │       │       │       │       │       │       │              │       │    print("Prior to cleaning-up we have:", df.info())                                                                                                         
    53 │       │       │       │       │       │       │              │       │    del df['c_0'] # we don't actually need this column so we'll save some RAM                                                                                 
    54 │       │   13% │       │       │       │       │              │   168 │    print("After cleaning up we have:", df.info()) 

   # memory_profiler
    51                                             # now start some clean-up pending some further (not yet written) work
    52 1844.055 MiB    0.000 MiB           1       print("Prior to cleaning-up we have:", df.info())
    53 2530.629 MiB  686.574 MiB           1       del df['c_0'] # we don't actually need this column so we'll save some RAM
    54 2530.629 MiB    0.000 MiB           1       print("After cleaning up we have:", df.info())

There are reports in the Pandas bug tracker using memory_profiler to identify this sort of issue...I wonder if we can confirm it with Scalene or show that actually something else is happening? del ... or df.drop(columns=['c_0'], inplace=True) do the same job and have different memory costs.