ianozsvald opened this issue 3 years ago
Can you test with the repo version? (Just verified that it works for me; we're planning on releasing this soon as 1.3.1.)
python3 -m pip install -U git+https://github.com/plasma-umass/scalene
should do the trick.
...and now in version 1.3.1, on pip.
Since I believe this is fixed, closing. @ianozsvald please re-open if it's not fixed on your end!
I've upgraded to 1.3.2. I note that on each run I get a different result now - perhaps this is a consequence of using a sampling profiler? Notably:
15 │ │ │ │ │ │ │ │ │@profile
16 │ │ │ │ │ │ │ │ │def get_mean_for_indicator_poor(df, indicator):
17 │ │ │ │ │ │ │ │ │ gpby = df.groupby('indicator')
18 │ 2% │ 10% │ │ │ │829.6M │▃▃▃▃▃▃▃ 29% │ │ means = gpby.mean() # means by column
19 │ │ │ │ │ │ │ │ │ means_for_ind = means.loc[indicator]
20 │ │ │ │ │ │ │ │ │ total = means_for_ind.sum()
21 │ │ │ │ │ │ │ │ │ return total
22 │ │ │ │ │ │ │ │ │
23 │ │ │ │ │ │ │ │ │@profile
24 │ │ │ │ │ │ │ │ │def get_mean_for_indicator_better(df, indicator, rnd_cols):
25 │ 2% │ 9% │ │ │ │416.7M │▁▁▁▁▁▁▁▁ 13% │ │ df_sub = df.query('indicator==@indicator')[rnd_cols]
26 │ │ │ │ │ │ │ │ │ means_for_ind = df_sub.mean() # means by column
27 │ │ │ │ │ │ │ │ │ total = means_for_ind.sum() # sum of rows
28 │ │ │ │ │ │ │ │ │ return total
and
15 │ │ │ │ │ │ │ │ │@profile
16 │ │ │ │ │ │ │ │ │def get_mean_for_indicator_poor(df, indicator):
17 │ │ │ │ │ │ │ │ │ gpby = df.groupby('indicator')
18 │ │ 11% │ │ │ │498.7M │▃▃▃▃▃▃▂▂ 29% │ │ means = gpby.mean() # means by column
19 │ │ │ │ │ │ │ │ │ means_for_ind = means.loc[indicator]
20 │ │ │ │ │ │ │ │ │ total = means_for_ind.sum()
21 │ │ │ │ │ │ │ │ │ return total
22 │ │ │ │ │ │ │ │ │
23 │ │ │ │ │ │ │ │ │@profile
24 │ │ │ │ │ │ │ │ │def get_mean_for_indicator_better(df, indicator, rnd_cols):
25 │ 2% │ 9% │ │ │ 1% │417.5M │▁▁▁▁ 13% │ │ df_sub = df.query('indicator==@indicator')[rnd_cols]
26 │ │ │ │ │ │ │ │ │ means_for_ind = df_sub.mean() # means by column
27 │ │ │ │ │ │ │ │ │ total = means_for_ind.sum() # sum of rows
28 │ │ │ │ │ │ │ │ │ return total
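For reference, here is a minimal, self-contained reconstruction of the two functions as they appear in the listings above. The DataFrame construction (row count, column count, random data) is my assumption, since the full source is only attached at the end of this issue, and the @profile decorators are omitted so the sketch runs standalone.

import numpy as np
import pandas as pd

def make_df(n_rows=1_000_000, n_cols=50):
    # Assumed setup: many random float columns plus a low-cardinality
    # 'indicator' column to group and filter on.
    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.random((n_rows, n_cols)),
                      columns=[f"c_{i}" for i in range(n_cols)])
    df["indicator"] = rng.integers(0, 10, size=n_rows)
    return df

def get_mean_for_indicator_poor(df, indicator):
    gpby = df.groupby('indicator')
    means = gpby.mean()            # means by column, for every indicator
    means_for_ind = means.loc[indicator]
    total = means_for_ind.sum()
    return total

def get_mean_for_indicator_better(df, indicator, rnd_cols):
    df_sub = df.query('indicator==@indicator')[rnd_cols]
    means_for_ind = df_sub.mean()  # means by column
    total = means_for_ind.sum()    # sum of rows
    return total

df = make_df()
rnd_cols = [c for c in df.columns if c != "indicator"]
print(get_mean_for_indicator_poor(df, 3))
print(get_mean_for_indicator_better(df, 3, rnd_cols))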
I'm using the same code as before. get_mean_for_indicator_better seems stable (on 5 runs I get the same answer); get_mean_for_indicator_poor is more variable (I recorded 498M, 829M, 617M, 617M, 948M on 5 consecutive runs). The Memory usage: .... (max: XXX output for the 5 runs was pretty similar, at circa 2.95GB-3.02GB.
Could you comment on the reason for the variability? If it is due to sampling - is there a way to make the sampling occur more frequently? I'm asking partly for academic interest (when I'm teaching, as I have done using scalene recently) and partly because this sort of variability when diagnosing Pandas would hamper efforts of folk to try to figure out what the heck Pandas is doing :-)
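As a side note for readers: below is a toy sketch (it assumes psutil is installed, and it is not how Scalene actually tracks memory, which intercepts allocations rather than polling) of why any sampling-based measurement can report different numbers on different runs: whether a sample lands inside a short-lived allocation spike is down to timing.

import os
import threading
import time

import psutil

def sample_rss(stop_event, samples, interval=0.02):
    # Poll the process resident set size at a fixed interval.
    proc = psutil.Process(os.getpid())
    while not stop_event.is_set():
        samples.append(proc.memory_info().rss)
        time.sleep(interval)

def spiky_work():
    # Repeatedly allocate and immediately release a large temporary,
    # loosely like the intermediates created inside groupby().mean().
    for _ in range(5):
        tmp = bytearray(200 * 1024 * 1024)  # ~200 MB, freed straight away
        del tmp
        time.sleep(0.01)

stop_event, samples = threading.Event(), []
sampler = threading.Thread(target=sample_rss, args=(stop_event, samples))
sampler.start()
spiky_work()
stop_event.set()
sampler.join()
# The maximum the sampler happens to see depends on where its samples land
# relative to the short-lived spikes, so it varies from run to run.
print(f"max sampled RSS: {max(samples) / 1e6:.0f} MB from {len(samples)} samples")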
@ianozsvald thanks for the report - we are looking into it!
Belatedly: I believe this has stabilized. @ianozsvald can you give it another try? Right now, there is no way to make sampling occur more frequently but this is something we can look into if the current status isn't quite there. Thanks!
Please install from the repo version for now: pip install -U git+https://github.com/plasma-umass/scalene - thanks!
Sorry for the delay, I'm testing in prep for my next Higher Performance Python class (frustratingly on Windows for a pension fund, so I only get a slide demo of Scalene this time).
Using Scalene 1.3.6 the problematic inconsistency has gone away and this is a good example of 5 runs:
memory_profiler still has different numbers:
18 2683.125 MiB 847.988 MiB 1 means = gpby.mean() # means by column
...
25 1918.949 MiB 162.168 MiB 1 df_sub = df.query('indicator==@indicator')[rnd_cols]
I've renamed the functions (variant 1 & 2 rather than better & poor) but otherwise that part hasn't changed.
What has changed is that I now also delete a column. Scalene records nothing here (!) but memory_profiler records a 680MB cost (it looks as though the block manager is duplicating columns internally and freeing them later, dependent on how you delete columns and on the state of the block manager... non-trivial stuff):
# Scalene
51 │ │ │ │ │ │ │ │ │ # now start some clean-up pending some further (not yet written) work
52 │ │ │ │ │ │ │ │ │ print("Prior to cleaning-up we have:", df.info())
53 │ │ │ │ │ │ │ │ │ del df['c_0'] # we don't actually need this column so we'll save some RAM
54 │ │ 13% │ │ │ │ │ │ 168 │ print("After cleaning up we have:", df.info())
# memory_profiler
51 # now start some clean-up pending some further (not yet written) work
52 1844.055 MiB 0.000 MiB 1 print("Prior to cleaning-up we have:", df.info())
53 2530.629 MiB 686.574 MiB 1 del df['c_0'] # we don't actually need this column so we'll save some RAM
54 2530.629 MiB 0.000 MiB 1 print("After cleaning up we have:", df.info())
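For anyone reproducing this, the usual invocations for the two profilers are (the script name here is only illustrative):
python -m memory_profiler memory_example.py
scalene memory_example.py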
There are reports in the Pandas bug tracker using memory_profiler to identify this sort of issue... I wonder if we can confirm it with Scalene, or show that actually something else is happening? del ... or df.drop(columns=['c_0'], inplace=True) do the same job and have different memory costs.
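A rough way to poke at this outside either profiler is to watch process RSS around each style of column removal. This is my own sketch, not from the issue: psutil is assumed to be installed, and the DataFrame shape and column names are guesses matching the example above.

import os

import numpy as np
import pandas as pd
import psutil

def rss_mb():
    # Resident set size of this process, in MB.
    return psutil.Process(os.getpid()).memory_info().rss / 1e6

def make_df(n_rows=1_000_000, n_cols=50):
    # Assumed setup: many random float columns plus an 'indicator' column.
    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.random((n_rows, n_cols)),
                      columns=[f"c_{i}" for i in range(n_cols)])
    df["indicator"] = rng.integers(0, 10, size=n_rows)
    return df

for how in ("del", "drop"):
    df = make_df()
    before = rss_mb()
    if how == "del":
        del df["c_0"]
    else:
        df = df.drop(columns=["c_0"])
    after = rss_mb()
    print(f"{how}: {before:.0f} MB -> {after:.0f} MB ({after - before:+.0f} MB)")
    del df

A before-and-after RSS snapshot like this can miss, or only partly show, a transient copy made by the block manager, which is exactly the kind of cost a line-level memory profiler should attribute; I mention it only as a cheap cross-check.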
memory_profiler and scalene give different depths of coverage of memory profiling on a simple Pandas example; Scalene gives less information, and I think this is a bug.
The sample code is listed in full below. I include full outputs from the code for both profilers. Specifically look at get_mean_for_indicator_poor, where memory_profiler identifies line 19 as costing 850MB whilst Scalene identifies nothing. In the get_mean_for_indicator_better function both profilers correctly identify line 27 as being expensive.
@emeryberger you may recognise this code as being a variant of https://github.com/pandas-dev/pandas/issues/37139
Scalene output (using Scalene 1.2.4):
memory_profiler output:
I'm using Linux Mint 20.1 (Cinnamon):
Full source: