plasma-umass / scalene

Scalene: a high-performance, high-precision CPU, GPU, and memory profiler for Python with AI-powered optimization proposals

All-files ranking #485

Open · jacopok opened this issue 1 year ago

jacopok commented 1 year ago

Currently, scalene profiles are broken down strictly by the file in which the code is contained. This is inconvenient for large code bases spanning many different files, as was also mentioned in https://github.com/plasma-umass/scalene/issues/132.

I will give an example with a specific use case of mine: I am developing mlgw_bns, a software package for gravitational-wave data analysis whose whole purpose is to predict, very quickly, the gravitational waveform emitted by a certain class of sources.

So, there is one method (Model.predict) that needs to be very fast; everything else is secondary. Before showing the scalene output, I will show what I get with cProfile + snakeviz.

To reproduce: create a virtualenv (I did it with Python 3.10.6), then `pip install mlgw-bns snakeviz`.

We then need a script like the following:

```python
from mlgw_bns import Model, ParametersWithExtrinsic
import numpy as np
import cProfile

model = Model.default()
par = ParametersWithExtrinsic.gw170817()
freq = np.linspace(20, 2000, num=1000)

# warm-up call, to factor out numba JIT compilation
model.predict(freq, par)

with cProfile.Profile() as prof:
    model.predict(freq, par)

prof.dump_stats('profile.prof')
```
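Since the dump is a standard cProfile file, a quick text-based look is also possible without snakeviz, using the standard-library pstats module; this is just a sketch of the usual pstats idiom, nothing specific to mlgw_bns:

```python
import pstats

# Load the dump and print the ten entries with the largest
# cumulative time; sorting by "tottime" would rank by self time instead.
stats = pstats.Stats("profile.prof")
stats.sort_stats("cumulative").print_stats(10)
```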

Running this generates a profile we can visualize with `snakeviz profile.prof` from the command line; the output looks something like this:

[screenshot: snakeviz visualization of the cProfile results]

The overall time is about 2.5 ms, and I can clearly see the call stack of the various functions. I think there is basically no "non-native" time left to optimize away: the long-running functions either come from libraries such as scipy or are decorated with numba, and optimizing along these lines has already brought the evaluation time down significantly. Still, it would be good to have scalene's output to verify this. Unfortunately, that output is quite difficult to interpret in this case.

(A note: this is already quite fast, but the evaluation will then need to be repeated millions of times, hence the speed requirement.)
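As a sanity check on that ~2.5 ms figure, independent of any profiler, the call can also be timed directly with the standard-library timeit module; this assumes the model, freq, and par objects from the script above, and the repetition count is an arbitrary choice:

```python
import timeit

# Average wall-clock time per call; the earlier warm-up call has
# already triggered numba JIT compilation, so this measures steady state.
n = 1000
total = timeit.timeit(lambda: model.predict(freq, par), number=n)
print(f"{total / n * 1e3:.3f} ms per call")
```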

So, can scalene help with this task? I found it quite tricky. The best I could come up with to accomplish a similar measurement was to write the script as follows:

```python
from mlgw_bns import Model, ParametersWithExtrinsic
from scalene import scalene_profiler
import numpy as np

model = Model.default()
par = ParametersWithExtrinsic.gw170817()
freq = np.linspace(20, 2000, num=1000)

# warm-up call, to factor out numba JIT compilation
model.predict(freq, par)

scalene_profiler.start()

for _ in range(2000):
    model.predict(freq, par)

scalene_profiler.stop()
```

where I am repeating the evaluation thousands of times since, otherwise, the program does not run long enough to gather samples; reducing `--cpu-sampling-rate` does not appear to change this.
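If hard-coding the repetition count feels fragile, one alternative is to loop for a fixed wall-clock budget instead, so that the profiled region is always long enough regardless of how fast a single call is; `repeat_for` below is a hypothetical helper written for illustration, not something provided by scalene:

```python
import time

def repeat_for(fn, seconds=5.0):
    """Call fn repeatedly until roughly `seconds` of wall-clock
    time have elapsed; return the number of calls made."""
    calls = 0
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        fn()
        calls += 1
    return calls

scalene_profiler.start()
n_calls = repeat_for(lambda: model.predict(freq, par))
scalene_profiler.stop()
print(f"profiled {n_calls} calls")
```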

I am then running this with `scalene --reduced-profile --profile-all --off mlgw_to_profile.py`, and the output looks something like:

[two screenshots: scalene's output, broken down file by file]

It (correctly, I believe) reports that most time is spent in native code, but that is about it as far as useful output goes. The fact that most time is attributed to the threading module is, I think, not "fixable"; but the reporting for the other files is not framed very usefully: since the ranking is per file, there is no way to tell at a glance which lines are the worst overall.
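One workaround I can imagine, until scalene offers such a ranking itself: run with the `--json --outfile profile.json` options and merge the per-line records across files in a small post-processing script. The sketch below assumes the JSON layout has a top-level "files" mapping whose entries contain a "lines" list with "lineno" and CPU-percentage fields; the exact field names may differ between scalene versions, so treat this as illustrative:

```python
import json

with open("profile.json") as f:
    data = json.load(f)

rows = []
for fname, finfo in data["files"].items():
    for line in finfo["lines"]:
        # Total CPU share of this line: Python plus native time
        # (field names assumed; check your scalene version's output).
        cpu = line.get("n_cpu_percent_python", 0) + line.get("n_cpu_percent_c", 0)
        if cpu > 0:
            rows.append((cpu, fname, line["lineno"]))

# The worst lines across *all* files, not per file.
for cpu, fname, lineno in sorted(rows, reverse=True)[:20]:
    print(f"{cpu:6.2f}%  {fname}:{lineno}")
```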

I am sorry if this issue comes across as confused; I am not sure whether I am messing up somewhere or whether scalene simply cannot help with this kind of problem.

jacopok commented 1 year ago

Maybe a better title would be "can I get useful output in this case?". An all-files ranking seems like the most directly approachable improvement, but it may not be the most impactful...