nengo / nengo-dl

Deep learning integration for Nengo
https://www.nengo.ai/nengo-dl

Long simulations exhaust GPU resources #178

Open atait opened 4 years ago

atait commented 4 years ago

Probe data is transferred from the GPU to the host only at the end of a simulation's run_steps, so the device has to allocate space for all of it. For example, a Probe on 1M neurons for 8k timesteps would require 1M x 8k x float32 = 32GB, a full GPU's worth.
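As a quick back-of-the-envelope check of that figure (plain Python, nothing NengoDL-specific):

n_steps = 8000
n_neurons = 1_000_000
bytes_per_value = 4  # float32
print(n_steps * n_neurons * bytes_per_value / 1e9)  # -> 32.0 (GB)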

This can be solved by changing this

sim.run_steps(8000)

to

for _ in range(80):
    sim.run_steps(100)

The first one gives an error:

(0) Resource exhausted:  OOM when allocating tensor with shape[8000,1,1000000] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

I think you can reproduce that error by taking any stateful network and increasing the simulation steps. It would be helpful to catch that and give the user some more insight.
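Something along these lines is what I have in mind, just as a sketch (run_steps_checked is an illustrative name, not existing API, and I'm assuming the OOM surfaces as tf.errors.ResourceExhaustedError):

import tensorflow as tf

def run_steps_checked(sim, n_steps):
    # illustrative wrapper: turn the low-level OOM into an actionable hint
    try:
        sim.run_steps(n_steps)
    except tf.errors.ResourceExhaustedError as e:
        raise RuntimeError(
            "Probe storage for %d steps does not fit in GPU memory; "
            "try several shorter run_steps calls or subsampled probes"
            % n_steps
        ) from e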

The second one takes a long time because control returns to the Python interpreter on every iteration. You always have to return at some point to append the probe data; however, I currently don't see an option to avoid repeating all the checks, standardizations, and progress-bar launching inside sim.run_steps, sim.predict_on_batch, and sim._call_keras on each iteration.

Am I missing something that allows you to efficiently break up the simulation steps? I thought the unroll_simulation argument was it, but that seems to be different. Is there an option in keras to break up the steps, accumulate the data, and return the full data at the end? I'm a novice with keras.

If there is no option to do that, then one idea would be a new method that does those checks once and then runs a loop containing only this line (https://github.com/nengo/nengo-dl/blob/9fb7854e09e67d4cb9ac553da7c985cf7d97abdc/nengo_dl/simulator.py#L1050), this block (https://github.com/nengo/nengo-dl/blob/9fb7854e09e67d4cb9ac553da7c985cf7d97abdc/nengo_dl/simulator.py#L1176), and something like

self._n_steps += steps_per_iter

Nengo OCL has an example of that loop here: https://github.com/nengo-labs/nengo-ocl/blob/5be4f5416ea4b2564fca1f5ec75bf28bfcc03829/nengo_ocl/simulator.py#L620. It can run at full speed essentially until virtual memory is exhausted.
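In the meantime, the closest user-level workaround I can see is a chunked loop that at least skips the progress bar (a sketch, assuming run_steps accepts a progress_bar argument as in core Nengo, and that probe data keeps accumulating across calls):

chunk = 100
for _ in range(8000 // chunk):
    sim.run_steps(chunk, progress_bar=False)
# probe data from each chunk should accumulate in sim.data[probe]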

drasmuss commented 4 years ago

Probe data is transferred from the GPU to the host only at the end of a simulation's run_steps, so the device has to allocate space for all of it. For example, a Probe on 1M neurons for 8k timesteps would require 1M x 8k x float32 = 32GB, a full GPU's worth.

This is kind of an unavoidable aspect of how TensorFlow/Keras are set up. It is designed for situations where you pass a bunch of data to the device, run some computations, and then fetch the results back to the host. The solution you describe (running the simulation in a loop) is the recommended solution, allowing you to customize the simulation length based on your performance/memory needs.

The second one takes a long time because control returns to the Python interpreter on every iteration. You always have to return at some point to append the probe data; however, I currently don't see an option to avoid repeating all the checks, standardizations, and progress-bar launching inside sim.run_steps, sim.predict_on_batch, and sim._call_keras on each iteration.

The overhead of repeatedly launching simulations is primarily within the getattr(self.keras_model, func_type)(**func_args) call (i.e. it is a property of the underlying TensorFlow/Keras engine); it isn't the other code in e.g. _call_keras that is slowing things down. This relates to the same point above: TensorFlow/Keras (and therefore, by extension, NengoDL) is optimized for executing large computational blocks, and is not very efficient at running a lot of small ones. If you have seen some non-trivial overhead in the _call_keras function, let us know and we can look into optimizing it.

One thing to keep in mind is that if you don't need the Probe output from every timestep, you can use the Probe.sample_every parameter to subsample the output. In theory this would reduce the memory usage, but due to the current implementation constraints you don't actually see that memory saving until the end of the run_steps block (as discussed here https://forum.nengo.ai/t/nengodl-high-memory-usage/1264/5?u=drasmuss). Do you need the probe output at every timestep, or would sample_every work for you? If so we could prioritize adding the on-device subsampling, since we have a couple reported use cases.
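For reference, this is roughly what that looks like on the model side (sizes here are just illustrative):

import nengo

with nengo.Network() as net:
    ens = nengo.Ensemble(1000, dimensions=1)
    # with the default dt=0.001, this keeps 1 sample out of every 10 steps
    p = nengo.Probe(ens.neurons, sample_every=0.01)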

drasmuss commented 4 years ago

Oh I should also mention that some of the suggestions here https://www.nengo.ai/nengo-dl/tips#speed-improvements should help with running the simulation in a loop (in particular if you can fit an unrolled model in memory then unroll_simulation and use_loop=False should reduce your overhead; inference_only=True is always a good idea if you aren't training the model).
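Roughly, those settings combine like this (a sketch that assumes each run_steps chunk is kept equal to unroll_simulation, which is the simple case for a fully unrolled model):

import nengo
import nengo_dl

with nengo.Network() as net:
    nengo_dl.configure_settings(inference_only=True, use_loop=False)
    ens = nengo.Ensemble(100, dimensions=1)
    p = nengo.Probe(ens.neurons)

chunk = 100
with nengo_dl.Simulator(net, unroll_simulation=chunk) as sim:
    for _ in range(80):
        sim.run_steps(chunk)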

hunse commented 4 years ago

You can also use the keep_history option in nengo_dl.configure_settings to have probes just store the last output (you can further configure this for individual probes; see the docs). If you combine this with an integrative filter (e.g. a long lowpass filter or the perfect integrator nengo.LinearFilter([1], [1, 0])), then you can measure e.g. neuron firing rates over the whole simulation without having to store the history.
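As a rough sketch of measuring average firing rates that way (the division at the end assumes the integrated spike train ends up as a spike count, so treat it as approximate):

import nengo
import nengo_dl

with nengo.Network() as net:
    nengo_dl.configure_settings(keep_history=False)
    ens = nengo.Ensemble(1000, dimensions=1)
    # perfect integrator as the probe synapse: accumulates spikes over the run
    p = nengo.Probe(ens.neurons, synapse=nengo.LinearFilter([1], [1, 0]))

with nengo_dl.Simulator(net) as sim:
    sim.run_steps(8000)
    # only the final value is stored; divide by run time for rates in Hz
    rates = sim.data[p][-1] / (8000 * sim.dt)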

atait commented 4 years ago

One thing to keep in mind is that if you don't need the Probe output from every timestep, you can use the Probe.sample_every parameter to subsample the output.

An on-device version of sample_every would give a new and useful knob in the tradeoff, but it will not solve this issue. If a user can give up 90% of the data, their simulation can run 10x longer without looping. If a user can't give up any data and wants to run 1000x longer, they still have to use the loop approach.

You can also use the keep_history option in nengo_dl.configure_settings to have probes just store the last output

That might be the way to go to keep device memory at minimum. It's kind of the opposite of on-device sample_every. Are you suggesting the user do a loop like this?

nengo_dl.configure_settings(keep_history=False)
for _ in range(n_steps // like_sample_every):
    sim.run_steps(like_sample_every)

The overhead of repeatedly launching simulations is primarily within the getattr(self.keras_model, func_type)(**func_args) call (i.e. it is a property of the underlying TensorFlow/Keras engine),

I did some profiling on this. Most of the overhead occurs in keras (like you said), unless the number of steps per iteration is 16 or less, in which case Nengo DL becomes the bottleneck; the change in behavior happens inside keras. Here's the test:

Experiment

This command runs 500k neurons ten times, then picks the methods of interest out of the profile report.

python -m cProfile -s tottime benchmark_wattsstrogatz.py dl 500_000,500_000,500_000,500_000,500_000,500_000,500_000,500_000,500_000,500_000 | egrep "(just_the_keras_call|_call_keras|predict_on_batch|run_steps|cumtime)"

I put that "getattr" line into its own method (just_the_keras_call), so it would be seen by the profiler.

Output with one iteration of 1000 steps

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       20    0.337    0.017   74.348    3.717 simulator.py:915(just_the_keras_call)
       20    0.000    0.000   74.009    3.700 training.py:1767(predict_on_batch)
       20    0.000    0.000   74.385    3.719 simulator.py:1117(run_steps)
       20    0.000    0.000   74.356    3.718 simulator.py:918(_call_keras)
       20    0.000    0.000   74.357    3.718 simulator.py:689(predict_on_batch)

Output with 100 iterations of 10 steps

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1010    0.104    0.000   94.549    0.094 simulator.py:915(just_the_keras_call)
     1010    0.053    0.000   95.578    0.095 simulator.py:918(_call_keras)
     1010    0.053    0.000   98.510    0.098 simulator.py:1117(run_steps)
     1010    0.031    0.000   93.873    0.093 training.py:1767(predict_on_batch)
     1010    0.011    0.000   95.715    0.095 simulator.py:689(predict_on_batch)

Some more data points:

iterations               1      10    25    50    100     500    1000
steps per iteration      1000   100   40    20    10      2      1
time spent in nengo DL   <.3%   1.7%  3.5%  5.1%  59%     61%    60%

When using the loop approach, 60% of that time is spent in Nengo DL, outside keras, compared to <0.3% when not looping.

Here's a zoom in on that threshold, using a different number of total time steps for divisibility:

iterations               62    62
steps per iteration      17    16
time spent in nengo DL   5.8%  58%

16 steps

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      630    0.048    0.000   89.716    0.142 simulator.py:915(just_the_keras_call)
      630    0.024    0.000   90.174    0.143 simulator.py:918(_call_keras)
      630    0.023    0.000   91.552    0.145 simulator.py:1117(run_steps)
      630    0.016    0.000   89.404    0.142 training.py:1767(predict_on_batch)
      630    0.004    0.000   90.232    0.143 simulator.py:689(predict_on_batch)

17 steps

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      630    0.648    0.001  101.920    0.162 simulator.py:915(just_the_keras_call)
      630    0.013    0.000  102.886    0.163 simulator.py:1117(run_steps)
      630    0.013    0.000  102.160    0.162 simulator.py:918(_call_keras)
      630    0.012    0.000  101.133    0.161 training.py:1767(predict_on_batch)
      630    0.002    0.000  102.191    0.162 simulator.py:689(predict_on_batch)

Keras is significantly faster for 16 steps than for 17. Perhaps there is a cost to coordinating multiple registers or something like that. When keras is at that 16-step sweet spot, Nengo DL accounts for a majority of the overhead, which dulls the threshold unless you are specifically looking for it. Also, I think it motivates development of a version of run_steps that does not have all the overhead.

drasmuss commented 4 years ago

I think this is just an artifact of the way you are printing the profiling data. I'm guessing you've got something like

def just_the_keras_call(self, func_type, func_args):
    return getattr(self.keras_model, func_type)(**func_args)

when you look at the tottime of just_the_keras_call in that case, you're only measuring the time inside that function itself, not including the time spent in the inner Keras call (i.e., basically just the overhead of calling just_the_keras_call). So that number isn't actually telling you how much time you're spending inside the Keras engine.

If you look at the cumtime you can get a better picture, as that's telling you how much time is spent inside each function including subcalls. So in your case with e.g. 16 steps, you're spending 91.552 seconds in run_steps, and 89.716 seconds of that is inside just_the_keras_call. That means that you're spending 1.836s (i.e. 2%) in NengoDL.
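Here's a tiny standalone illustration of the difference, nothing NengoDL-specific (the wrapper's tottime stays near zero while its cumtime includes the inner call):

import cProfile
import pstats
import time

def inner():
    time.sleep(0.2)  # stands in for the Keras engine call

def wrapper():
    inner()  # analogous to just_the_keras_call

pr = cProfile.Profile()
pr.enable()
wrapper()
pr.disable()
# wrapper shows tottime ~0s but cumtime ~0.2s
pstats.Stats(pr).sort_stats("cumtime").print_stats(5)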

atait commented 4 years ago

Yes, I think you are right. I was thinking of keras as an opaque call that cProfile would lump in with the calling function. Of course, that's not the case, because keras has some non-compiled Python before getting into primitive and/or compiled calls. cProfile records those, but they're missed by my grep. Thanks for pointing that out.

There is still this odd and statistically significant threshold between 16 and 17 steps, but we're talking about much smaller differences (10% instead of 60%). What I'm getting from this discussion of overhead is that Nengo DL has a lot of interpreted overhead, but it's not a big deal because keras's is even worse. Is that fair?

drasmuss commented 4 years ago

I think the 16 vs 17 steps thing is just an environmental quirk on your machine (or at least, it doesn't show up on my machine). There is also a lot less overhead in general on my machine. Here is what it looks like for me with

16 steps

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   248/62    0.000    0.000    5.194    0.084 ...\nengo\nengo\utils\magic.py:174(__call__)
   124/62    0.000    0.000    5.194    0.084 ...\nengo-dl\nengo_dl\simulator.py:65(require_open)
       62    0.002    0.000    5.193    0.084 ...\nengo-dl\nengo_dl\simulator.py:1118(run_steps)
       62    0.000    0.000    5.182    0.084 ...\nengo-dl\nengo_dl\simulator.py:689(predict_on_batch)
   124/62    0.002    0.000    5.181    0.084 ...\nengo-dl\nengo_dl\simulator.py:49(with_self)
       62    0.001    0.000    5.179    0.084 ...\nengo-dl\nengo_dl\simulator.py:915(_call_keras)
       62    0.172    0.003    5.154    0.083 ...\nengo-dl\nengo_dl\simulator.py:1050(just_the_keras_call)

vs 17 steps

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   232/58    0.000    0.000    5.127    0.088 ...\nengo\nengo\utils\magic.py:174(__call__)
   116/58    0.000    0.000    5.127    0.088 ...\nengo-dl\nengo_dl\simulator.py:65(require_open)
       58    0.001    0.000    5.126    0.088 ...\nengo-dl\nengo_dl\simulator.py:1118(run_steps)
       58    0.000    0.000    5.116    0.088 ...\nengo-dl\nengo_dl\simulator.py:689(predict_on_batch)
   116/58    0.001    0.000    5.115    0.088 ...\nengo-dl\nengo_dl\simulator.py:49(with_self)
       58    0.001    0.000    5.113    0.088 ...\nengo-dl\nengo_dl\simulator.py:915(_call_keras)
       58    0.171    0.003    5.090    0.088 ...\nengo-dl\nengo_dl\simulator.py:1050(just_the_keras_call)

Note that that is ~0.7% overhead in both cases.

Here is how I am timing things (just doing it in-code without any grepping, for one 500,000-neuron run) in case that makes a difference in your results:

import cProfile
import pstats
import time

# model, sim_class, sim_kwargs, and n_steps come from the benchmark script
with sim_class(model, **sim_kwargs) as sim:
    t_sim = time.time()

    # -- warmup
    sim.run(0.01)
    t_warm = time.time()

    # -- long-term timing
    with cProfile.Profile() as pr:
        # sim.run(simtime)
        steps_per_iteration = 16
        for _ in range(n_steps // steps_per_iteration):
            sim.run_steps(steps_per_iteration)
        t_run = time.time()

    pstats.Stats(pr).sort_stats("cumtime").print_stats(20)

So based on my results anyway, I don't think I'd say that NengoDL has a lot of interpreted overhead, but I'd be interested to know why your environment seems to have more (although still <2% overhead in the worst case).

atait commented 4 years ago

Thank you for doing your own tests! Sorry for the delayed response. I agree, based on the cumtime measurement, that Nengo DL is not introducing significant additional overhead. I also learned how to use the NVIDIA Visual Profiler. I'm not sure how it compares to the TF profiler, but it gives some pretty pictures that correspond to the results above.

[Screenshot: NVIDIA Visual Profiler timeline, 2020-09-13]

This is one iteration of the for loop. Keras' Python overhead is greater than Nengo DL's. There's also something going on in the TF kernel where data is being transferred to the host every timestep. I have inference_only=True, so I don't know what it could be. Do you have an idea? Can you reproduce it with a CUDA-level profiler?

Even if you did eliminate that kernel dead time, the Keras Python overhead associated with looping would become significant, although it amortizes over the number of steps per loop iteration.

I just found this interesting. I think we can close out the original issue, though. Perhaps a simple try/except block would be helpful for future Nengo DL users. Thanks again.

drasmuss commented 4 years ago

Just an update: we do plan to add some documentation to help users diagnose and fix memory issues like this; it'll just be a little while before we have a chance to get to it, I suspect.