Lack of parallelization of step_current_generator

RCagnol commented 2 years ago

Describe the bug The step_current_generators contribution in the simulation time seem to scale very badly with the number of threads used, in comparison to the contribution of neurons such as iaf_cond_exp. We reported the same issue on the mailing list here: https://www.nest-initiative.org/mailinglist/hyperkitty/list/users@nest-simulator.org/thread/NRWWS6X6JA53FCVOK2N7EXIFYYXGXNXI/

To Reproduce Steps to reproduce the behavior:

import numpy as np
import matplotlib.pyplot as plt
import nest
import nest.voltage_trace
import time
import sys

# All times given in milliseconds
dt = 0.5
dt_rec = 1.
t_end = 2000. # Simulation time

NL         = int(sys.argv[1]) # number of  neurons
NS         = int(sys.argv[2]) # number of step_current_source created
numThreads = int(sys.argv[3]) # number of threads

nest.set_verbosity("M_WARNING")
nest.ResetKernel()
nest.SetKernelStatus({'local_num_threads': numThreads}) # old syntax is nest.local_num_threads = numThreads
nest.resolution = dt
nest.print_time = True

t0 = 0

# Parameters of the neurons
tau_ex_L = 1.5  # in ms
tau_in_L = 10.0  # in ms
tau_m_L = 10.0
C_m_L = 0.29*1000 #nF to pF conversion
g_L = C_m_L/tau_m_L

paramsL = {
    'V_m': -70.0,# initial membrane potential
    'E_L': -70.0 , # v_rest, Leak reversal potential
    'C_m': C_m_L, # cm
    't_ref': 2.0, # tau_refrac
    'V_th': -57.0, # v_thresh
    'V_reset': -70.0, # v_reset
    'E_ex': 0.0, # e_rev_E
    'E_in': -75.0, # e_rev_I
    'g_L' : g_L, # cm/tau_m_L,
    'tau_syn_ex': tau_ex_L, # tau_syn_E
    'tau_syn_in': tau_in_L, # tau_syn_I
    'I_e': 0.0, #Constant input current
}
## Create NL neurons
lgn_pop = nest.Create('iaf_cond_exp', NL)
nest.SetStatus(lgn_pop,paramsL)

## create time list for step current
sdt = 7
amplitude_tims = np.arange(sdt,t_end,sdt)
amplitude_vals = [600.]*len(amplitude_tims)

## inject the spike current into the neurons
nest_stepcurrent = nest.Create('step_current_generator', NS, params={"amplitude_values": amplitude_vals, "amplitude_times": amplitude_tims, "start": 0., "stop": t_end})
con_params_ll = {"rule": "fixed_indegree", "indegree": 25}
nest.Connect(nest_stepcurrent, lgn_pop, con_params_ll)

## run the simulation
nest.rng_seed = 1
time_start = time.time()
nest.Simulate(t_end + dt_rec)
time_evaluate = time.time()

## print the results
print('Simulation time: {:.3f} s'.format(time_evaluate-time_start))
print('Number of lgn neurons: {}'.format(NL))
print('Number of step current sources: {}'.format(NS))

Expected behavior One would expect the simulation time of multiple step_current_generator to scale the same way with multithreading as with the numbers of neurons in the network. i.e., when simulating the network for several values of NS number of step_current_generator and NL number of neurons such and using a linear regression on the simulation time such as: simulation time = a NL + b NS (see the regression example plot for an example of this)

Then the slowdown ratio b/a should be roughly constant independently of the number of thread use. Instead, it increases greatly as the number of threads increases, showing that additional step_current_generators slow the simulation more (relatively to additional neurons) when a high number of threads is used than when a low number of threads in use. The slowdown ratio is independent on the interval between 2 changes of current in the step_current_generators (variable sdt in the code, see the interval_dependance screenshot for an illustration).

Screenshots Regression example: slowdown_example

Slowdown ratio interval dependance: interval_dependence

Desktop/Environment (please complete the following information):

OS: Ubuntu-20.04
Number of physical cores available on the machine: 32
Python-Version: Python 3.8
NEST-Version: nest-3.1
Installation: Built from source

heplesser commented 2 years ago

@RCagnol I have started to look at this problem now and hope to get back to you with a first analysis shortly.

heplesser commented 2 years ago

@RCagnol How many neurons and sources did you use for the line graph of the slowdown ratio above? Your data was produced with NEST 3.1. Does the problem persist with NEST 3.3?

RCagnol commented 2 years ago

This graph was computed using all combinations of values of 30, 2500, 5000, 7500 and 10 000 for both the number of current sources and number of neurons, such that n_neurons >= n_current_sources would always be true.

It seems that removing combinations which didn't meet this criterion of n_neurons >= n_current_sources tended to make the fitting less precise and overestimate this slowdown ratio, so when comparing nest 3.1 and nest 3.3 this week, I kept all combinations of these values. It seems that there's no difference between Nest 3.1 and Nest 3.3 and that the problem still persists:

intervaldependence

heplesser commented 2 years ago

So the data shown for a given thread number is the average across all combinations of n_neurons and n_current_sources? Could you provide graphs for a few n_neurons x n_current_sources combinations? Then I could run those combinations to see if I can reproduce your observations without having to run all combinations (with an eye to saving time and electricity). Could you also add markers to your line plots so it is entirely clear which thread numbers you used? It looks like 1, 8, 16, 24, 32?

RCagnol commented 2 years ago

No, it is not an average. This slowdown ratio is calculated as the ratio b/a such as: simulation time = a n_neurons + b n_sources So to calculate the slowdown ratio for a given number of thread, we'd run a simulation for each combination of (n_neurons, n_sources) that I mentioned earlier, and do a 2D regression to obtain a and b. The goal with that is to evaluate how much the simulation time scales with n_neurons and n_sources for each given number of thread. This line plot shows that as the number of thread increases, this slowdown ratio increases and therefore the simulation time scales much more with the number of sources (than with the number of neurons) when many threads are used. So therefore, I cannot provide these line plots for a few combinations as this slowdown ratio cannot be calculated with a single combination (I hope my explanation was clear).

These were the number of threads we used indeed, but anyway here is a modified version of the plot where it is more apparent:

intervaldependence

heplesser commented 2 years ago

Thank you for your detailed explanation. I understand your approach now. It is a rather different way of analysing scaling than analyses we have done so far, so I didn't grasp it before.

Generally speaking, it is recommendable to keep the number of generators (of any kind) as small as possible. Each generator is replicated once per virtual process, in your case thus one per thread, so that generators can send events directly to receiving neurons without having to go through the central event queue.

If I understand you script right, you have a large number of different input current profiles which you would like to inject into different neurons in different combinations (each neuron receiving 25 randomly selected superimposed current profiles). Would it be possible for you to compute the superimposed current profiles for each neuron in Python and then create one step_current_generator per neuron with the relevant profile?

If that is not an alternative we would need to discuss in more detail what you want to achieve and how we could implement that efficiently in NEST.

RCagnol commented 2 years ago

Actually our usual use case would be indeed having current sources and neurons connected one-to-one. This in degree of 25 is just a remnant of some tests we did to determine whether this bad scaling with the number of current sources was due to the generation of the current itself or by the step of injecting it to the neurons itself (I have to say that I don't have much insight on how nest works so I don't know how relevant is this distinction) . I completely forgot to modify it back, sorry for this.

To detail our use case more, we have a model of V1 in our lab, in which the layer 4 receives input from some neurons representing the LGN. At each timestep, each single LGN neuron receives a current from a single current source as an input, this current being calculated as the convolution between the stimuli and the kernel/receptive field of the neuron. What we observed is that our simulation time was very dependent of the number of such LGN neurons, while the V1 neurons are much more numerous. That's when we hypothesized that the culprit for this were the current sources (as there was no reason to expect our LGN neurons to be slower than our V1 neurons, and as in our typical one-to-one connections use case we needed to add an additional current source for each additional LGN neuron). The tests we did later seemed to confirm this.

In any case, we observed the same phenomenon with an in-degree of 1 (in the case of this scaling analysis we couldn't use one-to-one connections as the number of neurons and sources can be different). Here are the results we obtained back then when trying different values of in-degrees:

indegree_dependence(2)

Note that this plot was obtained with the constraint that n_neurons >= n_current_sources. Beside that, the same values were used both for the number of threads and the numbers of neurons and current sources. I can recompute the data using all the combinations of n_neurons and n_current_sources if necessary.

We interpreted it the following way: It's logical that with higher in-degrees the simulation time scales relatively less with the number of current source, as increasing the number of neurons has then a bigger impact on the simulation (50 or 25 times more injection per additional neuron). Still, it seems that when increasing the number of threads the simulation time scales more with the number of current sources than with the number of neurons even when larger number of in-degree are used, so this for us was an indicator that this lack of parallelization is at the level of the generation of the current itself and not at the level of this injection. Nevertheless, I'm not 100% sure about this interpretation and it'd be great to have your opinion on that.

alex4200 commented 2 years ago

@heplesser Can you please comment on the above comment? Or can this ticket be closed @RCagnol ?

github-actions[bot] commented 2 years ago

Issue automatically marked stale!

heplesser commented 2 years ago

A solution for this problem is available as an extension module providing aeif_cond_exp and iaf_cond_exp models with built-in step current generator avoiding bottlenecks. I am therefore closing this issue.

nest / nest-simulator

Lack of parallelization of step_current_generator #2406