veeresht / CommPy

Digital Communication with Python
http://veeresht.github.com/CommPy
BSD 3-Clause "New" or "Revised" License

Add Multiprocess execution #82

Open eSoares opened 3 years ago

eSoares commented 3 years ago

Add a LinkModel that performs the simulation in multiple processes, one process per SNR. The implementation works as a drop-in replacement: you just need to import _multiprocesslinks instead of links.
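Sketched usage, assuming the new module sits next to commpy.links as described above (the exact import path is my guess from the description):

# Drop-in replacement: same API as commpy.links, multiprocess execution
# from commpy.links import LinkModel
from commpy._multiprocesslinks import LinkModel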

Also adds a Wifi80211 multiprocess version, which can run one process per SNR, or go further and run one process per MCS x SNR. An example of usage is available! This parallelization helps a lot: in the available example, the classic single-process run takes about 3 min 7 s, while the multiprocess version (on my 8-core work machine) takes about 40 seconds.

Because of the limitations of multiprocessing in Python, any parameter passed to the process that executes the simulation needs to be pickle-serializable. This causes some problems when passing custom functions for receiving, decoding and so on, but as long as they are defined at the root of the file and not nested (a function defined inside another function), they should work.
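To illustrate the constraint (the receiver functions below are hypothetical stand-ins, not CommPy API):

import pickle

def my_receiver(y):
    # Defined at module level: picklable, safe to pass to a worker process.
    return y

def make_receiver():
    def nested_receiver(y):
        # Defined inside another function: not picklable.
        return y
    return nested_receiver

pickle.dumps(my_receiver)  # fine
try:
    pickle.dumps(make_receiver())
except (pickle.PicklingError, AttributeError) as e:
    print("nested function cannot be sent to a process:", e)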

eSoares commented 3 years ago

I have run the exact same tests as Travis on a couple of different machines without any problem... If anyone knows what could be the cause, feel free to comment!

BastienTr commented 3 years ago

What's surprising is that both failed tests have exactly the same error.

BastienTr commented 3 years ago

The test is also OK on my laptop... Could this error come from a different library version? Travis uses the most recent packages and we may have older ones.

coveralls commented 3 years ago

Coverage Status

Coverage increased (+0.8%) to 80.069% when pulling 1205fc8fb2dcf8c2f53b9553c8b439cb0ed2be1a on eSoares:multiprocess into 2a60fc5fbde3a80b4c3dec7a44fb75edccc350f4 on veeresht:master.

eSoares commented 3 years ago

Looks like forcing it to use only one process works for now... At least I had it run on Travis twice with success. (Another run at: https://travis-ci.org/github/eSoares/CommPy/builds/721700025 )

BastienTr commented 3 years ago

Hi @eSoares, I'm not sure what to do with this PR, which bothers me for many reasons.

For your information, here is the multiprocessing I use, which gives much better speedup and scalability. Indeed, each core computes a whole SNR vs BER curve, so all cores have more or less the same computation time. I use p_tqdm, which is a wrapper around pathos that generates progress bars.

As explained earlier, I have to turn off numpy's multithreading. This solution looks simple enough to me to let the user write their own code. That's why I think we should leave this kind of optimization to the user. Maybe add an example that gives guidance on this point.

from commpy.links import link_performance
from p_tqdm import p_map

# Models as a list
models = []
# Define each model to test here...

# Helper function. SNRs, nb_it, nb_err, chunk and code_rate are
# assumed to be defined earlier in the script.
def perf(model):
    return link_performance(model, SNRs, nb_it, nb_err, chunk, code_rate)

# One process per model, capped at 12 cores (nb_test is the number of models)
BERs = p_map(perf, models, num_cpus=min(12, nb_test))

This is only my point of view. I am fully open to your opinions and arguments.

eSoares commented 3 years ago

Hello Bastien,

I agree with some of your concerns. My PR aims to make the use cases I have encountered easier; note that nothing in this code interferes with or changes the possibility of using other strategies to divide the work.

> Including code that can give a silent bug in some environments seems hazardous to me. Even if it does work on our individual computers, the failure on Travis suggests that this failure could occur for other users.

I can try to set the seed inside each process; that should make the multiprocessing reproducible and more stable. If this PR goes forward, I will make this change.
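For example, a minimal sketch of per-process seeding (simulate_one_snr is a hypothetical stand-in for the real worker):

import numpy as np
from multiprocessing import Pool

def simulate_one_snr(args):
    snr, seed = args
    # Each process gets its own deterministic RNG stream, so the whole
    # multiprocess run is reproducible regardless of scheduling.
    rng = np.random.RandomState(seed)
    # ... run the link simulation for this SNR using rng ...
    return snr, rng.rand()  # placeholder result

if __name__ == '__main__':
    snrs = [0, 5, 10, 15]
    with Pool() as pool:
        results = pool.map(simulate_one_snr,
                           [(snr, 1234 + i) for i, snr in enumerate(snrs)])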

> The limitation to picklable items looks cumbersome and confusing to the untrained eye. If we merge this PR, we could switch to pathos to work around this limitation. It would be better in my eyes even if it means adding a dependency.

I did not know about this library, but it looks like a great improvement! I will use it in the future! 👍

> The proposed parallelization is not optimal. Here, you parallelize the SNR loop, which is not efficient. Indeed, the size of the iterations varies (the best SNR requires a lot more computation than the worst to get the same number of errors). The better solution is either to parallelize on the runs (each core processes a channel/receiver configuration...) or to divide the total computation across the cores (if we ask for 200 errors on 2 cores, each core computes 100 errors and the average BER is returned).

I guess the best strategy would be to parallelize the iterations inside link_performance, both over the SNRs and over the number of sends until send_max; only the first is parallelized in the current state. A good change would be to iterate over the SNRs sequentially but parallelize the sends until send_max, with some shared memory to synchronize the number of runs and errors so that each SNR knows when to stop. A rough sketch of that idea follows.
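A sketch of the shared-counter idea for a single SNR (all helper names are hypothetical and a dummy error model stands in for the real simulation):

import numpy as np
from multiprocessing import Pool, Value

err_count = None  # shared error counter, handed to each worker by the initializer

def init_worker(counter):
    global err_count
    err_count = counter

def send_chunk(args):
    chunk_size, nb_err, seed = args
    if err_count.value >= nb_err:
        return 0, 0  # enough errors already gathered for this SNR, skip the work
    rng = np.random.RandomState(seed)
    # Placeholder for one simulated chunk: real code would modulate, propagate
    # through the channel, decode and count the bit errors.
    errors = int(rng.binomial(chunk_size, 0.01))
    with err_count.get_lock():
        err_count.value += errors
    return errors, chunk_size

def parallel_ber_one_snr(nb_err, chunk_size, nb_chunks, nb_core=4):
    counter = Value('i', 0)
    jobs = [(chunk_size, nb_err, seed) for seed in range(nb_chunks)]
    with Pool(nb_core, initializer=init_worker, initargs=(counter,)) as pool:
        results = pool.map(send_chunk, jobs)
    total_err = sum(e for e, _ in results)
    total_bits = sum(b for _, b in results)
    return total_err / total_bits if total_bits else 0.0

if __name__ == '__main__':
    print(parallel_ber_one_snr(nb_err=200, chunk_size=10_000, nb_chunks=100))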

> I'm not sure that adding multiprocessing into the library is appropriate. Indeed, multiprocessing optimization should be left to the user, who often has the opportunity to parallelize more widely (launching several test configurations in parallel, for example).

Well, whether such elements should exist in the library or not is up for debate. Even if they are not in the core part, they should at least exist in some examples.

> Moreover, numpy may already use several cores in some situations, depending on the installation setup. I had to turn off numpy's multithreading before adding my multiprocessing trick. If not, the computation time was far worse than with the simple solution.

From what I found, it is particularly problematic in matrix multiplication [1] (which is not used?), but the rest of the operations were not problematic. But I'm no expert in mixing numpy with multiprocessing, so I take your statement that it should be used with care :) Maybe set the number of threads that numpy can use to 1 before any execution [2], and apply the taskset trick after import [1], if this code goes forward.
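For instance, a sketch combining the two workarounds from the linked answers:

import os

# Limit numpy/BLAS to a single thread per process; this must happen
# before numpy is imported for the first time (see [2]).
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np

# Linux only: some BLAS builds pin the process to one core on import;
# resetting the CPU affinity restores access to all cores (see [1]).
os.system("taskset -p 0xff %d" % os.getpid())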

> For your information, here is the multiprocessing I use, which gives much better speedup and scalability. Indeed, each core computes a whole SNR vs BER curve, so all cores have more or less the same computation time. I use p_tqdm, which is a wrapper around pathos that generates progress bars.

Thank you, it is a nice library that I did not know of. 😃 As I mostly use docker to run the code (on a remote machine), the progress bar in the CLI is not really useful, but it is still good to know the option.

About your example code in particular, that is kind of what I do, but instead of parallelizing over multiple models (since I only want to test one), I parallelize over multiple SNRs (and MCS in the case of WiFi).

In the end, the answer to "what makes sense to parallelize" is "it depends on what you're comparing" and, of course, "how many cores can you throw at it". If you have enough cores, parallelizing over each SNR may take more advantage of the available computing power.

Still, even if this PR doesn't move forward, I think it should at least result in an example and a set of instructions about being careful with numpy and multiprocessing, so that less knowledgeable users (like myself) don't make the mistakes you are pointing out! 😃


1 - https://stackoverflow.com/a/15647474
2 - https://stackoverflow.com/a/48665619

BastienTr commented 3 years ago

> I can try to set the seed inside each process; that should make the multiprocessing reproducible and more stable. If this PR goes forward, I will make this change.

Do you think that it's related to the seed? Here are the results for the erroneous test from the Travis logs:

> Well, whether such elements should exist in the library or not is up for debate. Even if they are not in the core part, they should at least exist in some examples.

I agree that this should be somewhere, either as a core element or as an example with guidelines. The latter looks better to me, since it would be the place to explain the workaround for numpy's multithreading. However, adding multiprocessing as a core element is not irrational at all. @veeresht, do you have a long-term opinion?

> From what I found, it is particularly problematic in matrix multiplication [1] (which is not used?), but the rest of the operations were not problematic. But I'm no expert in mixing numpy with multiprocessing, so I take your statement that it should be used with care :) Maybe set the number of threads that numpy can use to 1 before any execution [2], and apply the taskset trick after import [1], if this code goes forward.

I ran into the problem when propagating through MIMOFlatChannel objects. MIMOFlatChannel.propagate relies on a complex einsum which uses numpy multithreading. I ended up on the same StackOverflow threads and now use export OPENBLAS_NUM_THREADS=1 or export MKL_NUM_THREADS=1 before my parallel runs. https://github.com/veeresht/CommPy/blob/2a60fc5fbde3a80b4c3dec7a44fb75edccc350f4/commpy/channels.py#L376-L379

> As I mostly use docker to run the code (on a remote machine), the progress bar in the CLI is not really useful, but it is still good to know the option.

Same here, but I redirect the progress bar to a log file so that I have costless insight into the computation time of each iteration.

> About your example code in particular, that is kind of what I do, but instead of parallelizing over multiple models (since I only want to test one), I parallelize over multiple SNRs (and MCS in the case of WiFi). In the end, the answer to "what makes sense to parallelize" is "it depends on what you're comparing" and, of course, "how many cores can you throw at it". If you have enough cores, parallelizing over each SNR may take more advantage of the available computing power. & I guess the best strategy would be to parallelize the iterations inside link_performance, both over the SNRs and over the number of sends until send_max; only the first is parallelized in the current state. A good change would be to iterate over the SNRs sequentially but parallelize the sends until send_max, with some shared memory to synchronize the number of runs and errors so that each SNR knows when to stop.

If you test several MCS at once, it means that you're testing several models at once. In CommPy, a model includes the transmitter and the receiver.

Anyway, if you have enough cores to parallelize over each model and still have some cores available, you can easily split the workload along the send_max axis. Here is a snippet for the case where you have only one model. It is not hard to generalize the code with the same idea.

import numpy as np
from commpy.links import link_performance
from p_tqdm import p_map

# Create a tuple with nb_core copies of the same model
models = (model,) * nb_core

# Helper function: each copy computes a fraction of the iterations and errors.
# SNRs, nb_it, nb_err, chunk and code_rate are assumed to be defined above.
def perf(model):
    return link_performance(model, SNRs, nb_it // nb_core, nb_err // nb_core, chunk, code_rate)

# Split the computation and aggregate the result
partial_BERs = p_map(perf, models)
BERs = np.array(partial_BERs).mean(0)