hpparvi / PyTransit

Fast and easy exoplanet transit light curve modelling.
GNU General Public License v2.0

What is the fastest way to run PyTransit inside emcee on a multi-core CPU machine? #72

Closed: TakuNishiumi closed this issue 3 years ago

TakuNishiumi commented 3 years ago

Describe the bug This is not a bug, just a question. I ran PyTransit inside emcee with a pool, but it didn't speed up the calculation. What is the fastest way to do this on a multi-core CPU server?

To Reproduce We have a server with a 64-core, 128-thread CPU running Ubuntu 18.04. I ran emcee on it like this, calling PyTransit inside the log_probability function:

from multiprocessing import Pool
import emcee

processers = 1
with Pool(processes=processers) as pool:
    sampler = emcee.EnsembleSampler(nwalkers, ndim, log_probability, args=args, pool=pool)

It took 22 minutes and used 64 threads. Almost all of the threads showed 99% in htop, and one showed CPU% > 6000.

I changed processers=2 and ran the same script. It took 17 minutes.

Finally, I ran this script, which doesn't use a pool.

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_probability, args=args)

It took 17 minutes. Do you know why?

TakuNishiumi commented 3 years ago

htop screenshots attached: with pool and processers=1, with pool and processers=2, and without pool.

hpparvi commented 3 years ago

Hi Taku,

There are two straightforward ways to parallelise the code when using PyTransit with emcee. The first is to use a pool as you're doing now, in which case you should set the 'NUMBA_NUM_THREADS' environment variable to 1 before running your script, to make sure PyTransit isn't trying to parallelise each transit model evaluation separately. The second is to parallelise the log posterior function yourself and pass vectorize=True when initialising the EnsembleSampler. When you do this, the sampler gives your log posterior function a 2D array of parameter vectors and expects it to return an array of log posterior values, one per walker.
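A minimal sketch of the first option follows; the sizes, starting positions, and posterior below are placeholders rather than anything from your script, and the environment variable has to be set before PyTransit/Numba are imported.

import os
os.environ['NUMBA_NUM_THREADS'] = '1'        # set before PyTransit/Numba are imported

from multiprocessing import Pool
import numpy as np
import emcee

ndim, nwalkers, nsteps = 9, 64, 1000         # illustrative sizes
p0 = 1e-4 * np.random.randn(nwalkers, ndim)  # illustrative starting positions

def log_probability(pv):                     # placeholder; your real posterior goes here
    return -0.5 * np.sum(pv**2)

if __name__ == '__main__':
    with Pool(processes=64) as pool:
        sampler = emcee.EnsembleSampler(nwalkers, ndim, log_probability, pool=pool)
        sampler.run_mcmc(p0, nsteps)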

The second option can be much faster if, as you're doing now, you're running your code on a single machine with many cores (shared-memory parallelisation) rather than on a cluster. However, it takes a bit more work to set up and is useful only if the computationally heavy parts of the posterior evaluation are parallelised, for example using OpenMP or OpenCL (for GPU computation).

PyTransit evaluates the transit model in parallel if you give parameter arrays to the evaluate method instead of scalars (you can find details in the example notebooks). This should be much faster than using pool, but only if the transit model evaluation is the most computationally demanding part of the posterior calculation.
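To make this concrete, here is a hedged sketch of a vectorised log posterior that passes the whole walker population to evaluate in one call. The names log_prior and log_likelihood are hypothetical helpers, times/nwalkers/ndim come from your script, and the exact array shapes evaluate expects for a parameter population are the ones described in the example notebooks.

import numpy as np
import emcee
from pytransit import QuadraticModel

tm = QuadraticModel()
tm.set_data(times)                           # times: the observation time array

def log_posterior_vectorized(pvs):
    # pvs has shape (nwalkers, ndim) with columns [k, t0, u1, u2, p, a, i, e, w]
    k, t0, u1, u2, p, a, i, e, w = pvs.T
    ldc = np.column_stack([u1, u2])
    # Parameter arrays let PyTransit evaluate all the transit models in parallel.
    fluxes = tm.evaluate(k=k, ldc=ldc, t0=t0, p=p, a=a, i=i, e=e, w=w)
    return np.array([log_prior(pv) + log_likelihood(f) for pv, f in zip(pvs, fluxes)])

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior_vectorized, vectorize=True)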

Can you give a bit more detail about your log_probability function? Especially, how are you calculating your log likelihood? If you're using a GP log likelihood (George, Celerite, etc.), the first approach (with pool) is probably the best. If not, writing a log posterior function that evaluates the posterior density for all the walkers in one go (also something you can parallelise easily with numba) should give you a nice speedup.

Cheers, Hannu

TakuNishiumi commented 3 years ago

Hi Hannu,

Thank you. I use george, and the log_probability function calculates gp.log_likelihood(ys) + lp, where gp is a george.GP object, ys is the light curve data, and lp is the prior. I use PyTransit as the GP's mean model.

from pytransit import QuadraticModel

def transit_model_pytransit(pv, times):
    k, t0, u1, u2, p, a, i, e, w = pv
    m = QuadraticModel()
    m.set_data(times)
    flux_model = m.evaluate(k=k, ldc=[u1, u2], t0=t0, p=p, a=a, i=i, w=w, e=e)  # calculates the light curve
    return flux_model
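Roughly, the posterior looks like this (a simplified sketch: log_prior stands in for my priors, and here the GP models the residuals after subtracting the transit model rather than using PyTransit as george's mean model):

import numpy as np
import george
from george import kernels

# times, ys, yerr: the light curve data
gp = george.GP(np.var(ys) * kernels.Matern32Kernel(1.0))
gp.compute(times, yerr)                      # factorise the covariance matrix once

def log_probability(pv, times, ys):
    lp = log_prior(pv)                       # log_prior stands in for my priors
    if not np.isfinite(lp):
        return -np.inf
    residuals = ys - transit_model_pytransit(pv, times)
    return lp + gp.log_likelihood(residuals)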

I tried setting the 'NUMBA_NUM_THREADS' environment variable to 1 before initializing the EnsembleSampler. When I ran the same script without pool, it showed strange behavior: htop showed 64 threads in use. I also tried numba.set_num_threads(1), but that also used 64 threads.

asmasca commented 3 years ago

It could be some other package (numpy, scipy) used internally that is trying to parallelize something. I've seen this behaviour on some computers.

You can try including the following environment variables.

os.environ['OPENBLAS_NUM_THREADS'] = '1'

and/or

os.environ['MKL_NUM_THREADS'] = '1'
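These only take effect if they are set before NumPy/SciPy initialise their threading backends, so a common pattern (just a sketch, not the only way) is to put them at the very top of the script, before any scientific imports:

import os

# Limit the BLAS thread pools before numpy/scipy are imported,
# otherwise the settings may be ignored.
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'

import numpy as np   # imported only after the limits are in place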

TakuNishiumi commented 3 years ago

Dear @asmasca,

Thank you. I tried using only os.environ['OPENBLAS_NUM_THREADS'] = '1', and then both os.environ['OPENBLAS_NUM_THREADS'] = '1' and os.environ['MKL_NUM_THREADS'] = '1'. However, nothing changed. When I ran without pool, unfortunately it still used 64 threads.

TakuNishiumi commented 3 years ago

I set os.environ["OMP_NUM_THREADS"] = "1" and ran the script with pool, and it sped up: it took 8 minutes. Great! Thank you! However, the CPU% of each thread is very low...

hpparvi commented 3 years ago

Hi Taku,

Can you try two things:

First, move the QuadraticModel initialisation out of the transit_model_pytransit function. You need to initialise the transit model only once in your script (since the times array doesn't change during the MCMC). Initialising a new transit model every time you want to evaluate the model is very inefficient, because the initialisation needs to set up several index arrays, etc. So, modify the code to something like

def transit_model_pytransit(pv, m):
    k, t0, u1, u2, p, a, i, e, w = pv
    flux_model = m.evaluate(k=k, ldc=[u1, u2], t0=t0, p=p, a=a, i=i, w=w, e=e)  # calculates the light curve
    return flux_model

This should make a huge difference! If it doesn't, check if George is actually the culprit by skipping the transit model evaluation and giving the log-likelihood function a constant array of ones.
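For reference, here is a sketch of how the one-time initialisation and the args wiring could look; the exact args tuple depends on your log_probability signature, so treat the names below as placeholders.

from pytransit import QuadraticModel

tm = QuadraticModel()
tm.set_data(times)   # initialise once; the times array doesn't change during the MCMC

# Reuse the ready-made model on every evaluation by passing it through emcee's args.
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_probability,
                                args=(times, ys, tm), pool=pool)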

Cheers, Hannu

TakuNishiumi commented 3 years ago

Hi Hannu,

Thank you. I will try it!

Sincerely, Taku