stan-dev / cmdstanpy

CmdStanPy is a lightweight interface to Stan for Python users which provides the necessary objects and functions to compile a Stan program and fit the model to data using CmdStan.
BSD 3-Clause "New" or "Revised" License

The performance of map_rect function #720

Closed Jinyin-Hu closed 7 months ago

Jinyin-Hu commented 10 months ago

Summary:

It seems the map_rect function does not work the way I expected. Is there anything wrong with the way I used it? Thanks.

Description:

My problem involves intensive computation, and reduce_sum didn't help much, so I tried map_rect. In my code, I broke the dataset into 21 shards. There are 4 chains, and I set 21 threads per chain. On the HPC, I used 96 CPUs to run the code, but only ~40% of the CPU capacity was used.
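Since map_rect requires rectangular data (every shard's arrays must have the same length), the 21-shard split above implies padding any uneven shards. The helper below is a hypothetical sketch of that preparation step, not code from the issue's actual model; the data and shard count are illustrative.

```python
# Hypothetical sketch: packing a flat series into equal-length rows for
# map_rect, which requires rectangular (same-length) shards.
def make_shards(y, n_shards):
    """Split a flat list into n_shards equal-length rows, padding with 0.0."""
    shard_len = -(-len(y) // n_shards)                # ceiling division
    padded = y + [0.0] * (shard_len * n_shards - len(y))
    return [padded[i * shard_len:(i + 1) * shard_len] for i in range(n_shards)]

# 100 data points split across 21 shards -> 21 rows of length 5 (5 pad zeros)
shards = make_shards([float(i) for i in range(100)], 21)
```

In the real model the padding length would also be passed to each shard so the padded entries can be skipped in the likelihood.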

The main script:

```python
from cmdstanpy import CmdStanModel

stan_file = 'test.stan'
data_file = 'data.json'

model = CmdStanModel(stan_file=stan_file,
                     cpp_options={'STAN_THREADS': 'True'},
                     force_compile=True)

fit = model.sample(data=data_file, iter_warmup=150, iter_sampling=1000,
                   adapt_delta=0.97, save_warmup=True, show_progress=True,
                   show_console=True, seed=20001, chains=4, parallel_chains=4,
                   threads_per_chain=21, max_treedepth=20)
fit.save_csvfiles(dir='output_test')

print(fit.summary())
print(fit.diagnose())
```

(Screenshots attached: Screen Shot 2023-12-06 at 11 05 31 PM, 11 06 21 PM, 11 06 37 PM)


Current Version:

cmdstanpy 1.2.0
cmdstan 2.33.1

WardBrian commented 10 months ago

If there is an issue here, I believe it would be tied to the Math library rather than cmdstanpy. But it's also possible that there just isn't enough work to saturate all those threads.

bob-carpenter commented 10 months ago

> It seems the map_rect function does not work the way I expected.

What were you expecting and what happened? No speedup? Slowdown? You're only going to use as many cores as you have shards, but that's going to happen per chain if you're running multiple chains.

I would think that with calls to fft (n log n) it would be worthwhile to shard, but the real wins come when there's a cubic operation on quadratic data, like matrix multiply or any solves, or when there's something really compute-intensive for little data, like an ODE solver.

Jinyin-Hu commented 10 months ago

Thanks @WardBrian @bob-carpenter. When I ran it on my own PC using 4 chains, 1000 warm-up and 1000 sampling iterations, it took ~4 hours without map_rect. Now, using map_rect and running on an HPC with 96 CPUs, it still takes more than 4 hours. So I'm not sure if something is wrong with my code. The output is reasonable; only the efficiency is the issue.

bob-carpenter commented 10 months ago

You might want to see if running only a single chain speeds up with the cluster compared to a single chain on your PC.

There are all kinds of things that can get in the way on a big cluster, like I/O bottlenecks in the output and, most importantly, memory contention and thread contention (because 4 chains with multiple threads per chain might oversubscribe the cores you have).
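The thread arithmetic for this thread's configuration can be checked directly (numbers taken from the issue report):

```python
# Oversubscription check for the configuration reported in this issue.
chains = 4
threads_per_chain = 21
cores = 96

total_threads = chains * threads_per_chain   # 84 worker threads
oversubscribed = total_threads > cores       # False: 84 <= 96
```

Since 84 threads fit within 96 cores, raw thread-count oversubscription isn't the culprit here, which points toward the memory-bandwidth and lock contention explanations instead.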

Jinyin-Hu commented 10 months ago

Hi @bob-carpenter, an update: I tested it with only one chain and reduced the number of sampling iterations to 200. It took 1 hour 17 minutes without any parallel computing on the HPC, and 21 minutes using map_rect with 21 shards. So I think it works with a single chain. The reason I chose 21 shards is that my data (time series) comes from 7 instruments with 3 channels each, and I assume all 21 channels are independent.

When trying multiple chains, one thing confused me: can cross-chain multi-threading and within-chain multi-threading work at the same time? E.g., I set parallel_chains=chains and threads_per_chain=21. If both apply, do these parallel chains run in a single process or in individual processes? I noticed there is a parameter in sample, force_one_process_per_chain, which is None by default. Should I set it to True when running multiple chains with within-chain parallelization? I could not figure this out from the documentation and examples. Many thanks.
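The single-chain timings above imply a parallel efficiency well below 1.0, which is worth quantifying (simple arithmetic on the reported numbers):

```python
# Parallel efficiency implied by the reported single-chain timings:
# 1h 17m (77 min) without map_rect vs 21 min with 21 shards.
serial_min = 77
parallel_min = 21
shards = 21

speedup = serial_min / parallel_min   # ~3.7x
efficiency = speedup / shards         # ~0.17, i.e. each thread ~17% utilized
```

A ~3.7x speedup from 21 shards is a real win, but the low per-thread efficiency already suggests the shards are too small (or too overhead-dominated) to scale much further by adding chains on top.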

bob-carpenter commented 10 months ago

Thanks for the update.

> Can cross-chain multi-threading and within-chain multi-threading work at the same time?

I think so, but I'm not 100% sure, so I'd like to hear from @SteveBronder.

The problem may just be thread lock contention and/or memory contention. Everything in Stan winds up going back to CPU to save, etc.

WardBrian commented 10 months ago

> I noticed there is a parameter in sample, force_one_process_per_chain, which is None by default. Should I set it to True when running multiple chains with within-chain parallelization? I could not figure this out from the documentation and examples. Many thanks.

This argument controls whether each chain is run in a separate process or one process runs all the chains on separate threads. If you're using a newer CmdStan (2.28 or newer), then this argument should detect that and be equivalent to setting it to False. You can try True if your data is small enough that fitting n_chains copies in memory isn't an issue.
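The decision logic described above can be sketched as a small truth table. This is an illustrative sketch of the behavior as explained in the comment, not CmdStanPy's actual internals; the function name is hypothetical.

```python
# Illustrative sketch (not CmdStanPy internals) of the behavior described:
# with CmdStan >= 2.28 and force_one_process_per_chain=None, all chains run
# in one process on separate threads; True forces one process per chain.
def runs_in_one_process(cmdstan_version, force_one_process_per_chain=None):
    """Return True if all chains share a single process."""
    supports_num_chains = cmdstan_version >= (2, 28)
    if force_one_process_per_chain is None:
        return supports_num_chains        # auto-detected default
    return not force_one_process_per_chain

runs_in_one_process((2, 33))                                     # True
runs_in_one_process((2, 33), force_one_process_per_chain=True)   # False
runs_in_one_process((2, 27))                                     # False
```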

Jinyin-Hu commented 10 months ago

Thanks @bob-carpenter @WardBrian. I just tried an extra test of 2 chains in separate processes (i.e., force_one_process_per_chain=True). Cross-chain and within-chain multi-threading can work at the same time. The walltime was 41 minutes with 48 CPUs. Compared with the previous test of a single chain with 24 CPUs, which took 21 minutes, it didn't improve the computational cost in my case. I agree with @bob-carpenter that thread lock contention and/or memory contention could be the reason. Actually, a few hours' cost is not that bad for me because we are not doing real-time analysis. Thanks.
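For completeness, the two timings reported in this thread can be compared as throughput (chains sampled per minute), using only the numbers stated above:

```python
# Throughput comparison from the reported timings:
#   1 chain in 21 min on 24 CPUs  vs  2 chains in 41 min on 48 CPUs.
one_chain_rate = 1 / 21    # ~0.048 chains/min
two_chain_rate = 2 / 41    # ~0.049 chains/min
```

The rates are nearly identical: doubling the CPUs and chains roughly doubled the work done per unit time but did not shorten any individual chain, which is consistent with the contention hypothesis rather than a map_rect bug.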