Closed Jane550 closed 10 months ago
Hi Johannes! The likelihood function in my case is very complex, and one of the reasons for its complexity is that it includes reading data from an h5 file. Therefore, each time I call the likelihood function during parallelization, I need to open the h5 file, which I suspect is very time-consuming. I plan to read the file into an array once and for all (the data are the same for every worker), and then every process calling the likelihood with a different set of parameters can start from the array in memory. Is what I want to do to 'initialize the pool'? Right now I am using MPIPool from Schwimmbad. The structure is like this:
Maybe I need to add something to this structure?
Thanks for reaching out. Unfortunately, I'm not that familiar with Schwimmbad. Additionally, I'd need to know a bit more about your problem to understand whether the above code will give decent performance. The above code will most likely not require you to read a file during every likelihood call. However, it may still require transferring memory from the main process to the sub-processes during every call. This can be a severe bottleneck in many situations.
Do you actually need MPI, i.e., are you trying to parallelize over multiple nodes of a cluster? If you just want to parallelize over multiple cores of a single machine, you could use Nautilus' internal SMP parallelization, which takes care of this issue.
If I use SMP, do I just include the file-reading part in the likelihood function, or do I still have to do the initialization somewhere? By the way, the 'pool' in SMP can be a tuple: the first element specifies the pool used for likelihood calls and the second the pool for sampler calculations. Suppose I have 72 cores on one node and I'm using SMP. Which is the faster way to parallelize: a tuple, say 'pool=(72,72)', or just 'pool=72'?
If you use the SMP parallelization, i.e., pool=72, nothing else needs to be done. Nautilus will internally distribute the likelihood function to all workers. Regarding the second question: pool=(72,72) and pool=72 should perform essentially equally well, since both distribute internal Nautilus calculations and likelihood computations over 72 cores. The only difference is that the former uses two separate pools, but this shouldn't really make a difference.
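In code, the two options look like this (a minimal sketch; the Prior object and the likelihood function are assumed to be defined elsewhere):

from nautilus import Sampler

# Option 1: one pool of 72 processes, shared by likelihood calls and
# internal sampler calculations.
sampler = Sampler(prior, log_likelihood, pool=72)

# Option 2: two separate pools of 72 processes each, the first for likelihood
# calls and the second for sampler calculations.
# sampler = Sampler(prior, log_likelihood, pool=(72, 72))

sampler.run(verbose=True)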
def esd_model(theta):
    esd = []
    nbr_directory = '/home/mingtaoyang/lensing_fairness_of_comparison/emulation'
    lens_directory = '/home/mingtaoyang/lensing_preface/read_central_galaxy_cat'
    # ------------------------------------- true gap -----------------------------------------------------
    filenames = ['nbrtrue0_1.hdf5', 'nbrtrue0_2.hdf5', 'nbrtrue0_3.hdf5', 'nbrtrue0_4.hdf5',
                 'nbrtrue0_5.hdf5', 'nbrtrue0_6.hdf5', 'nbrtrue1_1.hdf5', 'nbrtrue1_2.hdf5',
                 'nbrtrue1_3.hdf5', 'nbrtrue1_4.hdf5', 'nbrtrue1_5.hdf5', 'nbrtrue1_6.hdf5']
    right_edges = [12, 13, 13, 14, 15, 16, 12, 13, 13, 14, 15, 16]
    for index, filename in enumerate(filenames):
        esd_sm = np.zeros(8)  # average ESD in each of the 8 radial bins
        esd_rad = [[] for n in range(8)]  # per-bin ESD values, accumulated over all lenses
        range_left = right_edges[index] - 8
        range_right = right_edges[index]
        # range_left = 8  # fit the 8,9,10,11,12-th pts of the largest 3 stellar mass bins
        with h5py.File(os.path.join(lens_directory, filename[3:-5] + '_ready.hdf5'), 'r') as lens_cat:
            z_l = lens_cat['data'][:, 3]
            logLc = lens_cat['data'][:, 5]
            logLgap = lens_cat['data'][:, 6]
        with h5py.File(os.path.join(nbr_directory, filename), 'r') as f:
            for i in f['lens_index'][:]:
                logMh = logMh_model(theta, logLc[int(i)], logLgap[int(i)]) / cosmo['h']
                dis_com = []
                count_edges = [0]
                for j in range(range_left, range_right):
                    dis_com.append(f['distance_com'][str(i)][:][f['radidx'][str(i)][:] == j])
                    count_edges.append(count_edges[-1] + len(dis_com[-1]))
                dis_com = np.concatenate(dis_com)
                profile = HaloProfileNFW(c_M_relation=concentration, fourier_analytic=True,
                                         projected_analytic=True, cumul2d_analytic=True,
                                         truncated=False)
                # physical ESD, concatenated over radial bins
                esd_con = (profile.cumul2d(cosmo, r_t=dis_com, M=float(10**logMh),
                                           a=1 / (1 + z_l[int(i)]), mass_def=mass_definition)
                           - profile.projected(cosmo, r_t=dis_com, M=float(10**logMh),
                                               a=1 / (1 + z_l[int(i)]), mass_def=mass_definition)
                           ) * np.power(1 + z_l[int(i)], 2)
                for j in range(8):
                    esd_rad[j].append(esd_con[int(count_edges[j]):int(count_edges[j + 1])])
        for k in range(8):
            esd_sm[k] = np.mean(np.concatenate(esd_rad[k]))
        esd.append(esd_sm)
    return np.concatenate(esd) * 10**(-12) / cosmo['h']  # in units of M_sun hpc^{-2}


def log_likelihood(param_dict):
    theta = np.array([param_dict['logMa'], param_dict['logMb'], param_dict['beta1'],
                      param_dict['alpha2'], param_dict['beta2'], param_dict['beta3'],
                      param_dict['gamma3']])
    model = esd_model(theta)
    ll = multivariate_normal(mean=model, cov=cov1).logpdf(y1)
    return ll
The above is my likelihood function. So if I use SMP, I don't need to modify the file-reading part in 'esd_model'? Will the h5 files be read only once and broadcast to every likelihood call?
Not quite, actually. You still need to write your function in such a way that a call to log_likelihood does not trigger reading in a file for every likelihood call. Currently, your code still does that, it seems. Sorry if I misunderstood you initially.
Yeah, the code currently reads the file in every call. I thought SMP could automatically take care of that, which confused me a lot. So my question is: where should I put the file-reading part so that the code reads it only once?
What the SMP implementation of Nautilus takes care of is that you don't have to send the likelihood function and associated data from the main process to the sub-processes during each likelihood call. However, it does not take care of how you read in the data; that is outside the scope of what Nautilus does.
One way to implement this is to read in the file once and then have the likelihood function take the data as an additional argument, i.e., def log_likelihood(data, param_dict), and then add likelihood_args=[data, ] when initializing the sampler.
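Put together, a minimal sketch of this pattern, following the argument order described above (data first, parameters last); the file name, dataset name, model, and prior bounds are placeholders, not taken from the actual code above:

import h5py
import numpy as np
from nautilus import Prior, Sampler

# Read the data once, in the main script, before creating the sampler.
with h5py.File('lens_catalogue.hdf5', 'r') as f:  # hypothetical file name
    data = f['data'][:]                           # hypothetical dataset

def log_likelihood(data, param_dict):
    # The pre-loaded array arrives as the first argument; only the parameter
    # values change from call to call.
    model = param_dict['logMa'] + param_dict['logMb'] * data  # placeholder model
    return -0.5 * np.sum((data - model)**2)                   # placeholder Gaussian log-likelihood

prior = Prior()
prior.add_parameter('logMa', dist=(9, 11))   # placeholder bounds
prior.add_parameter('logMb', dist=(10, 12))

sampler = Sampler(prior, log_likelihood, likelihood_args=[data, ], pool=72)
sampler.run(verbose=True)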
I now read in the data from the h5 files at the beginning of the Python script. Some of the data are arrays and some are lists. I define 'log_likelihood' as log_likelihood(y1,cov1,z,dis_com,logLc,logLgap,radidx,start_bins,end_bins,param_dict), and I add 'likelihood_args=[y1,cov1,z,dis_com,logLc,logLgap,radidx,start_bins,end_bins, ]' in 'Sampler'. Did I initialize the pool correctly by doing this? I requested 72 processes, but when I checked the actual number of CPUs in use, it was not always 72. It rose and fell over time, and sometimes only 10 CPUs were working. Maybe this bottleneck shows that I didn't initialize the pool correctly, or there are other possible reasons. I think not always making full use of the CPUs is an important reason for the low speed, but I really don't know why it happens.
I think you probably initialized the pool correctly, and it is expected that not all cores are used all the time. First, Nautilus processes likelihood evaluations in batches, and the default batch size is 100. With 72 CPU cores, that means that for each batch the first 72 evaluations are processed by all cores, whereas the remaining 28 evaluations keep only 28 cores busy. So choose a batch size that's a multiple of 72, i.e., 72 or 144. You can set this with n_batch=72. Additionally, not all parts of Nautilus can make use of all cores. But I hope that most of the time you're using 72 cores.
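For example, continuing the sketch above with the same assumed names:

# Batch size chosen as a multiple of the 72 worker processes so that no
# workers sit idle at the end of each batch.
sampler = Sampler(prior, log_likelihood, likelihood_args=[data, ],
                  pool=72, n_batch=144)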
Thanks! I also noticed in the docs that by setting 'vectorized=True', the likelihood function can receive multiple input sets at once. Is this something like parallelism, where we call the likelihood for different parameters simultaneously? And another quick question: can one core evaluate more than one likelihood at once if memory allows? Say I have 72 cores in total; when I set n_batch=72, do I actually have 2 likelihoods evaluated simultaneously on the same core?
No, vectorization is something different. It refers to likelihood functions that can be sped up through the use of NumPy vectorization. I don't think this applies to you. I've also implemented it in such a way that parallelization is turned off if vectorization is turned on, so make sure to set vectorized=False. I don't quite understand your second question, but let me clarify that each process solves likelihood evaluations one by one. It never solves two likelihoods simultaneously (unless you use vectorization).
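To illustrate what NumPy vectorization means here (a generic sketch, unrelated to Nautilus' calling convention or the ESD model above):

import numpy as np

def log_likelihood_loop(x_values):
    # One Python-level evaluation per value: no speed-up from NumPy.
    return np.array([-0.5 * x**2 for x in x_values])

def log_likelihood_vectorized(x_values):
    # A single array operation over the whole batch of values: faster,
    # but Nautilus then turns off process-level parallelization.
    x = np.asarray(x_values)
    return -0.5 * x**2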
Okay, I understand now. By setting n_batch to a multiple of the number of cores I use, and reading in the data once and passing it via 'likelihood_args', am I making the most of Nautilus' SMP parallel structure? Is there anything else I can improve to speed things up, regardless of the specific form of the likelihood function?
No, I think what we discussed should take care of the most important performance aspects.
What's the difference between batch sizes that are different multiples of the number of cores? Is it the case that after each batch of likelihood evaluations we make a boundary decision? I tried batch sizes of 72, 144, and 216 with 72 cores, and it seemed that 216 is more efficient than 144, and 144 more efficient than 72.
Under normal circumstances, there shouldn't be a big difference between the different batch sizes. The only real difference for the algorithm is that with larger batch sizes you will get new boundaries a little later, since new boundaries are only created after a certain number of high-likelihood points are found (and only after a batch is finished). In practice, if your likelihood evaluations have variable execution times, larger batch sizes may be better. For example, let's assume your likelihood evaluation time varies from 0.5 to 3 seconds per call. With a batch size of 72, the wall time of the whole batch is determined by the slowest call, i.e., 3 seconds; in other words, the cores with fast likelihood calls are idle most of the time. With a batch size of 216, the relative variation in computation time per core is smaller, since every core solves 3 likelihoods on average.
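A quick way to see this effect is a toy simulation of the batch wall time (a sketch, assuming calls are split evenly over the workers and call times are uniform between 0.5 and 3 seconds):

import numpy as np

rng = np.random.default_rng(0)
n_cores = 72

for n_batch in [72, 144, 216]:
    # Draw random per-call execution times and split each batch evenly
    # across the cores; a core's work time is the sum of its calls.
    times = rng.uniform(0.5, 3.0, size=(1000, n_batch))
    per_core = times.reshape(1000, n_cores, -1).sum(axis=-1)
    wall_time = per_core.max(axis=1).mean()      # batch ends when the slowest core finishes
    ideal = times.sum(axis=1).mean() / n_cores   # perfectly balanced workload
    print(n_batch, round(wall_time / ideal, 2))  # overhead factor relative to ideal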
I see. As I understand it, the most appropriate batch size varies from case to case, and we have to experiment to find the best value, balancing likelihood evaluation against finding new points in the right direction. Is that the spirit? And here is a small question that is not that relevant to this issue: is it possible to resume the sampling process with an extended parameter space? For example, my former parameter space is $a \in [9, 11]$ and $b \in [10, 12]$. Then I find in the corner plot that the maximum-likelihood estimate for $b$ sits right at the edge, say $b = 10.01$, and could even be smaller than 10. In this case I would like to extend the boundaries for $b$ to $[8, 12]$, for example. Do I then have to recalculate the likelihood from scratch, or can I reuse the points in the former region that already have an evaluated likelihood?
No, if you change your prior, you have to re-run from scratch. In principle, one could reuse results from the first run, but this isn't implemented right now. I also don't think there are many samplers that would support such a scenario.
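For the example above, re-running from scratch just means defining a new Prior with the wider bounds and starting a new run, roughly like this (a sketch with a placeholder likelihood; the filepath name is hypothetical):

from nautilus import Prior, Sampler

def log_likelihood(param_dict):
    # Placeholder likelihood for the two-parameter example above.
    return -0.5 * ((param_dict['a'] - 10.0)**2 + (param_dict['b'] - 10.01)**2)

prior = Prior()
prior.add_parameter('a', dist=(9, 11))
prior.add_parameter('b', dist=(8, 12))  # widened from (10, 12)

# A fresh run from scratch; use a new checkpoint file so the old run is not
# accidentally resumed.
sampler = Sampler(prior, log_likelihood, pool=72, filepath='run_wide_prior.hdf5')
sampler.run(verbose=True)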
Okay. Thanks for your patience!