Closed Jane550 closed 10 months ago
Hi Johannes! The likelihood function in my case is very complex, and one of the reasons for its complexity is that it includes reading data from an h5 file. Therefore, each time I call the likelihood function during parallelization, I need to open the h5 file, which I suspect is very time-consuming. I plan to read the file into an array once and for all (the data are the same for every worker), and then every process calling the likelihood with a different set of parameters can start from the array in memory. Is what I want to do to 'initialize the pool'? Right now I am using MPIPool from Schwimmbad. The structure is like this:
Maybe I need to add something to this structure?
Thanks for reaching out. Unfortunately, I'm not that familiar with Schwimmbad. Additionally, I'd need to know a bit more about your problem to understand whether the above code will give decent performance. The above code will most likely not require you to read a file during every likelihood call. However, it may still require transferring memory from the main process to the sub-processes during every call. This can be a severe bottleneck in many situations.
Do you actually need MPI, i.e., are you trying to parallelize over multiple nodes of a cluster? If you just want to parallelize over multiple cores of a single machine, you could use Nautilus' internal SMP parallelization, which takes care of this issue.
If I use SMP, do I just include the file-reading part in the likelihood function, or do I still have to do the initialization somewhere? By the way, the 'pool' in SMP can be a tuple: the first element specifies the pool used for likelihood calls and the second the pool for sampler calculations. Suppose I have 72 cores on one node and I'm using SMP. Which is the faster way to parallelize: a tuple, say 'pool=(72,72)', or just 'pool=72'?
If you use the SMP parallelization, i.e., pool=72, nothing else needs to be done. Nautilus will internally distribute the likelihood function to all workers. Regarding the second question: pool=(72,72) and pool=72 should perform essentially equally well, since both distribute internal Nautilus calculations and likelihood computations over 72 cores. The only difference is that the former uses two separate pools, but this shouldn't really make a difference.
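In code, the two options look like this (a minimal sketch; the Prior object and the likelihood function are assumed to be defined elsewhere):

from nautilus import Sampler

# Option 1: one pool of 72 processes, shared by likelihood calls and
# internal sampler calculations.
sampler = Sampler(prior, log_likelihood, pool=72)

# Option 2: two separate pools of 72 processes each, the first for likelihood
# calls and the second for sampler calculations.
# sampler = Sampler(prior, log_likelihood, pool=(72, 72))

sampler.run(verbose=True)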
def esd_model(theta):
    esd = []
    nbr_directory = '/home/mingtaoyang/lensing_fairness_of_comparison/emulation'
    lens_directory = '/home/mingtaoyang/lensing_preface/read_central_galaxy_cat'
    # ------------------------------------- true gap -----------------------------------------------------
    filenames = ['nbrtrue0_1.hdf5', 'nbrtrue0_2.hdf5', 'nbrtrue0_3.hdf5', 'nbrtrue0_4.hdf5',
                 'nbrtrue0_5.hdf5', 'nbrtrue0_6.hdf5', 'nbrtrue1_1.hdf5', 'nbrtrue1_2.hdf5',
                 'nbrtrue1_3.hdf5', 'nbrtrue1_4.hdf5', 'nbrtrue1_5.hdf5', 'nbrtrue1_6.hdf5']
    right_edges = [12, 13, 13, 14, 15, 16, 12, 13, 13, 14, 15, 16]
    for index, filename in enumerate(filenames):
        esd_sm = np.zeros(8)  # average ESD in each of the 8 radial bins
        esd_rad = [[] for n in range(8)]  # per-bin ESD values, accumulated over all lenses
        range_left = right_edges[index] - 8
        range_right = right_edges[index]
        # range_left = 8  # fit the 8,9,10,11,12-th pts of the largest 3 stellar mass bins
        with h5py.File(os.path.join(lens_directory, filename[3:-5] + '_ready.hdf5'), 'r') as lens_cat:
            z_l = lens_cat['data'][:, 3]
            logLc = lens_cat['data'][:, 5]
            logLgap = lens_cat['data'][:, 6]
        with h5py.File(os.path.join(nbr_directory, filename), 'r') as f:
            for i in f['lens_index'][:]:
                logMh = logMh_model(theta, logLc[int(i)], logLgap[int(i)]) / cosmo['h']
                dis_com = []
                count_edges = [0]
                for j in range(range_left, range_right):
                    dis_com.append(f['distance_com'][str(i)][:][f['radidx'][str(i)][:] == j])
                    count_edges.append(count_edges[-1] + len(dis_com[-1]))
                dis_com = np.concatenate(dis_com)
                profile = HaloProfileNFW(c_M_relation=concentration, fourier_analytic=True,
                                         projected_analytic=True, cumul2d_analytic=True,
                                         truncated=False)
                # physical ESD, concatenated over radial bins
                esd_con = (profile.cumul2d(cosmo, r_t=dis_com, M=float(10**logMh),
                                           a=1 / (1 + z_l[int(i)]), mass_def=mass_definition)
                           - profile.projected(cosmo, r_t=dis_com, M=float(10**logMh),
                                               a=1 / (1 + z_l[int(i)]), mass_def=mass_definition)
                           ) * np.power(1 + z_l[int(i)], 2)
                for j in range(8):
                    esd_rad[j].append(esd_con[int(count_edges[j]):int(count_edges[j + 1])])
        for k in range(8):
            esd_sm[k] = np.mean(np.concatenate(esd_rad[k]))
        esd.append(esd_sm)
    return np.concatenate(esd) * 10**(-12) / cosmo['h']  # in units of M_sun hpc^{-2}


def log_likelihood(param_dict):
    theta = np.array([param_dict['logMa'], param_dict['logMb'], param_dict['beta1'],
                      param_dict['alpha2'], param_dict['beta2'], param_dict['beta3'],
                      param_dict['gamma3']])
    model = esd_model(theta)
    ll = multivariate_normal(mean=model, cov=cov1).logpdf(y1)
    return ll
The above is my likelihood function. So if I use SMP, I don't need to modify the file-reading part in 'esd_model'? Will the h5 files be read only once and broadcast to every likelihood call?
Not quite, actually. You still need to write your function in such a way that a call to log_likelihood does not trigger reading in a file for every likelihood call. Currently, your code still does that, it seems. Sorry if I misunderstood you initially.
Yeah, the code currently reads the file in every call. I thought SMP could automatically take care of that, which confused me a lot. So my question is: where should I put the file-reading part so that the code reads it only once?
What the SMP implementation of Nautilus takes care of is that you don't have to send the likelihood function and associated data from the main process to the sub-processes during each likelihood call. However, it does not take care of how you read in the data; that is outside the scope of what Nautilus does.
One way to implement this is to read in the file once and then have the likelihood function take the data as an additional argument, i.e., def log_likelihood(data, param_dict), and then add likelihood_args=[data, ] when initializing the sampler.
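Put together, a minimal sketch of this pattern, following the argument order described above (data first, parameters last); the file name, dataset name, model, and prior bounds are placeholders, not taken from the actual code above:

import h5py
import numpy as np
from nautilus import Prior, Sampler

# Read the data once, in the main script, before creating the sampler.
with h5py.File('lens_catalogue.hdf5', 'r') as f:  # hypothetical file name
    data = f['data'][:]                           # hypothetical dataset

def log_likelihood(data, param_dict):
    # The pre-loaded array arrives as the first argument; only the parameter
    # values change from call to call.
    model = param_dict['logMa'] + param_dict['logMb'] * data  # placeholder model
    return -0.5 * np.sum((data - model)**2)                   # placeholder Gaussian log-likelihood

prior = Prior()
prior.add_parameter('logMa', dist=(9, 11))   # placeholder bounds
prior.add_parameter('logMb', dist=(10, 12))

sampler = Sampler(prior, log_likelihood, likelihood_args=[data, ], pool=72)
sampler.run(verbose=True)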
I now read in the data from the h5 files at the beginning of the Python script. Some of the data are arrays and some are lists. I define 'log_likelihood' as log_likelihood(y1,cov1,z,dis_com,logLc,logLgap,radidx,start_bins,end_bins,param_dict), and I add 'likelihood_args=[y1,cov1,z,dis_com,logLc,logLgap,radidx,start_bins,end_bins, ]' in 'Sampler'. Did I initialize the pool correctly by doing this? I requested 72 processes, but when I checked the actual number of CPUs in use, it was not always 72. It rose and fell over time, and sometimes only 10 CPUs were working. Maybe this bottleneck shows that I didn't initialize the pool correctly, or there are other possible reasons. I think not always making full use of the CPUs is an important reason for the low speed, but I really don't know why it happens.
I think you probably initialized the pool correctly, and it is expected that not all cores are used all the time. First, Nautilus processes likelihood evaluations in batches, and the default batch size is 100. With 72 CPU cores, that means that for each batch the first 72 evaluations are processed by all cores, whereas the remaining 28 evaluations keep only 28 cores busy. So choose a batch size that's a multiple of 72, i.e., 72 or 144. You can set this with n_batch=72. Additionally, not all parts of Nautilus can make use of all cores. But I hope that most of the time you're using 72 cores.
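For example, continuing the sketch above with the same assumed names:

# Batch size chosen as a multiple of the 72 worker processes so that no
# workers sit idle at the end of each batch.
sampler = Sampler(prior, log_likelihood, likelihood_args=[data, ],
                  pool=72, n_batch=144)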
Thanks! I also noticed in the docs that by setting 'vectorized=True', the likelihood function can receive multiple input sets at once. Is this something like parallelism, where we call the likelihood for different parameters simultaneously? And another quick question: can one core evaluate more than one likelihood at once if memory allows? Say I have 72 cores in total; when I set n_batch=72, do I actually have 2 likelihoods evaluated simultaneously on the same core?
No, vectorization is something different. It refers to likelihood functions that can be sped up through the use of NumPy vectorization. I don't think this applies to you. I've also implemented it in such a way that parallelization is turned off if vectorization is turned on, so make sure to set vectorized=False. I don't quite understand your second question, but let me clarify that each process solves likelihood evaluations one by one. It never solves two likelihoods simultaneously (unless you use vectorization).
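To illustrate what NumPy vectorization means here (a generic sketch, unrelated to Nautilus' calling convention or the ESD model above):

import numpy as np

def log_likelihood_loop(x_values):
    # One Python-level evaluation per value: no speed-up from NumPy.
    return np.array([-0.5 * x**2 for x in x_values])

def log_likelihood_vectorized(x_values):
    # A single array operation over the whole batch of values: faster,
    # but Nautilus then turns off process-level parallelization.
    x = np.asarray(x_values)
    return -0.5 * x**2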
Okay, I understand now. By setting n_batch to a multiple of the number of cores I use, and reading in the data once and passing it via 'likelihood_args', am I making the most of Nautilus' SMP parallel structure? Is there anything else I can improve to speed things up, regardless of the specific form of the likelihood function?
No, I think what we discussed should take care of the most important performance aspects.
What's the difference between batch sizes that are different multiples of the number of cores? Is it the case that after each batch of likelihood evaluations we make a boundary decision? I tried batch sizes of 72, 144, and 216 with 72 cores, and it seemed that 216 is more efficient than 144, and 144 more efficient than 72.
Under normal circumstances, there shouldn't be a big difference between the different batch sizes. The only real difference for the algorithm is that with larger batch sizes you will get new boundaries a little later, since new boundaries are only created after a certain number of high-likelihood points are found (and only after a batch is finished). In practice, if your likelihood evaluations have variable execution times, larger batch sizes may be better. For example, let's assume your likelihood evaluation time varies from 0.5 to 3 seconds per call. With a batch size of 72, the wall time of the whole batch is determined by the slowest call, i.e., 3 seconds; in other words, the cores with fast likelihood calls are idle most of the time. With a batch size of 216, the relative variation in computation time per core is smaller, since every core solves 3 likelihoods on average.
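A quick way to see this effect is a toy simulation of the batch wall time (a sketch, assuming calls are split evenly over the workers and call times are uniform between 0.5 and 3 seconds):

import numpy as np

rng = np.random.default_rng(0)
n_cores = 72

for n_batch in [72, 144, 216]:
    # Draw random per-call execution times and split each batch evenly
    # across the cores; a core's work time is the sum of its calls.
    times = rng.uniform(0.5, 3.0, size=(1000, n_batch))
    per_core = times.reshape(1000, n_cores, -1).sum(axis=-1)
    wall_time = per_core.max(axis=1).mean()      # batch ends when the slowest core finishes
    ideal = times.sum(axis=1).mean() / n_cores   # perfectly balanced workload
    print(n_batch, round(wall_time / ideal, 2))  # overhead factor relative to ideal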
I see. As I understand it, the most appropriate batch size varies from case to case, and we have to experiment to find the best value, balancing likelihood evaluation against finding new points in the right direction. Is that the spirit? And here is a small question that is not that relevant to this issue: is it possible to resume the sampling process with an extended parameter space? For example, my former parameter space is $a \in [9, 11]$ and $b \in [10, 12]$. Then I find in the corner plot that the maximum-likelihood estimate for $b$ sits right at the edge, say $b = 10.01$, and could even be smaller than 10. In this case I would like to extend the boundaries for $b$ to $[8, 12]$, for example. Do I then have to recalculate the likelihood from scratch, or can I reuse the points in the former region that already have an evaluated likelihood?
No, if you change your prior, you have to re-run from scratch. In principle, one could reuse results from the first run, but this isn't implemented right now. I also don't think there are many samplers that would support such a scenario.
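For the example above, re-running from scratch just means defining a new Prior with the wider bounds and starting a new run, roughly like this (a sketch with a placeholder likelihood; the filepath name is hypothetical):

from nautilus import Prior, Sampler

def log_likelihood(param_dict):
    # Placeholder likelihood for the two-parameter example above.
    return -0.5 * ((param_dict['a'] - 10.0)**2 + (param_dict['b'] - 10.01)**2)

prior = Prior()
prior.add_parameter('a', dist=(9, 11))
prior.add_parameter('b', dist=(8, 12))  # widened from (10, 12)

# A fresh run from scratch; use a new checkpoint file so the old run is not
# accidentally resumed.
sampler = Sampler(prior, log_likelihood, pool=72, filepath='run_wide_prior.hdf5')
sampler.run(verbose=True)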
Okay. Thanks for your patience!