swincas / cookies-n-code

A repo for code review sessions at CAS
http://astronomy.swin.edu.au/
MIT License

Having trouble running emcee jobs on OzSTAR #19

Closed caitlinadams closed 6 years ago

caitlinadams commented 6 years ago

I've just transitioned over to OzSTAR, but can't get my emcee jobs to run. They appear to be submitted but then disappear. No error or output files are being generated. The program was running fine on g2, so I think it must be something either with installed packages or my sbatch script. @manodeep have you run emcee on OzSTAR yet?

The steps I took to install emcee are:

  1. load the anaconda package
  2. create a conda environment
  3. pip install emcee
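In command form, those steps look something like the sketch below. The module version and environment name are assumed from the jobscript (check `module avail anaconda` on OzSTAR for the exact tag):

```shell
# Sketch of the install steps above; module/version names are assumptions
module load anaconda3/5.1.0
conda create -n rsd_dv_analysis python
source activate rsd_dv_analysis
pip install emcee
```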

The jobscript looks like this:

#SBATCH -J emcee_mock_1
#SBATCH -o ozstar.swin.edu.au:/home/cadams/submissions/output/emcee_mock_1_simultaneous_meanmocknexp_nexpnorm_baddzero_fbeta_kmax0p15_z0p1_sigg3p0_sigu15p0_d2030_2018528.out
#SBATCH -e ozstar.swin.edu.au:/home/cadams/submissions/error/emcee_mock_1_simultaneous_meanmocknexp_nexpnorm_baddzero_fbeta_kmax0p15_z0p1_sigg3p0_sigu15p0_d2030_2018528.err
#SBATCH --account=oz073
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --time=24:00:00
#SBATCH --mem-per-cpu=4G

echo `date`
module purge
module load anaconda3/5.1.0
source activate rsd_dv_analysis
cd /fred/oz073/cadams/RSD_DV_Analysis
srun python emceerun.py ./inputoutput/Covariances/odensz0.1_pvz0.053_d2030_ kmin0.002500_kmax0.150000_sigmag3.000000 norsd_kmin0.150000_kmax1.000000 kmin0.002500_kmax0.150000_sigmag3.000000_sigmau15.000000 kmin0.002500_kmax0.150000_sigmau15.000000 ./inputoutput/Gridded_Data/ mock_1 _odensz_gridded_30mpch_odens_sample_meanmocknexp_nexpnorm.txt _pvz_mostmass_gridded_20mpch_vel_sample_consterr_0.12_compl_NEW.txt ./inputoutput/Chains/ simultaneous_meanmocknexp_nexpnorm_baddzero_fbeta_kmax0p15_z0p1_sigg3p0_sigu15p0_d2030 400 500 --simultaneous &
wait
echo `date`
echo Job done
source deactivate

Within emceerun.py, I have requested 16 threads:

import numpy as np
import emcee

dd_fit_args, dd_add_args = gen_dd_args(args.cov_path, args.data_name,
                                       args.dd_lin_ext, args.dd_nonlin_ext)
vv_args = gen_vv_args(args.cov_path, args.data_name, args.vv_ext)
dv_args = gen_simultaneous_dv_args(args.cov_path, args.data_name,
                                   args.dv_ext)

print("Reading data")
odens_data_dict = get_odens_data(args.data_path, args.data_name,
                                 args.dens_specifier)
vel_data_dict = get_vel_data(args.data_path, args.data_name,
                             args.vel_specifier)

n_dimensions = 4 #Number of free parameters
n_walkers = args.nwalkers #Number of independent walkers
n_steps = args.nsteps #Number of steps each walker takes

save_interval = 300 # Save chains every 300 seconds (5 minutes)

chain_file, loglike_file, chi2_file = gen_output_files(args.out_path,
                                                args.data_name, args.out_name)

r_g = 1.0 #Initially fixed cross-correlation coefficient
fsig8_initial = 0.5
bsig8_fit_initial = 1.2
beta_fit_initial = fsig8_initial/bsig8_fit_initial
bsig8_add_initial = 1.2
sigv_initial = 250
ld_initial = 0.09

initialguess = [fsig8_initial, sigv_initial, ld_initial, beta_fit_initial]
perturbation = [0.01, 1, 0.01, 0.01]

pos_initial = [initialguess + perturbation*np.random.randn(n_dimensions) for i in range(n_walkers)]

sampler = emcee.EnsembleSampler(n_walkers, n_dimensions, lnprob_full, threads=16, args=[r_g, dd_fit_args, dd_add_args, vv_args, dv_args, odens_data_dict, vel_data_dict])

run_emcee(save_interval, sampler, pos_initial, n_walkers, n_steps, n_dimensions, chain_file, loglike_file, chi2_file)
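As an aside, the pos_initial line above works because NumPy broadcasts the plain Python lists against the random vector. A standalone sketch (with made-up walker counts and a fixed seed, purely for illustration) of what that expression produces:

```python
import numpy as np

n_walkers, n_dimensions = 8, 4
initialguess = [0.5, 250, 0.09, 0.4]   # fsig8, sigv, ld, beta (values from above)
perturbation = [0.01, 1, 0.01, 0.01]   # per-parameter scatter

rng = np.random.default_rng(42)
# list * array and list + array both promote the lists to float arrays,
# so each walker starts at the guess plus a small per-parameter offset
pos_initial = [initialguess + perturbation * rng.standard_normal(n_dimensions)
               for _ in range(n_walkers)]

pos = np.array(pos_initial)
print(pos.shape)  # (8, 4)
```

Each row is one walker, scattered around the initial guess by the per-parameter perturbation scale.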

I've also tried submitting the job on an interactive node using salloc --account=oz073 --nodes=1 --ntasks-per-node=16 --time=4:00:00 --mem-per-cpu=4G. The print statements from emceerun.py did appear for each thread, but then the program never got any further.

Any help would be greatly appreciated! I'm at a complete loss for what I'm doing wrong.

manodeep commented 6 years ago

I haven't run emcee yet. I have a feeling that this is because the mpi4py package is from conda while srun is using a different openmpi version.

What happens if you completely disable conda (i.e., remove /path/to/conda from your $PATH variable) and then run module load python, pip install emcee --user, etc.?
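A sketch of that suggestion; the python module name is an assumption (check `module avail python` on OzSTAR for the real tag):

```shell
# Use the system python module instead of conda (module name assumed)
module purge
module load python
pip install --user emcee   # installs under ~/.local
python -c "import emcee; print(emcee.__version__)"
```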

If that does not work, you may want to ask Swinburne hpc-support.

caitlinadams commented 6 years ago

The interactive version got further than previously but still seemed to get stuck. I submitted it as a batch script again and saw:

[cadams@farnarkle2 submissions]$ squeue -u cadams
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            220425   skylake emcee_mo   cadams PD       0:00      1 (None)
[cadams@farnarkle2 submissions]$ squeue -u cadams
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            220425   skylake emcee_mo   cadams CG       0:01      1 john16
[cadams@farnarkle2 submissions]$ squeue -u cadams
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

So it went from PENDING (PD) straight to COMPLETING (CG) without ever running. I've already asked hpc-support whether they'll install emcee as a module, so I'll write to them again to say that I'm also having trouble launching jobs with it -- either using anaconda or using pip install emcee --user.

Thanks for the help! I will reply here if I'm able to get it working for anyone who might want to work with emcee in the future.

caitlinadams commented 6 years ago

I've resolved it with hpc-support. It was down to how I specified my output and error files in the sbatch script: it turns out Slurm was confused by the ozstar.swin.edu.au: prefix that I had. This used to work fine in my g2 scripts, so I had carried the practice over here.

As for emcee, it all appears to be working. I'm currently using the suggestion of pip install emcee --user -- so thanks for that, @manodeep!

caitlinadams commented 6 years ago

For anyone who might need this in the future, note also that srun is not required for emcee: in this case srun was launching 16 copies of the whole job, each of which then tried to spawn its own 16 threads.
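Putting the two fixes together, a minimal sketch of the working pattern: plain filesystem paths for -o/-e (no host: prefix) and no srun. The output filenames here are simplified placeholders; account and paths are copied from the original script above:

```shell
#!/bin/bash
#SBATCH -J emcee_mock_1
#SBATCH -o /home/cadams/submissions/output/emcee_mock_1.out   # plain path, no ozstar.swin.edu.au: prefix
#SBATCH -e /home/cadams/submissions/error/emcee_mock_1.err
#SBATCH --account=oz073
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --time=24:00:00
#SBATCH --mem-per-cpu=4G

date
module purge
module load anaconda3/5.1.0
source activate rsd_dv_analysis
cd /fred/oz073/cadams/RSD_DV_Analysis
# no srun: emcee's threads=16 handles the parallelism within the one task
python emceerun.py <args as in the original script> --simultaneous
date
echo Job done
```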