joezuntz / cosmosis

Other
22 stars 16 forks source link

MPI runs produce "normal" stderr and stdout and log info but empty output files #100

Closed MinhMPA closed 8 months ago

MinhMPA commented 12 months ago

When I run cosmoSIS (with the proper mpi flag) on our HPC, either interactively or remotely via slurm, the program would execute, there would be stderr and stdout files. Inside these files, there is the usual log. The log even states that the output is being saved to output file. For example, there will be lines like this:

Saving output -> nauti_des-y3_kids-1000_chain_nauti_des-y3_and_kids-1000.txt

However, the output files above were created and then left untouched, i.e. there was no write (nor read) after their creation at the start of the program. I tried two different samplers, polychord and nautilus to verify the issue is with CosmoSIS I/O itself.

I would appreciate any hint on how to proceed with debugging.

joezuntz commented 12 months ago

Hi @MinhMPA ,

This isn't a bug - both the polychord and nautilus samplers (and multinest) wait a long time and then generate a large chunk of output at once. If you try, for example, the emcee sampler, then it should generate more continuous output.

If the length of time you can run for is too short for any output to appear then you can set resume = T in the [runtime] section and it will save a second file alongside the main output file saving progress as those samplers go along. (I think you also have to set polychord_outfile_root to a file name in polychord too).

Then you can re-start the chain when you run out of time.

I'll close this issue now but please do re-open if you have any more questions.

MinhMPA commented 12 months ago

Hi @joezuntz, actually I do want to follow up because I have kept the program running on 36x4 cores (36 processes, 4 cpus per each) for 4 days on my HPC. The programs did not complete after 4 days, it got terminated at 4 days due to the time limit.

Still, only the polychord run produced about 1 thousand lines in the output file, including the comment lines. The log files on the other hand are very long, so maybe I should decrease the verbosity level, but cosmoSIS taking that long to finish seems very unusual to me.

P.S. I did check CPU loads on the HPC nodes and they indicated that all CPUs were used.

joezuntz commented 12 months ago

Hi @MinhMPA

That might be a problem or it might be expected, depending on your parameter space - how many parameters are you using, and how long does one likelihood take?

I guess you exported OMP_NUM_THREADS=4 in your submission script to set the number of cores? You have to do both that and set any option in the batch system, usually.

On the verbosity, the usual thing you want to decrease is the camb "feedback" setting. If you set it to zero that cuts out a lot of the messages.

MinhMPA commented 12 months ago

Thank you, @joezuntz! This has been very helpful already! Let me try your suggestion and return to you once I have more info from the new runs.

For the record, I was just running the ini file from the DES-Y3 and KiDS-1000 cosmic shear consistency paper: https://github.com/joezuntz/cosmosis-standard-library/blob/main/examples/des-y3_and_kids-1000.ini

MinhMPA commented 12 months ago

Hi @joezuntz! I am running some new runs on the same ini files. From your comments I have made sure the feedback and verbosity is suppressed as much as it could be. I also realized that if I don't set some environment variables, e.g. the CFITSIO library paths, to match the paths I used when compiling, CosmoSIS won't spawn the correct number of MPI tasks.

But even after making sure of the things above, nautilus and polychord runs with 36x3 CPUs each have not output anything after a wallclock time of ~16 hours. I attach my submission script, the stderr and stdout files below. One strange thing I notice is that stdout is mostly filled with

Gathering theory outputs to calculate the likelihood
cosebis stops at bin 5

that seems to indicate CosmoSIS is getting stuck during the likelihood evaluation of cosebis?

submission_script_and_std.zip Screenshot 2023-09-09 at 10 03 49 PM (3)

MinhMPA commented 11 months ago

Hi @joezuntz, upon more investigation, I think the issue might indeed be with cosebis, I also get this kind of warning message when cosmoSIS tries to setup cosebis

Setting up module cosebis_like
-------------------------------
Covariance matrix is not in the same order as the 2pt measurement extensions...doing some damn fiddly
re-ordering, if I screw it up it's your fault for not putting your covariance in the right order
Setup all pipeline modules

Then I will end up with hundreds to thousands of these same lines

Gathering theory outputs to calculate the likelihood
cosebis stops at bin 5

one next to the other.

This happens to both polychord and nautilus runs. I did ask @johannesulf to check my nautilus sampler status file when the run is finally terminated (after 3 days on 3 HPC nodes using full 3x36CPUs). To quote Johannes directly,

your output clearly indicates that nautilus is far from done. It basically just started the computation, only having to evaluate around 20,000 likelihoods.

I believe I'm speaking for @elisaferreira, my collaborator, as well. She's having the same issue trying to run the same DES Y3 and KiDS-1000 cosmic shear consistency run(s).

joezuntz commented 11 months ago

Hi @MinhMPA - could you try running with the setting timing=T in the [pipeline] setting? That will print out some timing information we can use to work out how long it will take for you. It might be that the cosebis is taking longer than expected on your system for some reason.

Could you also email me the script you are using to submit to the supercomputer?

MinhMPA commented 11 months ago

Hi @joezuntz, thanks for the suggestion! I got the timings here:

consistency took: 0.013 seconds
consistency took: 0.022 seconds
consistency took: 0.023 seconds
camb took: 48.261 seconds
extrapolate took: 0.161 seconds
fits_nz_des took: 0.001 seconds
correlated_dz_priors took: 0.001 seconds
fits_nz_kids took: 0.001 seconds
photoz_bias_des took: 0.012 seconds
photoz_bias_kids took: 0.010 seconds
fast_pt took: 1.870 seconds
choose_des_ia took: 0.001 seconds
IA took: 0.075 seconds
pk_to_cl_des took: 1.728 seconds
add_intrinsic took: 0.010 seconds
2pt_shear took: 0.337 seconds
shear_m_bias took: 0.006 seconds
2pt_like took: 0.027 seconds
choose_kids_ia took: 0.001 seconds
IA took: 0.094 seconds
pk_to_cl_kids took: 2.257 seconds
add_intrinsic took: 0.014 seconds
camb took: 63.448 seconds
extrapolate took: 0.193 seconds
fits_nz_des took: 0.001 seconds
correlated_dz_priors took: 0.001 seconds
fits_nz_kids took: 0.001 seconds
photoz_bias_des took: 0.010 seconds
photoz_bias_kids took: 0.008 seconds
fast_pt took: 1.453 seconds
choose_des_ia took: 0.001 seconds
IA took: 0.068 seconds
camb took: 66.029 seconds
extrapolate took: 0.164 seconds
fits_nz_des took: 0.001 seconds
correlated_dz_priors took: 0.001 seconds
fits_nz_kids took: 0.001 seconds
photoz_bias_des took: 0.011 seconds
photoz_bias_kids took: 0.010 seconds
pk_to_cl_des took: 1.840 seconds
add_intrinsic took: 0.010 seconds
2pt_shear took: 0.354 seconds
shear_m_bias took: 0.007 seconds
2pt_like took: 0.028 seconds
choose_kids_ia took: 0.001 seconds
IA took: 0.105 seconds
cosebis took: 13.260 seconds
Gathering theory outputs to calculate the likelihood
cosebis stops at bin 5
cosebis_like took: 0.005 seconds
Total pipeline time: 68.2 seconds
consistency took: 0.003 seconds
fast_pt took: 2.182 seconds
choose_des_ia took: 0.001 seconds
IA took: 0.076 seconds
pk_to_cl_kids took: 2.339 seconds
add_intrinsic took: 0.015 seconds
pk_to_cl_des took: 1.991 seconds
add_intrinsic took: 0.011 seconds
2pt_shear took: 0.342 seconds
shear_m_bias took: 0.006 seconds
2pt_like took: 0.036 seconds
choose_kids_ia took: 0.001 seconds
IA took: 0.126 seconds
pk_to_cl_kids took: 2.595 seconds
add_intrinsic took: 0.014 seconds
cosebis took: 6.503 seconds
Gathering theory outputs to calculate the likelihood
cosebis stops at bin 5
cosebis_like took: 0.004 seconds
Total pipeline time: 80.1 seconds
consistency took: 0.003 seconds
cosebis took: 10.280 seconds
Gathering theory outputs to calculate the likelihood
cosebis stops at bin 5
cosebis_like took: 0.002 seconds
Total pipeline time: 80.2 seconds
consistency took: 0.002 seconds
camb took: 32.860 seconds
extrapolate took: 0.155 seconds
fits_nz_des took: 0.001 seconds
correlated_dz_priors took: 0.001 seconds
fits_nz_kids took: 0.001 seconds
photoz_bias_des took: 0.011 seconds
photoz_bias_kids took: 0.010 seconds
fast_pt took: 0.380 seconds
choose_des_ia took: 0.001 seconds
IA took: 0.069 seconds
pk_to_cl_des took: 1.743 seconds
add_intrinsic took: 0.010 seconds
2pt_shear took: 0.333 seconds
shear_m_bias took: 0.006 seconds
2pt_like took: 0.027 seconds
choose_kids_ia took: 0.001 seconds
IA took: 0.095 seconds
pk_to_cl_kids took: 2.281 seconds
add_intrinsic took: 0.014 seconds
cosebis took: 5.059 seconds
Gathering theory outputs to calculate the likelihood
cosebis stops at bin 5
cosebis_like took: 0.003 seconds
Total pipeline time: 43.1 seconds
consistency took: 0.003 seconds
camb took: 55.318 seconds
extrapolate took: 0.250 seconds
fits_nz_des took: 0.001 seconds
correlated_dz_priors took: 0.001 seconds
fits_nz_kids took: 0.001 seconds
photoz_bias_des took: 0.011 seconds
photoz_bias_kids took: 0.010 seconds
fast_pt took: 0.398 seconds
choose_des_ia took: 0.001 seconds
IA took: 0.071 seconds
pk_to_cl_des took: 1.883 seconds
add_intrinsic took: 0.011 seconds
2pt_shear took: 0.457 seconds
shear_m_bias took: 0.010 seconds
2pt_like took: 0.054 seconds
choose_kids_ia took: 0.003 seconds
IA took: 0.142 seconds
pk_to_cl_kids took: 2.490 seconds
add_intrinsic took: 0.010 seconds
cosebis took: 3.220 seconds
Gathering theory outputs to calculate the likelihood
cosebis stops at bin 5
cosebis_like took: 0.003 seconds
Total pipeline time: 64.4 seconds
consistency took: 0.017 seconds
camb took: 67.912 seconds
extrapolate took: 0.162 seconds
fits_nz_des took: 0.001 seconds
correlated_dz_priors took: 0.000 seconds
fits_nz_kids took: 0.001 seconds
photoz_bias_des took: 0.008 seconds
photoz_bias_kids took: 0.007 seconds
fast_pt took: 0.266 seconds
choose_des_ia took: 0.000 seconds
IA took: 0.074 seconds
pk_to_cl_des took: 2.182 seconds
add_intrinsic took: 0.011 seconds
2pt_shear took: 0.352 seconds
shear_m_bias took: 0.006 seconds
2pt_like took: 0.028 seconds
choose_kids_ia took: 0.001 seconds
IA took: 0.094 seconds
pk_to_cl_kids took: 2.949 seconds
add_intrinsic took: 0.009 seconds
camb took: 45.674 seconds
extrapolate took: 0.167 seconds
fits_nz_des took: 0.001 seconds
correlated_dz_priors took: 0.001 seconds
fits_nz_kids took: 0.001 seconds
cosebis took: 2.848 seconds
Gathering theory outputs to calculate the likelihood
cosebis stops at bin 5
cosebis_like took: 0.003 seconds
Total pipeline time: 76.9 seconds
photoz_bias_des took: 0.011 seconds
consistency took: 0.002 seconds
photoz_bias_kids took: 0.010 seconds
fast_pt took: 0.379 seconds
choose_des_ia took: 0.001 seconds
IA took: 0.072 seconds
pk_to_cl_des took: 2.018 seconds
add_intrinsic took: 0.010 seconds
2pt_shear took: 0.336 seconds
shear_m_bias took: 0.006 seconds
2pt_like took: 0.027 seconds
choose_kids_ia took: 0.001 seconds
IA took: 0.092 seconds
pk_to_cl_kids took: 2.645 seconds
add_intrinsic took: 0.014 seconds
cosebis took: 5.417 seconds
Gathering theory outputs to calculate the likelihood
cosebis stops at bin 5
cosebis_like took: 0.003 seconds
Total pipeline time: 56.9 seconds
consistency took: 0.003 seconds
camb took: 58.179 seconds
extrapolate took: 0.220 seconds
fits_nz_des took: 0.001 seconds
correlated_dz_priors took: 0.001 seconds
fits_nz_kids took: 0.001 seconds
photoz_bias_des took: 0.009 seconds
photoz_bias_kids took: 0.008 seconds
fast_pt took: 0.332 seconds
choose_des_ia took: 0.001 seconds
IA took: 0.062 seconds
pk_to_cl_des took: 1.884 seconds
add_intrinsic took: 0.007 seconds
2pt_shear took: 0.296 seconds
shear_m_bias took: 0.004 seconds
2pt_like took: 0.040 seconds
choose_kids_ia took: 0.001 seconds
IA took: 0.125 seconds
pk_to_cl_kids took: 2.611 seconds
add_intrinsic took: 0.009 seconds
camb took: 42.651 seconds
extrapolate took: 0.176 seconds
fits_nz_des took: 0.001 seconds
correlated_dz_priors took: 0.001 seconds
fits_nz_kids took: 0.001 seconds
photoz_bias_des took: 0.012 seconds
photoz_bias_kids took: 0.011 seconds
cosebis took: 2.997 seconds
Gathering theory outputs to calculate the likelihood
cosebis stops at bin 5
cosebis_like took: 0.003 seconds
Total pipeline time: 66.8 seconds
consistency took: 0.002 seconds
fast_pt took: 0.412 seconds
choose_des_ia took: 0.001 seconds
IA took: 0.078 seconds
pk_to_cl_des took: 1.915 seconds
add_intrinsic took: 0.011 seconds
2pt_shear took: 0.344 seconds
shear_m_bias took: 0.006 seconds
2pt_like took: 0.031 seconds
choose_kids_ia took: 0.001 seconds
IA took: 0.092 seconds
pk_to_cl_kids took: 2.548 seconds
add_intrinsic took: 0.014 seconds
camb took: 61.666 seconds
extrapolate took: 0.151 seconds
fits_nz_des took: 0.001 seconds
correlated_dz_priors took: 0.001 seconds
fits_nz_kids took: 0.001 seconds
photoz_bias_des took: 0.009 seconds
photoz_bias_kids took: 0.008 seconds
fast_pt took: 0.329 seconds
choose_des_ia took: 0.001 seconds
IA took: 0.061 seconds
pk_to_cl_des took: 1.877 seconds
add_intrinsic took: 0.006 seconds
2pt_shear took: 0.298 seconds
shear_m_bias took: 0.004 seconds
2pt_like took: 0.048 seconds
choose_kids_ia took: 0.001 seconds
IA took: 0.132 seconds
cosebis took: 5.447 seconds
Gathering theory outputs to calculate the likelihood
cosebis stops at bin 5
cosebis_like took: 0.003 seconds
Total pipeline time: 53.8 seconds
consistency took: 0.003 seconds
pk_to_cl_kids took: 2.670 seconds
add_intrinsic took: 0.009 seconds
cosebis took: 2.998 seconds
Gathering theory outputs to calculate the likelihood
cosebis stops at bin 5
cosebis_like took: 0.003 seconds
Total pipeline time: 70.3 seconds
consistency took: 0.002 seconds
camb took: 53.994 seconds
extrapolate took: 0.221 seconds
fits_nz_des took: 0.001 seconds
correlated_dz_priors took: 0.001 seconds
fits_nz_kids took: 0.001 seconds
photoz_bias_des took: 0.010 seconds
photoz_bias_kids took: 0.009 seconds
fast_pt took: 0.345 seconds
choose_des_ia took: 0.001 seconds
IA took: 0.065 seconds
camb took: 45.454 seconds
extrapolate took: 0.192 seconds
fits_nz_des took: 0.001 seconds
correlated_dz_priors took: 0.001 seconds
fits_nz_kids took: 0.001 seconds
photoz_bias_des took: 0.012 seconds
photoz_bias_kids took: 0.011 seconds
pk_to_cl_des took: 1.926 seconds
add_intrinsic took: 0.006 seconds
fast_pt took: 0.396 seconds
choose_des_ia took: 0.001 seconds
IA took: 0.073 seconds
2pt_shear took: 0.292 seconds
shear_m_bias took: 0.004 seconds
2pt_like took: 0.040 seconds
choose_kids_ia took: 0.001 seconds
IA took: 0.124 seconds
pk_to_cl_des took: 2.212 seconds
add_intrinsic took: 0.010 seconds
2pt_shear took: 0.341 seconds
shear_m_bias took: 0.006 seconds
2pt_like took: 0.027 seconds
choose_kids_ia took: 0.001 seconds
IA took: 0.093 seconds
pk_to_cl_kids took: 2.740 seconds
add_intrinsic took: 0.009 seconds
pk_to_cl_kids took: 2.707 seconds
add_intrinsic took: 0.014 seconds
cosebis took: 3.104 seconds
Gathering theory outputs to calculate the likelihood
cosebis stops at bin 5
cosebis_like took: 0.003 seconds
Total pipeline time: 62.9 seconds
consistency took: 0.002 seconds
cosebis took: 5.432 seconds
Gathering theory outputs to calculate the likelihood
cosebis stops at bin 5
cosebis_like took: 0.004 seconds
Total pipeline time: 57.0 seconds
consistency took: 0.003 seconds
camb took: 54.349 seconds
extrapolate took: 0.162 seconds
fits_nz_des took: 0.001 seconds
correlated_dz_priors took: 0.001 seconds
fits_nz_kids took: 0.001 seconds
photoz_bias_des took: 0.010 seconds
photoz_bias_kids took: 0.010 seconds
fast_pt took: 0.384 seconds
choose_des_ia took: 0.001 seconds
IA took: 0.062 seconds
pk_to_cl_des took: 2.018 seconds
add_intrinsic took: 0.006 seconds
2pt_shear took: 0.289 seconds
shear_m_bias took: 0.004 seconds
2pt_like took: 0.050 seconds
choose_kids_ia took: 0.001 seconds
IA took: 0.122 seconds
pk_to_cl_kids took: 2.476 seconds
add_intrinsic took: 0.009 seconds
cosebis took: 2.838 seconds
Gathering theory outputs to calculate the likelihood
cosebis stops at bin 5
cosebis_like took: 0.003 seconds
Total pipeline time: 62.8 seconds
consistency took: 0.002 seconds
camb took: 43.026 seconds
extrapolate took: 0.162 seconds
fits_nz_des took: 0.001 seconds
correlated_dz_priors took: 0.001 seconds
fits_nz_kids took: 0.001 seconds
photoz_bias_des took: 0.011 seconds
photoz_bias_kids took: 0.011 seconds
fast_pt took: 0.390 seconds
choose_des_ia took: 0.001 seconds
IA took: 0.074 seconds
pk_to_cl_des took: 1.841 seconds
add_intrinsic took: 0.010 seconds
2pt_shear took: 0.343 seconds
shear_m_bias took: 0.006 seconds
2pt_like took: 0.035 seconds
choose_kids_ia took: 0.001 seconds
IA took: 0.137 seconds
pk_to_cl_kids took: 2.413 seconds
add_intrinsic took: 0.014 seconds
cosebis took: 2.818 seconds
Gathering theory outputs to calculate the likelihood
cosebis stops at bin 5
cosebis_like took: 0.003 seconds
Total pipeline time: 51.3 seconds
consistency took: 0.003 seconds
camb took: 61.884 seconds
extrapolate took: 0.203 seconds
fits_nz_des took: 0.001 seconds
correlated_dz_priors took: 0.001 seconds
fits_nz_kids took: 0.001 seconds
photoz_bias_des took: 0.009 seconds
photoz_bias_kids took: 0.008 seconds
fast_pt took: 0.331 seconds
choose_des_ia took: 0.001 seconds
IA took: 0.062 seconds
pk_to_cl_des took: 2.085 seconds
add_intrinsic took: 0.006 seconds
2pt_shear took: 0.300 seconds
shear_m_bias took: 0.004 seconds
2pt_like took: 0.041 seconds
choose_kids_ia took: 0.001 seconds
IA took: 0.109 seconds
pk_to_cl_kids took: 2.552 seconds
add_intrinsic took: 0.009 seconds
cosebis took: 3.391 seconds
Gathering theory outputs to calculate the likelihood
cosebis stops at bin 5
cosebis_like took: 0.003 seconds
Total pipeline time: 71.0 seconds
consistency took: 0.002 seconds
camb took: 59.884 seconds
extrapolate took: 0.191 seconds
fits_nz_des took: 0.001 seconds
correlated_dz_priors took: 0.001 seconds
fits_nz_kids took: 0.001 seconds
photoz_bias_des took: 0.012 seconds
photoz_bias_kids took: 0.010 seconds
fast_pt took: 0.405 seconds
choose_des_ia took: 0.001 seconds
IA took: 0.077 seconds
pk_to_cl_des took: 1.888 seconds
add_intrinsic took: 0.011 seconds
2pt_shear took: 0.357 seconds
shear_m_bias took: 0.006 seconds
2pt_like took: 0.038 seconds
choose_kids_ia took: 0.001 seconds
IA took: 0.134 seconds
pk_to_cl_kids took: 2.813 seconds
add_intrinsic took: 0.009 seconds
cosebis took: 2.988 seconds
Gathering theory outputs to calculate the likelihood
cosebis stops at bin 5
cosebis_like took: 0.003 seconds
Total pipeline time: 68.8 seconds
consistency took: 0.002 seconds
camb took: 43.553 seconds
extrapolate took: 0.166 seconds
fits_nz_des took: 0.001 seconds
correlated_dz_priors took: 0.001 seconds
fits_nz_kids took: 0.001 seconds
photoz_bias_des took: 0.012 seconds
photoz_bias_kids took: 0.010 seconds
fast_pt took: 0.390 seconds
choose_des_ia took: 0.001 seconds
IA took: 0.073 seconds
pk_to_cl_des took: 1.639 seconds
add_intrinsic took: 0.010 seconds
2pt_shear took: 0.340 seconds
shear_m_bias took: 0.006 seconds
2pt_like took: 0.027 seconds
choose_kids_ia took: 0.001 seconds
IA took: 0.093 seconds
pk_to_cl_kids took: 2.347 seconds
add_intrinsic took: 0.014 seconds
cosebis took: 5.496 seconds
Gathering theory outputs to calculate the likelihood
cosebis stops at bin 5
cosebis_like took: 0.003 seconds
Total pipeline time: 54.2 seconds
consistency took: 0.003 seconds

Admittedly, I still don't see anything standing out, except cosebis (but not cosebis_like). I would really appreciate your insight.

And thank you for re-opening this issue!

joezuntz commented 11 months ago

@MinhMPA - thanks, this is very useful, the problem is clearly with CAMB, which is taking about 20 times longer than it should! That usually means that something is wrong with the threading - could you share the script you are using to submit to your HPC system?

joezuntz commented 8 months ago

I'll close this now as it's quiet, but if you have more to look at please get in touch.