Closed sweverett closed 6 years ago
To add to the confusion, I had a complete tile run successfully using `nproc=6`. I now think it may have to do with available resources on the DES machine. Will try to test further.
Now that we are doing larger runs with 10 realizations per tile, it's more important that we get this right.
I've done a few more tests and found some hints. This failure mode has only occurred on Fermi machines so far, but it appears to be a race condition: I ran `galsim` in debug mode (i.e. `-v 3`) and was able to finish far more chip injections before failure, ~50 and sometimes as many as a full band, which is nearly 100 chips.
The extra print statements slow down the internal processing of `galsim` and decrease the chance of an individual race condition error. However, given enough chips, it will eventually still error. It is still unclear to me where exactly this is happening, but it now seems more likely to be a hidden error rather than an intrinsic issue with running on Fermi machines.
It appears that the issue with running on Fermi machines had to do with a clobbering of software versions in `setup_balrog.sh`. We now have a simplified setup that works as intended. The original comment per Brian:
Ok, I've looked at this, and I think the problem is with a too-complex, conflicting product setup in `setup_balrog.sh`. It sets up a lot of products, and then setting up galsim at the end sets up conflicting versions of these. I'm not sure exactly which product is the culprit (I suspect astropy), but in any case I've made an alternate `setup_balrog.sh` in `/data/des61.a/data/yanny/baltest/setup_balrog.sh`. This uses the galsim 1.5.1tmv. With this setup I'm able to run Spencer's test case (above), `galsim -v 1 mp_test.yaml`, many times in a row without error. In fact, all the versions of galsim, including 1.5.0alpha, work with this setup, so long as only one is set up and one exits and reenters the shell cleanly and re-sets-up before each try.
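For anyone who wants to repeat the "many times in a row" check without babysitting the terminal, here is a minimal sketch of a repeat-run harness. `run_repeatedly` is a hypothetical helper written for this thread, not part of Balrog or GalSim; it only assumes that the command exits with a nonzero status when an injection fails.

```python
import subprocess
import sys


def run_repeatedly(cmd, n_runs=20):
    """Run `cmd` n_runs times and return a list of (run_index, returncode)
    for every run that exited with a nonzero status."""
    failures = []
    for i in range(n_runs):
        # Each run is a fresh child process that inherits the current
        # shell environment (i.e. whatever products are currently set up).
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append((i, result.returncode))
            # Show only the tail of stderr so the report stays readable.
            print(f"run {i} failed with code {result.returncode}", file=sys.stderr)
            print(result.stderr[-2000:], file=sys.stderr)
    return failures
```

Calling `run_repeatedly(["galsim", "-v", "1", "mp_test.yaml"])` from the directory holding the test config would reproduce the check described above. Since the child processes inherit the environment, it should be launched from a freshly opened shell with only one setup sourced, otherwise it will just re-test the conflicting setup.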
`balrog_injection.py` has strange behavior when running with multiple processes on Fermi machines. Locally, I can run injection with `nproc : 8` without issues, but on DES machines, running with `nproc` > 2 quickly (after ~6 injections) leads to the following error:
However, it always runs successfully for `nproc` = 2! This is baffling to me, but with 2 processors galaxy injection is fast enough that it hasn't been a pressing issue. Should still look into it soon.