sweverett / Balrog-GalSim

Modules for GalSim that will be useful for DES images in conjunction with Balrog.
MIT License

Fix Injection Multiprocessing #6

Closed sweverett closed 6 years ago

sweverett commented 6 years ago

balrog_injection.py behaves strangely when run with multiple processes on Fermi machines. Locally, I can run injection w/ nproc: 8 without issues; but on DES machines, running with nproc>2 quickly (after ~6 injections) leads to the following error:

Traceback (most recent call last):
  File "/cvmfs/des.opensciencegrid.org/eeups/fnaleups/Linux64/galsim/1.5.0alpha/bin/galsim", line 275, in <module>
    main()
  File "/cvmfs/des.opensciencegrid.org/eeups/fnaleups/Linux64/galsim/1.5.0alpha/bin/galsim", line 256, in main
    except_abort=args.except_abort)
  File "/cvmfs/des.opensciencegrid.org/eeups/fnaleups/Linux64/galsim/1.5.0alpha/lib/python2.7/site-packages/galsim/config/process.py", line 780, in Process
    except_abort=except_abort)
  File "/cvmfs/des.opensciencegrid.org/eeups/fnaleups/Linux64/galsim/1.5.0alpha/lib/python2.7/site-packages/galsim/config/output.py", line 144, in BuildFiles
    except_abort = except_abort)
  File "/cvmfs/des.opensciencegrid.org/eeups/fnaleups/Linux64/galsim/1.5.0alpha/lib/python2.7/site-packages/galsim/config/process.py", line 971, in MultiProcess
    result = job_func(**kwargs)
  File "/cvmfs/des.opensciencegrid.org/eeups/fnaleups/Linux64/galsim/1.5.0alpha/lib/python2.7/site-packages/galsim/config/output.py", line 222, in BuildFile
    data = builder.buildImages(output, config, file_num, image_num, obj_num, ignore, logger)
  File "/cvmfs/des.opensciencegrid.org/eeups/fnaleups/Linux64/galsim/1.5.0alpha/lib/python2.7/site-packages/galsim/config/output.py", line 473, in buildImages
    image = galsim.config.BuildImage(base, image_num, obj_num, logger=logger)
  File "/cvmfs/des.opensciencegrid.org/eeups/fnaleups/Linux64/galsim/1.5.0alpha/lib/python2.7/site-packages/galsim/config/image.py", line 237, in BuildImage
    image, current_var = builder.buildImage(cfg_image, config, image_num, obj_num, logger)
  File "./injector.py", line 44, in buildImage
    return super(BalrogImageBuilder, self).buildImage(config, base, image_num, obj_num, logger)
  File "./injector.py", line 14, in buildImage
    im, cv = super(AddOnImageBuilder, self).buildImage(config, base, image_num, obj_num, logger)
  File "/cvmfs/des.opensciencegrid.org/eeups/fnaleups/Linux64/galsim/1.5.0alpha/lib/python2.7/site-packages/galsim/config/image_scattered.py", line 118, in buildImage
    self.nobjects, base, logger=logger, obj_num=obj_num, do_noise=False)
  File "/cvmfs/des.opensciencegrid.org/eeups/fnaleups/Linux64/galsim/1.5.0alpha/lib/python2.7/site-packages/galsim/config/stamp.py", line 106, in BuildStamps
    except_func = except_func)
  File "/cvmfs/des.opensciencegrid.org/eeups/fnaleups/Linux64/galsim/1.5.0alpha/lib/python2.7/site-packages/galsim/config/process.py", line 939, in MultiProcess
    raise res
EOFError

However: it always runs successfully for nproc=2! This is baffling to me, but w/ 2 processors galaxy injection is fast enough that it hasn't been a pressing issue. Should still look into this soon.

sweverett commented 6 years ago

To add to the confusion, I had a complete tile run successfully using nproc=6. I now think it may have to do with available resources on the DES machine. Will try to test further.

sweverett commented 6 years ago

Now that we are doing larger runs with 10 realizations per tile, it's more important that we get this right. I've done a few more tests and found some hints. This failure mode has only occurred on Fermi machines so far, but it appears to be a race condition: running galsim in debug mode (i.e. -v 3) let me finish far more chip injections before failure, ~50, and sometimes as many as a full band, which is nearly 100 chips.

The extra print statements slow down galsim's internal processing and decrease the chance of hitting an individual race condition error. However, given enough chips, it will eventually still error. It is still unclear to me where exactly this is happening, but it now seems more likely to be a hidden error rather than an intrinsic issue with running on Fermi machines.
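The "verbose logging hides the bug" behavior is characteristic of races: anything that perturbs timing changes which interleavings occur. As a minimal illustration (not GalSim's code), here is a classic lost-update race made deterministic with a barrier that forces the bad interleaving; in real code the same interleaving happens only occasionally, and extra I/O like debug printing shifts the odds:

```python
import threading

counter = 0
both_read = threading.Barrier(2)  # force both threads to read before either writes

def bump():
    global counter
    tmp = counter          # read the shared value
    both_read.wait()       # both threads now hold the stale value 0
    counter = tmp + 1      # write: the second write clobbers the first

t1 = threading.Thread(target=bump)
t2 = threading.Thread(target=bump)
t1.start(); t2.start()
t1.join(); t2.join()
print(counter)  # 1, not 2: a lost update
```

The point is only that the failure depends on interleaving, which is why slowing one path with logging can mask it without fixing anything.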

sweverett commented 6 years ago

It appears that the issue with running on Fermi machines was due to a clobbering of software versions in setup_balrog.sh. We now have a simplified setup that works as intended. The original comment from Brian:

Ok, I've looked at this, and I think the problem is a too-complex, conflicting product setup in setup_balrog.sh. It sets up a lot of products, and then setting up galsim at the end sets up conflicting versions of these. I'm not sure exactly which product is the culprit (I suspect astropy), but in any case, I've made an alternate setup_balrog.sh in /data/des61.a/data/yanny/baltest/setup_balrog.sh. This uses galsim 1.5.1tmv. With this setup I'm able to run Spencer's test case (above), galsim -v 1 mp_test.yaml, many times in a row without error. In fact, all the versions of galsim including 1.5.0alpha work with this setup, so long as only one is set up and one exits and reenters the shell cleanly and re-sets-up before each try.
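In the spirit of Brian's fix, the simplified script amounts to setting up a single GalSim product and letting the package manager resolve its dependencies, rather than setting up each dependency explicitly and then having the final galsim setup override them. A hypothetical sketch (the source path and product names here are illustrative, not the actual contents of the new setup_balrog.sh):

```shell
#!/bin/bash
# Hypothetical simplified setup_balrog.sh.
# Initialize the EUPS environment (path is illustrative).
source /path/to/eups/setups.sh

# Set up exactly one GalSim version and let EUPS pull in its own
# dependency versions, instead of setting up astropy, numpy, etc.
# individually and then clobbering them with galsim's setup at the end.
setup galsim 1.5.1tmv
```

The design point is that only one product is set up explicitly, so there is no opportunity for a later setup line to swap in a conflicting dependency version.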