simonsobs / nemo

Millimeter-wave map filtering and Sunyaev-Zel'dovich galaxy cluster/source detection package. Originally developed for the Atacama Cosmology Telescope project.
https://nemo-sz.readthedocs.io
BSD 3-Clause "New" or "Revised" License

nemoModel gives me OverflowError: integer does not fit in 'int' #73

Open Saladino93 opened 4 months ago

Saladino93 commented 4 months ago

Hi all. I am a new user running on Perlmutter.

On running

srun -u -l -n 64 nemoModel "/pscratch/sd/o/omard/FGSIMS_OUT/agora/${nemo_run}/${nemo_run}_optimalCatalog.fits" $mask $beam "/pscratch/sd/o/omard/FGSIMS_OUT/agora/${nemo_run}/nemomodel_${freq}_snr4.fits" --min-snr 4.0 --freq $freq -M -n

(note I added by hand the min-snr argument)

I get

54:   File "mpi4py/MPI/Comm.pyx", line 1406, in mpi4py.MPI.Comm.send
54:   File "mpi4py/MPI/msgpickle.pxi", line 211, in mpi4py.MPI.PyMPI_send
54:   File "mpi4py/MPI/msgpickle.pxi", line 147, in mpi4py.MPI.pickle_dump
54:   File "mpi4py/MPI/msgbuffer.pxi", line 50, in mpi4py.MPI.downcast
54: OverflowError: integer 3566595060 does not fit in 'int'

even though the log shows that the image itself completed on that rank:

54: ... rank 54 image complete (took 1895.205 sec)
54: ... rank = 54 sending sky model image

Any ideas how to debug this? I thought it might be related to my survey mask, but I still keep getting this even after reducing the area.

Thanks in advance.
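
As a quick sanity check on the error message: the traceback ends in mpi4py's `downcast`, which suggests (this is a reading of the traceback, not something confirmed in this thread) that the size of the pickled message is being converted to a C `int`. Assuming a signed 32-bit `int`, the reported value is simply too large to fit. A minimal sketch; `fits_in_c_int` is a hypothetical helper, not part of mpi4py or nemo:

```python
# Assumption: the message size is stored in a signed 32-bit C int.
INT_MIN = -2**31
INT_MAX = 2**31 - 1


def fits_in_c_int(n):
    """Return True if n can be represented as a signed 32-bit integer."""
    return INT_MIN <= n <= INT_MAX


reported = 3566595060  # value quoted in the OverflowError above

print("fits in C int:", fits_in_c_int(reported))
print("exceeds INT_MAX by:", reported - INT_MAX)
```

If that reading is right, the limit is hit by the size of the single object being sent, which would explain why the image completes but the send fails.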

mattyowl commented 4 months ago

Hi - I'm assuming you're running the 'dev' branch? If so, this would probably be due to me trying to save memory, which didn't work out (caused more problems than it solved), and so I fixed this at the weekend. So I think if you just pull from 'dev', this should go away. Please let me know if not.

Saladino93 commented 4 months ago

I installed through pip. Let me see if using the 'dev' branch improves the situation. Thanks.

mattyowl commented 4 months ago

Ok - it's unlikely to be what I said then, but I'm not sure what the issue would be without more info. Maybe you could post the whole traceback?

Saladino93 commented 4 months ago

Indeed, I ran without the memory-saving hack (the one that converts the model to np.float16). I am now running with it and waiting for the results.

This is what I get from my previous pip installation:

19: Traceback (most recent call last):
19:   File "/global/homes/o/omard/.conda/envs/act/bin/nemoModel", line 240, in <module>
19:     comm.send(modelImage, dest = 0)
19:   File "mpi4py/MPI/Comm.pyx", line 1406, in mpi4py.MPI.Comm.send
19:   File "mpi4py/MPI/msgpickle.pxi", line 211, in mpi4py.MPI.PyMPI_send
19:   File "mpi4py/MPI/msgpickle.pxi", line 147, in mpi4py.MPI.pickle_dump
19:   File "mpi4py/MPI/msgbuffer.pxi", line 50, in mpi4py.MPI.downcast
19: OverflowError: integer 3566595060 does not fit in 'int'

(note that I cloned the mpi4py environment of Perlmutter)
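
The full traceback shows the failure in `comm.send(modelImage, dest = 0)`, that is, when a single object is pickled and sent as one message. One generic way around a per-message size limit is to split the data into pieces that each stay under the limit. The sketch below only computes the slice boundaries; `chunk_bounds` is a hypothetical helper, the 4-byte item size is an assumption about the image dtype, and this is not how nemo itself handles the problem:

```python
# Assumption: each message must stay at or below a signed 32-bit limit in bytes.
INT_MAX = 2**31 - 1


def chunk_bounds(n_items, itemsize, max_bytes=INT_MAX):
    """Yield (start, stop) index pairs whose slices are each <= max_bytes."""
    if n_items < 0 or itemsize <= 0:
        raise ValueError("n_items must be >= 0 and itemsize must be > 0")
    per_chunk = max_bytes // itemsize
    if per_chunk == 0:
        raise ValueError("a single item is larger than max_bytes")
    for start in range(0, n_items, per_chunk):
        yield start, min(start + per_chunk, n_items)


# Illustrative numbers only: treat the reported size as 4-byte values.
n_values = 3566595060 // 4
bounds = list(chunk_bounds(n_values, itemsize=4))
print(len(bounds), "chunks")
```

Each `(start, stop)` pair could then be applied to the flattened image, sent as its own message, and reassembled on the receiving rank.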

Saladino93 commented 4 months ago

Ok, I actually managed to get it to run by doing

print("Saving memory by converting to float16 before applying pixel window function...")
modelMap = np.float16(modelMap)  # NOTE: this is a bit of a hack to save memory

The total file size is 3.2 GB. Does this make sense to you?

I am not sure if this is due to some limitation on Perlmutter (doubt it), mpi4py, or something else (perhaps I ran my initial PS search wrongly...).
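
A rough size estimate is consistent with the float16 conversion making the send succeed. The thread does not say what dtype `modelMap` had before the conversion, so the sketch below tries both float64 and float32 as assumptions. It also treats the number in the error as the payload size in bytes and ignores pickle overhead; `float16_bytes` is a hypothetical helper:

```python
# Assumption: the per-message limit is a signed 32-bit byte count.
INT_MAX = 2**31 - 1
reported_bytes = 3566595060  # value quoted in the OverflowError


def float16_bytes(total_bytes, original_itemsize):
    """Estimate the size in bytes after converting every value to float16."""
    n_values = total_bytes // original_itemsize
    return n_values * 2  # float16 uses 2 bytes per value


for name, itemsize in (("float64", 8), ("float32", 4)):
    shrunk = float16_bytes(reported_bytes, itemsize)
    print(name, "->", shrunk, "bytes; under limit:", shrunk <= INT_MAX)
```

Under either assumption the converted array would fall below the limit, which would match the run completing once the conversion is applied.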

mattyowl commented 4 months ago

That's a mystery to me, because I've taken that out as I mentioned above. I don't think I've managed to get the OverflowError you've been getting, running on the sims I've been making or the real data.