ratt-ru / CubiCal

A fast radio interferometric calibration suite.
GNU General Public License v2.0

Shared memory / temp file crash #200

Open IanHeywood opened 6 years ago

IanHeywood commented 6 years ago

Via James Allison, running CubiCal on an IDIA node... I'm guessing this is related to a previous run that didn't finish, but this isn't something I've seen before:

- 07:20:21 - main               [87.6/196.5 94.0/222.6 74.8Gb] Exiting with exception: OSError([Errno 17] File exists: '/dev/shm/cubical.31736/DATA:141848:162111')
 Traceback (most recent call last):
  File "/users/jallison/.local/lib/python2.7/site-packages/cubical/main.py", line 360, in main
    stats_dict = workers.run_process_loop(ms, tile_list, load_model, single_chunk, solver_type, solver_opts, debug_opts)
  File "/users/jallison/.local/lib/python2.7/site-packages/cubical/workers.py", line 207, in run_process_loop
    return _run_single_process_loop(ms, load_model, single_chunk, solver_type, solver_opts, debug_opts)
  File "/users/jallison/.local/lib/python2.7/site-packages/cubical/workers.py", line 329, in _run_single_process_loop
    tile.release()
  File "/users/jallison/.local/lib/python2.7/site-packages/cubical/data_handler/ms_tile.py", line 1107, in release
    data.delete()
  File "/users/jallison/.local/lib/python2.7/site-packages/cubical/tools/shared_dict.py", line 136, in delete
    os.mkdir(self.path)
OSError: [Errno 17] File exists: '/dev/shm/cubical.31736/DATA:141848:162111'

Any suggestions?

Thanks.

drjamesallison commented 6 years ago

Thanks Ian - I had started a clean run with a fresh measurement set, but I guess it's possible that previous failures have somehow carried through to this error.

JSKenyon commented 6 years ago

@o-smirnov will likely have a better intuition for the behaviour of shared_dict.py. From my side, I suspect @IanHeywood is correct. The problem may be zombie CubiCal processes, which are an unfortunate side effect of interrupting CubiCal (via Ctrl-C, for example) in multiprocessing mode. So it may not be the MS at fault. I would check htop for lingering jobs and kill them manually. If the error persists after killing all old CubiCal processes, @drjamesallison, could you please follow up?

Do you think there are other checks we can put in, @o-smirnov, to make the presence of the zombies more obvious to the user?
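One possible shape for such a check, as a minimal sketch only: scan the shared-memory directory for leftover `cubical.<pid>` directories and report whether each PID still exists. The directory naming is inferred from the path in the traceback above; the function names and the `base` parameter are hypothetical and not part of CubiCal. Note that `os.kill(pid, 0)` also succeeds for defunct (zombie) processes, since they are still in the process table.

```python
import errno
import os
import re

# Assumption: directories are named cubical.<pid>, as in the traceback path.
_DIR_PATTERN = re.compile(r"^cubical\.(\d+)$")


def pid_is_alive(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence and permissions
    except OSError as exc:
        # EPERM: the process exists but belongs to another user.
        return exc.errno == errno.EPERM
    return True


def find_shm_dirs(base="/dev/shm"):
    """Return (path, pid, alive) for each cubical.<pid> directory under base."""
    found = []
    for name in sorted(os.listdir(base)):
        match = _DIR_PATTERN.match(name)
        path = os.path.join(base, name)
        if match and os.path.isdir(path):
            pid = int(match.group(1))
            found.append((path, pid, pid_is_alive(pid)))
    return found
```

Entries with `alive` set to `False` would be leftovers from runs that did not clean up; entries with `alive` set to `True` and a PID other than the current one would point to a lingering process worth inspecting in htop.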

o-smirnov commented 6 years ago

Very odd. Doesn't feel like it should be zombie-related (the PID of the current process is in the pathname...), and I can see @drjamesallison was running in single-CPU mode so it's not a race condition either. If it happens again, could you post a full log please?
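For reference, the failure at the bottom of the traceback is a bare `os.mkdir` on a path that already exists. The sketch below shows one way to make a "clear and recreate" step tolerant of that. It assumes that `delete` is meant to leave an empty directory behind, which is inferred only from the traceback and not confirmed against `shared_dict.py`; `reset_dir` is a hypothetical helper, not a CubiCal function. It avoids `exist_ok` so that it also runs on Python 2.7.

```python
import errno
import os
import shutil


def reset_dir(path):
    """Remove any existing contents of path and leave an empty directory."""
    if os.path.isdir(path):
        shutil.rmtree(path)
    try:
        os.mkdir(path)
    except OSError as exc:
        # Tolerate only "already exists as a directory"; re-raise anything else.
        if exc.errno != errno.EEXIST or not os.path.isdir(path):
            raise
```

This would hide the symptom rather than explain it, so it does not answer why the directory was still present in a single-process run.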

drjamesallison commented 6 years ago

Thanks @JSKenyon and @o-smirnov. Indeed I was running in single-CPU mode. We were finding that processes were hanging when performing I/O for some tiles, for which single-CPU mode seemed to provide a quick fix (but that is another story). I was using CubiCal to apply a pre-existing gain table from a low-res (1k chan) MeerKAT MS to a full-spectral-resolution (32k chan) MS, and to produce corrected residuals (-ar).

I'll re-run and post a full log if it happens again

JSKenyon commented 6 years ago

@drjamesallison Please do open an issue regarding the hanging I/O - it is definitely something we would want to fix.

@o-smirnov Ah, I missed that. Not something I have hit before.

drjamesallison commented 6 years ago

@o-smirnov and @JSKenyon I did a clean run using a fresh MS and unfortunately got the same error. Attached is the log, cheers!

@JSKenyon I'll open a separate issue with regard to the hanging I/O when running multiple processes

pcal.log

JSKenyon commented 6 years ago

Ok, @o-smirnov , I am leaving this one to you. Might be related to #201, as it seems that this might be an old version.

drjamesallison commented 6 years ago

thanks guys, I will try and update my version (not sure why it was so out of date) and re-run

JSKenyon commented 6 years ago

Just to confirm, this is inside a Singularity container?

drjamesallison commented 6 years ago

Yes that's correct