IanHeywood opened this issue 5 years ago
Log seems difficult to read there, so there's a text file uploaded here.
OK pretty sure it's this problem biting us: https://github.com/mhardcastle/ddf-pipeline/issues/150
@IanHeywood could you do /sbin/sysctl vm.max_map_count
on the node and let me know what it says?
Might also be some unholy interaction with Singularity. @SpheMakh did we have some discussion of how shared memory is handled/restricted within Singularity containers, or was I hallucinating?
Also, df -h /dev/shm
please....
@SpheMakh could you please make a "debugging" container that runs
/sbin/sysctl vm.max_map_count
df -h /dev/shm
before invoking gocubical. That way we can see what the system limits are on the node. Actually might be a good idea to do this routinely, not just for debugging.
Actually @IanHeywood, @SpheMakh tells me you're explicitly launching CubiCal within the container with your own command -- maybe you could run those two commands inside the container yourself?
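For what it's worth, a minimal sketch of the kind of wrapper launch meant here (the trailing gocubical arguments are placeholders for whatever the actual invocation is):

#!/bin/bash
# Record the kernel and shared-memory limits of whichever node we landed on,
# so they end up in the same log as the solver output.
/sbin/sysctl vm.max_map_count
df -h /dev/shm
# Then launch CubiCal as usual.
gocubical "$@"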
I've added those commands to each cubical run by default. It now starts logging like this:
vm.max_map_count = 65530
Filesystem Size Used Avail Use% Mounted on
tmpfs 119G 0 119G 0% /dev/shm
Traceback (most recent call last):
File "/usr/lib/python2.7/logging/__init__.py", line 868, in emit
msg = self.format(record)
File "/usr/lib/python2.7/logging/__init__.py", line 741, in format
return fmt.format(record)
File "/usr/lib/python2.7/logging/__init__.py", line 469, in format
s = self._fmt % record.__dict__
KeyError: 'shortname'
Logged from file driver.py, line 124
Traceback (most recent call last):
File "/usr/lib/python2.7/logging/__init__.py", line 868, in emit
msg = self.format(record)
File "/usr/lib/python2.7/logging/__init__.py", line 741, in format
return fmt.format(record)
File "/usr/lib/python2.7/logging/__init__.py", line 469, in format
s = self._fmt % record.__dict__
KeyError: 'shortname'
Logged from file driver.py, line 124
- 07:49:50 - main | reading defaults from /ceph/pipelines/ianh/MIGHTEE/CDFS/CDFS_2_4/parsets/phasecal.parset
- 07:49:50 - main | using cube_pcal_1561266559_sdp_l0.full_1284.full_pol_wtspec_CDFS_2_4.ms_2019-07-16-23-46-02 as base for output files
Looks to be the same setup for the four nodes I've run jobs on since:
CDFS_2_3/logs/slurm_cubical1_2_3.log:tmpfs 119G 0 119G 0% /dev/shm
CDFS_2_4/logs/slurm_cubical1_2_4.log:tmpfs 119G 0 119G 0% /dev/shm
CDFS_4_3/logs/slurm_cubical1_4_3.log:tmpfs 119G 0 119G 0% /dev/shm
CDFS_4_4/logs/slurm_cubical1_4_4.log:tmpfs 119G 4.0K 119G 1% /dev/shm
ianh@slurm-login:/ceph/pipelines/ianh/MIGHTEE/CDFS$ grep count */logs/*cub*
CDFS_2_3/logs/slurm_cubical1_2_3.log:vm.max_map_count = 65530
CDFS_2_4/logs/slurm_cubical1_2_4.log:vm.max_map_count = 65530
CDFS_4_3/logs/slurm_cubical1_4_3.log:vm.max_map_count = 65530
CDFS_4_4/logs/slurm_cubical1_4_4.log:vm.max_map_count = 65530
although three of the four dropped dead overnight with no error message, and one ended with the following:
- 07:12:33 - solver [x27] [7.8/8.7 9.7/29.2 52.6Gb] Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/cubical/solver.py", line 833, in run_solver
corr_vis = solver_machine.run()
File "/usr/local/lib/python2.7/dist-packages/cubical/solver.py", line 656, in run
SolveOnly.run(self)
File "/usr/local/lib/python2.7/dist-packages/cubical/solver.py", line 644, in run
self.vdm.freq_slice, self.soldict)
File "/usr/local/lib/python2.7/dist-packages/cubical/machines/ifr_gain_machine.py", line 130, in update
DDH[:] = (D * DH).sum(axis=(0,1))
MemoryError
I'm going to dial down the parallelism and resubmit.
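For reference, a hedged sketch of what dialling the parallelism down looks like on the command line; the option names below are from memory and should be checked against gocubical --help for the CubiCal version in the container:

# Fewer workers and smaller chunks mean fewer and smaller shared-memory
# segments alive at once; the values here are illustrative, not recommendations.
gocubical parsets/phasecal.parset --dist-ncpu 4 --data-time-chunk 32 --data-freq-chunk 64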
vm.max_map_count = 65530
Filesystem Size Used Avail Use% Mounted on
tmpfs 119G 0 119G 0% /dev/shm
Yeah that's looking rather tight on both counts. Any chance to get the admins to (a) increase vm.max_map_count to 1000000, and (b) increase the max shm size to 80 or 90% of RAM?
(b) I am working on as part of my efforts to get the IDIA slurm cluster to be DDFacet-friendly.
I will ask about (a).
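For the record, the admin-side changes amount to something like the following (a minimal sketch, run as root; the tmpfs size would also want setting in /etc/fstab to survive reboots):

# Raise the mapping limit now and persist it across reboots.
/sbin/sysctl -w vm.max_map_count=1000000
echo "vm.max_map_count = 1000000" >> /etc/sysctl.conf
# Grow /dev/shm to 90% of RAM without a reboot.
mount -o remount,size=90% /dev/shm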
Is it noteworthy that the original problem in this issue occurred with parallelism disabled? For the tile sizes I had it was using maybe 40 GB of the 230 GB of available RAM.
DDFacet needs a+b really...
Is it noteworthy that the original problem in this issue occurred with parallelism disabled?
I'm not sure.... shared memory use goes up by a factor of 2 in parallel mode (because of the tile read-ahead thingy), so it would probably have been even worse.
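If it helps narrow this down, the number of live mappings a running solver holds can be compared directly against the limit; a small sketch, assuming the process can be found by matching gocubical on the command line:

# Oldest process whose full command line matches gocubical.
pid=$(pgrep -of gocubical)
# One line per mapping; this count must stay below vm.max_map_count.
wc -l < /proc/$pid/maps
/sbin/sysctl vm.max_map_count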
OK, thanks to Mike Currin and others I have a test worker node that has 192 GB of /dev/shm space and vm.max_map_count = 1000000.
DDF tests are ahead of CubiCal in the queue but I'll let you know.
I've also reduced the max tiles in the above runs and will see how those go on the general worker nodes.
Cheers.
The other curious thing about this run is that every tile had exactly 50% convergence.
I'm using the containerised version from @SpheMakh's stimela dockerhub.
Full log below.