Closed: amisk closed this issue 7 years ago
Using slurm with Factor is not implemented yet.
But Hamburg is using factor with slurm, afaik ...
On 23.03.2017 at 10:39, AHorneffer wrote:
Using slurm with Factor is not implemented yet.
AFAIK they use PBS. At least that's what David had implemented before I started adding code to support slurm at one point (which I stopped, because running Factor on JURECA will never be very efficient).
Yeah, we used PBS in the past but are moving to slurm. Factor does support slurm now: @amisk, the correct parameter is `clusterdesc_file`, not `cluster_desc_file`, so that might be the cause of your problem.
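For reference, a minimal `[cluster]` section with the correct spelling might look like the fragment below (this is an illustrative sketch of the parset layout, not taken verbatim from the Factor docs):

```ini
[cluster]
# Note: clusterdesc_file, not cluster_desc_file.
# The special value SLURM tells Factor to query the slurm allocation
# instead of reading a cluster description file from disk.
clusterdesc_file = SLURM
```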
@darafferty Yes, I had `clusterdesc_file = SLURM` in the file. Is SLURM support relatively new in factor? Our installation is maybe 6 months old.
Yes, support for slurm was added in November, so likely your version is indeed too old.
Thank you! I upgraded to 1.1 (and installed shapely) and now it seems to run with slurm.
Since some facets require the solutions from other facets, is there a reasonable number of nodes it can run on? And if I set the groupings parameter, will calibration then also run in parallel on a single node, or will it be sent to the other nodes?
If you run one direction at a time, then phase shifting, calibration, etc. will be distributed over all the nodes (imaging can only be done on a single node, though). However, the number of such jobs is limited by the number of independent time/frequency chunks that are available, so you may not be able to fully use all the nodes. In Hamburg, we generally use 12-24 cores per direction.
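The limit described above is simple to state: parallelism is capped by whichever is smaller, the number of independent time/frequency chunks or the number of processing slots. A minimal sketch (a hypothetical helper for illustration, not Factor's actual code):

```python
def n_parallel_jobs(n_chunks: int, n_slots: int) -> int:
    """Number of jobs that can actually run at once.

    The available slots are only fully used when there are at least
    that many independent time/frequency chunks to distribute.
    """
    return min(n_chunks, n_slots)

# e.g. 10 independent chunks spread over 16 available slots:
# only 10 jobs can run in parallel, 6 slots stay idle.
print(n_parallel_jobs(10, 16))  # 10
```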
If you run more than one direction at a time (set with the `groupings` parameter), the available nodes will be divided up between them. So, 8 nodes and 2 directions at a time would mean 4 nodes per direction. To get more than one direction per node, you can set the `ndir_per_node` parameter. So if you have fat nodes (say 24 cores each), you could run 4 directions at once on two nodes by setting `ndir_per_node = 2`.
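The allocation arithmetic above can be sketched in a couple of lines (illustrative helpers only, not part of Factor):

```python
def nodes_per_direction(n_nodes: int, n_directions: int) -> float:
    """Available nodes are divided evenly among the directions run at once."""
    return n_nodes / n_directions

def max_concurrent_directions(n_nodes: int, ndir_per_node: int = 1) -> int:
    """Each node can host ndir_per_node directions simultaneously."""
    return n_nodes * ndir_per_node

# 8 nodes, 2 directions at a time -> 4 nodes per direction
print(nodes_per_direction(8, 2))          # 4.0
# 2 fat nodes (24 cores each) with ndir_per_node = 2 -> 4 directions at once
print(max_concurrent_directions(2, 2))    # 4
```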
Hi,
I tried running factor on slurm. After I figured out how to actually use slurm, I have one last problem left. According to the documentation, to use slurm I would need to write `cluster_desc_file=SLURM` into the `[cluster]` section of the factor parset. But after running it, it complains that it cannot find the file SLURM. Does this need to be a file? What has to be in it? I did not find such a file packaged with factor.
Thx.