lofar-astron / factor

Facet calibration for LOFAR
http://www.astron.nl/citt/facet-doc
GNU General Public License v2.0

Running factor on a slurm cluster #203

Closed amisk closed 7 years ago

amisk commented 7 years ago

Hi,

I tried running factor on slurm. After I figured out how to actually use slurm, I have one last problem left. According to the documentation, to use slurm I would need to write cluster_desc_file=SLURM into the [cluster] section of the factor parset. But after running it, it complains that it cannot find the file SLURM. Does this need to be a file? What has to be in it? I did not find such a file packaged with factor.

Thx.

AHorneffer commented 7 years ago

Using slurm with Factor is not implemented yet.

amisk commented 7 years ago

But Hamburg is using factor with slurm, afaik ...

On 23.03.2017 at 10:39, AHorneffer wrote:

> Using slurm with Factor is not implemented yet.


AHorneffer commented 7 years ago

AFAIK they use PBS. At least that's what David had implemented before I started adding code to support slurm, which I stopped because running Factor on JURECA will never be very efficient.

darafferty commented 7 years ago

Yeah, we used PBS in the past but are moving to slurm. Factor does support slurm now: @amisk, the correct parameter is clusterdesc_file, not cluster_desc_file, so that might be the cause of your problem.
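For reference, a minimal `[cluster]` section would then look roughly like this (a sketch based only on the parameter name discussed here; see the Factor documentation for the full set of cluster options):

```ini
[cluster]
# Note: clusterdesc_file, not cluster_desc_file.
# The value SLURM selects slurm support rather than pointing to a file on disk.
clusterdesc_file = SLURM
```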

amisk commented 7 years ago

@darafferty Yes, I had clusterdesc_file = SLURM in the file. Is SLURM support relatively new in factor? Our installation is maybe 6 months old.

darafferty commented 7 years ago

Yes, support for slurm was added in November, so likely your version is indeed too old.

amisk commented 7 years ago

Thank you! I upgraded to 1.1 (and installed shapely) and now it seems to run with slurm.

Since some facets require the solutions from other facets, is there a reasonable limit on how many nodes it can use? And if I set the groupings parameter, will this then run parallel calibration on a single node as well, or will this be sent to the other nodes?

darafferty commented 7 years ago

If you run one direction at a time, then phase shifting, calibration, etc. will be distributed over all the nodes (imaging can only be done on a single node, though). However, the number of such jobs is limited by the number of independent time/frequency chunks that are available, so you may not be able to fully use all the nodes. In Hamburg, we generally use 12-24 cores per direction.

If you run more than one direction at a time (set with the groupings parameter), the available nodes will be divided up between them. So, 8 nodes and 2 directions at a time would mean 4 nodes per direction. To get more than one direction per node, you can set the ndir_per_node parameter. So if you have fat nodes (say 24 cores each), you could run 4 directions at once on two nodes by setting ndir_per_node = 2.
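The node arithmetic above can be sketched with two small helpers (hypothetical, purely illustrative; Factor's actual scheduling is internal to the pipeline):

```python
import math

def nodes_per_direction(num_nodes: int, ndir_at_once: int) -> int:
    """Whole nodes each direction gets when the available nodes
    are divided evenly among simultaneous directions."""
    return num_nodes // ndir_at_once

def nodes_needed(ndir_at_once: int, ndir_per_node: int = 1) -> int:
    """Nodes occupied when up to ndir_per_node directions
    may share a single (fat) node."""
    return math.ceil(ndir_at_once / ndir_per_node)

# 8 nodes, 2 directions at a time -> 4 nodes per direction
print(nodes_per_direction(8, 2))
# 4 directions with ndir_per_node = 2 fit on 2 fat nodes
print(nodes_needed(4, ndir_per_node=2))
```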