I have a SLURM cluster with 50 nodes, each with 96 CPU cores. I want to run a job that is divided into 192 subtasks, so in theory I should be able to reserve two nodes and run all 192 tasks simultaneously. However, when I specified node41 and node42 in the nodelist, parallel execution across the nodes did not happen: each node executed tasks with the same ranks. Below is my code (a small diagnostic sketch follows it):
import argparse

from datatrove.executor.slurm import SlurmPipelineExecutor

# MINTReader / MINTWriter are our custom pipeline steps; the module name below
# is a placeholder for wherever they are defined in the project.
from mint_steps import MINTReader, MINTWriter

parser = argparse.ArgumentParser(description="Read and Write example")
parser.add_argument("--input_folder", default="**", help="Input folder path")
parser.add_argument("--base_output_folder", default="**", help="Base output folder path")
parser.add_argument('--tasks', default=192, type=int,
                    help='total number of tasks to run the pipeline on (default: 192)')
parser.add_argument('--workers', default=-1, type=int,
                    help='how many tasks to run simultaneously (default is -1 for no limit, i.e. tasks)')
parser.add_argument('--limit', default=-1, type=int,
                    help='Number of files to process')
parser.add_argument('--logging_dir', default="**", type=str,
                    help='Path to the logging directory')
# parser.add_argument('--local_tasks', default=-1, type=int,
#                     help='how many of the total tasks should be run on this node/machine. -1 for all')
# parser.add_argument('--local_rank_offset', default=0, type=int,
#                     help='the rank of the first task to run on this machine.')
parser.add_argument('--job_name', default='**', type=str,
                    help='Name of the job')
parser.add_argument('--condaenv', default='vldata', type=str,
                    help='Name of the conda environment')
parser.add_argument('--slurm_logs_folder', default='**', type=str,
                    help='Path to the slurm logs folder')
parser.add_argument(
    '--nodelist',
    type=str,
    default='node41,node42',
    help='Comma-separated list of nodes (default: node41,node42)'
)
parser.add_argument('--nodes', default=2, type=int,
                    help='Number of nodes to use')
parser.add_argument('--time', default='01:00:00', type=str,
                    help='Time limit for the job')
parser.add_argument(
    '--exclude',
    type=str,
    help='List of nodes to exclude'
)
if __name__ == '__main__':
    args = parser.parse_args()

    sbatch_args = {}
    if args.nodelist:
        sbatch_args["nodelist"] = args.nodelist
    if args.exclude:
        sbatch_args["exclude"] = args.exclude
    if args.nodes:
        sbatch_args["nodes"] = args.nodes

    pipeline = [
        MINTReader(data_folder=args.input_folder, glob_pattern="*.tar", limit=args.limit),
        MINTWriter(output_folder=args.base_output_folder),
    ]

    executor = SlurmPipelineExecutor(
        pipeline=pipeline,
        tasks=args.tasks,
        workers=args.workers,
        logging_dir=args.logging_dir,
        partition='cpu',
        sbatch_args=sbatch_args,
        condaenv=args.condaenv,
        time=args.time,
        job_name=args.job_name,
        slurm_logs_folder=args.slurm_logs_folder,
    )
    print(executor.run())
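For debugging, here is a minimal sketch I can run at the start of each task to log how SLURM identifies it. The environment variables are standard SLURM; my assumption (not verified against the executor's source) is that the task rank is derived from the job-array index, so identical SLURM_ARRAY_TASK_ID values on both nodes would explain the duplicate ranks:

import os

# Sketch only: SLURMD_NODENAME, SLURM_ARRAY_TASK_ID, and SLURM_PROCID are
# standard SLURM environment variables; whether SlurmPipelineExecutor actually
# derives task ranks from SLURM_ARRAY_TASK_ID is my assumption.
def print_slurm_rank_info():
    node = os.environ.get("SLURMD_NODENAME", "unknown")         # node running this task
    array_id = os.environ.get("SLURM_ARRAY_TASK_ID", "unset")   # job-array index
    proc_id = os.environ.get("SLURM_PROCID", "unset")           # rank within an srun step
    print(f"node={node} array_task_id={array_id} procid={proc_id}")

print_slurm_rank_info()

If node41 and node42 both report the same SLURM_ARRAY_TASK_ID, the 192 tasks are evidently being launched as one multi-node job rather than as 192 independent array elements, which would match the duplicate ranks I'm seeing.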
Below is my sbatch script: