radical-cybertools / radical.repex.at

This is the GitHub location for RepEx, developed by the RADICAL team in conjunction with the York Lab.

Problem utilizing GPU nodes on Stampede #72

Open haoyuanchen opened 8 years ago

haoyuanchen commented 8 years ago

Trying to run a TUU simulation using GPU nodes on Stampede:

I. 32 replicas, 1 core per replica, allocate 1 node (16 cores): failed
II. 32 replicas, 1 core per replica, allocate 2 nodes (32 cores): finished
III. 16 replicas, 1 core per replica, allocate 1 node (16 cores): finished
IV. 16 replicas, 16 cores per replica, allocate 1 node (16 cores): failed
V. 8 replicas, 16 cores per replica, allocate 8 nodes (128 cores): finished
VI. 8 replicas, 16 cores per replica, allocate 4 nodes (64 cores): failed

In all the failed simulations, some replicas never had the files they need for startup transferred to the desired locations.

I've checked all the finished MD steps; they were run using the CUDA code. Since one GPU node on Stampede has 16 CPUs but only 1 GPU, in the finished simulations (for example, case III), did the single GPU finish the 16 jobs one at a time? If so, why can't it finish 32 jobs the same way (case I)?
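One pattern worth noting: across the six runs listed above, every run whose total core request (replicas × cores per replica) fit within the allocation finished, and every oversubscribed run failed. A minimal sketch of that arithmetic (all figures taken from the run list above):

```python
# Each run from the list above:
# case -> (replicas, cores_per_replica, allocated_cores, outcome)
runs = {
    "I":   (32, 1, 16, "failed"),
    "II":  (32, 1, 32, "finished"),
    "III": (16, 1, 16, "finished"),
    "IV":  (16, 16, 16, "failed"),
    "V":   (8, 16, 128, "finished"),
    "VI":  (8, 16, 64, "failed"),
}

for case, (replicas, cores_each, allocated, outcome) in runs.items():
    requested = replicas * cores_each
    fits = requested <= allocated
    # Observation: "fits" matches "finished" for every one of the six runs.
    print(f"{case}: requested {requested} cores, "
          f"allocated {allocated}, fits={fits}, outcome={outcome}")
```

This suggests the failures correlate with CPU-core oversubscription rather than with the number of GPUs, though the thread does not confirm the underlying cause.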

antonst commented 8 years ago

Hi Haoyuan,

Are you using the same coordinate file for all replicas? Could you please post your input file for run VI here? Also, please try the latest devel version; I have made some changes that may address this problem.

haoyuanchen commented 8 years ago

Are you using the same coordinate file for all replicas?

Yes.

Could you please post your input file for run VI here?

Input JSON file:

{
  "remd.input": {
    "re_pattern": "S",
    "exchange": "TUU-REMD",
    "number_of_cycles": "2",
    "input_folder": "tuu_remd_inputs",
    "input_file_basename": "ace_ala_nme_remd",
    "amber_input": "ace_ala_nme.mdin",
    "amber_parameters": "ace_ala_nme.parm7",
    "amber_coordinates_folder": "ace_ala_nme_coors_8x8",
    "same_coordinates": "True",
    "us_template": "ace_ala_nme_us.RST",
    "replica_gpu": "True",
    "replica_cores": "16",
    "steps_per_cycle": "1000",
    "download_mdinfo": "False",
    "download_mdout": "False"
  },
  "dim.input": {
    "umbrella_sampling_1": {
      "number_of_replicas": "2",
      "us_start_param": "180",
      "us_end_param": "360",
      "exchange_replica_cores": "1",
      "exchange_replica_mpi": "False"
    },
    "temperature_2": {
      "number_of_replicas": "2",
      "min_temperature": "300",
      "max_temperature": "600",
      "exchange_replica_cores": "1",
      "exchange_replica_mpi": "False"
    },
    "umbrella_sampling_3": {
      "number_of_replicas": "2",
      "us_start_param": "180",
      "us_end_param": "360",
      "exchange_replica_cores": "1",
      "exchange_replica_mpi": "False"
    }
  }
}

stampede.json:

{
  "target": {
    "resource": "xsede.stampede",
    "username": "chen1990",
    "project": "TG-MCB110101",
    "queue": "gpudev",
    "runtime": "30",
    "cleanup": "False",
    "cores": "64"
  }
}
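A quick self-contained check of what these two files imply, reproducing only the relevant keys, and assuming that in multi-dimensional REMD the total replica count is the product of the per-dimension "number_of_replicas" values (2 × 2 × 2 = 8, which matches the 8 replicas of run VI):

```python
import json
from math import prod

# Only the keys relevant to core accounting, copied from the two
# configuration files above and embedded here so the sketch runs standalone.
remd_cfg = json.loads("""
{ "remd.input": { "replica_cores": "16" },
  "dim.input": {
    "umbrella_sampling_1": { "number_of_replicas": "2" },
    "temperature_2":       { "number_of_replicas": "2" },
    "umbrella_sampling_3": { "number_of_replicas": "2" }
  } }
""")
resource_cfg = json.loads('{ "target": { "cores": "64" } }')

# Assumption: total replicas = product of per-dimension counts.
replicas = prod(int(d["number_of_replicas"])
                for d in remd_cfg["dim.input"].values())
cores_per_replica = int(remd_cfg["remd.input"]["replica_cores"])
requested = replicas * cores_per_replica            # 8 * 16 = 128 cores
allocated = int(resource_cfg["target"]["cores"])    # 64 cores requested

print(f"{replicas} replicas need {requested} cores; "
      f"allocation provides {allocated}")
```

Under that assumption, run VI asks for 128 cores of concurrent MD on a 64-core allocation, consistent with the oversubscription pattern in the run list above.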

Also, please try the latest devel version; I have made some changes that may address this problem.

Thanks! Will try.

antonst commented 8 years ago

Is this issue still relevant?

ibethune commented 7 years ago

Guess not... -> backburner