radical-cybertools / radical.repex.at

This is the GitHub location for RepEx, developed by the RADICAL team in conjunction with the York Lab.

1728 replica run gets killed due to inactivity on SuperMIC #88

Open antonst opened 7 years ago

SrinivasMushnoori commented 7 years ago

Replicated for 1000 cores as well.

Email notification:

E133 - Low load
The average load per node is low: 0.01
The average load should be: 20
The reverified average load per node: 0.01
The average memory usage per node: 287mb, 0%
Try to use CPU and memory resources wisely.

E135 - Low CPU percent
The average CPU percent per node is low: 0%
The average CPU percent should be: 2000%
Average load per node: 0.01
The average memory usage per node: 287mb, 0%
Try to use CPU and memory resources wisely.

Node statistics::
Number of nodes: 50
Number of cores: 1000
Total physical memory per node: 64364mb
Average memory usage per node: 287mb, 0%
Average memory usage per core: 14mb
Average virtual memory usage per node: 10950mb
Average virtual memory usage per core: 547mb
Average CPU percent per node: 0%
Average CPU percent per core: 0%
Average load per node: 0.01
Reverified average load per node: 0.01
Effective maximum load on a node: 0.18

PBS_job=300081.smic3 user=smushnoo allocation=TG-MCB090174 queue=workq total_load=0.86 cpu_hours=0.25 wall_hours=1.65 unused_nodes=0 total_nodes=50 ppn=20 avg_load=0.01 avg_cpu=0% avg_mem=287mb avg_vmem=10950mb top_proc=smushnoo:radical:smic170:873M:66M:0.2hr:13% toppm=smushnoo:radical.pilot:smic170:891M:66M node_processes=80 avg_avail_mem=61463mb min_avail_mem=57696mb reverified_avg_load=0.01
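
To put the summary line above in perspective, here is a minimal sketch that parses a few of its key=value fields and estimates how much of the reserved allocation was used. The helper name `parse_summary` is hypothetical, and the calculation assumes that `cpu_hours` is the total CPU time summed over all cores, which the notification does not state.

```python
# Hypothetical helper: split a PBS-style summary line into a dict of strings.
def parse_summary(line):
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # keep only well-formed key=value tokens
            fields[key] = value
    return fields


# A subset of the fields from the notification above.
summary = (
    "PBS_job=300081.smic3 queue=workq total_load=0.86 cpu_hours=0.25 "
    "wall_hours=1.65 total_nodes=50 ppn=20 avg_load=0.01"
)

fields = parse_summary(summary)

# Cores reserved: nodes times processes per node.
cores = int(fields["total_nodes"]) * int(fields["ppn"])

# Core-hours reserved over the wall time of the job.
core_hours_reserved = cores * float(fields["wall_hours"])

# Assumption: cpu_hours is total CPU time across all cores.
utilization = float(fields["cpu_hours"]) / core_hours_reserved

print(f"cores reserved:      {cores}")
print(f"core-hours reserved: {core_hours_reserved:.0f}")
print(f"utilization:         {utilization:.4%}")
```

Under that assumption, the job used a tiny fraction of what it reserved, which is consistent with the E133 and E135 warnings.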

SrinivasMushnoori commented 7 years ago

I will investigate whether this happens only to jobs that use 8 cores for data staging, or also to those that use just 1 core.