radical-cybertools / radical.analytics

Analytics for RADICAL-Cybertools

Plots show excessive amounts of resources #187

Open hjjvandam opened 6 months ago

hjjvandam commented 6 months ago

I am running some workflows on Crusher. The stage with the largest number of tasks runs 64 of them, each using 1 CPU core. The performance analysis plots suggest, however, that around 1000 cores were reserved for this workflow. With 64 CPU cores and 4 GPUs per node, you only get to that number if the allocation corresponds to 1 GPU per task, i.e. reserving 16 nodes for 64 single-core tasks. I hope the code isn't actually doing that and that it is just the plotting that is off.
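As a quick sanity check of that reading, here is the arithmetic as a plain-Python sketch; the per-node figures are just the ones quoted above:

# Plain arithmetic only, no RADICAL code: restating the numbers above.
tasks          = 64      # tasks in the largest stage
cores_per_task = 1
cores_per_node = 64      # Crusher node, as quoted above
gpus_per_node  = 4

cores_needed   = tasks * cores_per_task          # 64
nodes_implied  = tasks // gpus_per_node          # 16, if 1 GPU per task
cores_reserved = nodes_implied * cores_per_node  # 1024, matching the plots

print(cores_needed, nodes_implied, cores_reserved)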

The performance data is stored at

/lustre/orion/world-shared/chm136/re.session.login2.hjjvd.019706.0000

I have copied the performance plots into the same directory.
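In case it helps to cross-check against the stored profiles rather than the plots, a minimal radical.analytics sketch along the following lines should show what the pilot actually requested. The .description attribute is an assumption and may differ between radical.analytics versions, and for this EnTK-driven run src may need to point at the RADICAL-Pilot session data rather than the re.session path above.

import radical.analytics as ra

# Sketch only, not the exact commands used to produce the plots above.
src     = '/lustre/orion/world-shared/chm136/re.session.login2.hjjvd.019706.0000'
session = ra.Session(src, 'radical.pilot')

# Assumption: pilot entities expose the requested cores/gpus via .description;
# the attribute name may differ between radical.analytics versions.
pilot = session.filter(etype='pilot', inplace=False).get()[0]
print(pilot.description.get('cores'), pilot.description.get('gpus'))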

The versions of the RADICAL Cybertools packages are:

(pydeepdrivemd) [hjjvd@login2.crusher test]$ pip list | grep radical
radical.analytics            1.43.0
radical.entk                 1.43.0
radical.gtod                 1.43.0
radical.pilot                1.43.0
radical.saga                 1.43.0
radical.utils                1.44.0

The code I am running lives at

git@github.com:hjjvandam/DeepDriveMD-pipeline.git

in the feature/nwchem branch. The job I am running is specified in https://github.com/hjjvandam/DeepDriveMD-pipeline/blob/feature/nwchem/test/bba/molecular_dynamics_workflow_nwchem_test/config.yaml. Please let me know if you need any further information.

andre-merzky commented 6 months ago

Hi Hub,

when running that config file, I see the following resource description being used in this line (https://github.com/hjjvandam/DeepDriveMD-pipeline/blob/feature/nwchem/deepdrivemd/deepdrivemd.py#L275):

{'access_schema': 'local',
 'cpus': 1024,
 'gpus': 64,
 'project': 'CHM136_crusher',
 'queue': 'batch',
 'resource': 'ornl.crusher',
 'walltime': 180}

so that seems to indicate that 1k cores are indeed being allocated. Unfortunately the plotting is correct; it is the resource allocation that is faulty.
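For context, that dict has the shape of an EnTK resource description. A minimal sketch of how such a description is normally handed to an AppManager, with values corrected for 64 single-core tasks, might look like the following; the corrected numbers are an assumption, not what the DeepDriveMD code currently does:

from radical.entk import AppManager

# Sketch only: the usual way an EnTK resource description is set, not the
# actual DeepDriveMD code. The corrected 'cpus'/'gpus' values below are
# assumptions based on the 64 single-core tasks described above.
amgr = AppManager()
amgr.resource_desc = {
    'resource'     : 'ornl.crusher',
    'project'      : 'CHM136_crusher',
    'queue'        : 'batch',
    'walltime'     : 180,
    'cpus'         : 64,   # 64 tasks x 1 core each, instead of 1024
    'gpus'         : 4,    # assumption: one node's worth of GPUs
    'access_schema': 'local',
}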

hjjvandam commented 6 months ago

Thanks Andre,

I will have to go and track that down. There are some other weird things going on in that department anyway.

Best wishes,

Huub
