Open hjjvandam opened 6 months ago
Hi Hub,
when running that config file, I see the following resource description being used in this line: https://github.com/hjjvandam/DeepDriveMD-pipeline/blob/feature/nwchem/deepdrivemd/deepdrivemd.py#L275
{'access_schema': 'local',
'cpus': 1024,
'gpus': 64,
'project': 'CHM136_crusher',
'queue': 'batch',
'resource': 'ornl.crusher',
'walltime': 180}
so that seems to indicate that indeed 1k cores are being allocated. Unfortunately, the plotting is correct; it is the resource allocation that is faulty.
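For what it's worth, the node count implied by that resource description can be checked with a little arithmetic. This is my own sketch, not RADICAL code; it only assumes the node shape stated in this thread (64 CPU cores and 4 GPUs per Crusher node):

```python
import math

# Resource description as reported above.
resource_desc = {
    'access_schema': 'local',
    'cpus': 1024,
    'gpus': 64,
    'project': 'CHM136_crusher',
    'queue': 'batch',
    'resource': 'ornl.crusher',
    'walltime': 180,
}

# Node shape as stated in the issue: 64 CPU cores and 4 GPUs per node.
CORES_PER_NODE = 64
GPUS_PER_NODE = 4

# Nodes needed to satisfy each part of the request.
nodes_for_cpus = math.ceil(resource_desc['cpus'] / CORES_PER_NODE)
nodes_for_gpus = math.ceil(resource_desc['gpus'] / GPUS_PER_NODE)
nodes = max(nodes_for_cpus, nodes_for_gpus)

reserved_cores = nodes * CORES_PER_NODE
print(nodes, reserved_cores)  # 16 nodes -> 1024 cores reserved
```

So the request itself asks for 16 nodes, i.e. the full 1024 cores that show up in the plots.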
Thanks Andre,
I will have to go and track that down. There are some other weird things going on in that department anyway.
Best wishes,
Huub
Hubertus van Dam, 631-344-6020, Brookhaven National Laboratory
From: Andre Merzky. Date: Friday, December 22, 2023 at 8:35 AM. Subject: Re: [radical-cybertools/radical.analytics] Plots show excessive amounts of resources (Issue #187)
I am running some workflows on Crusher. The stage with the largest number of tasks runs 64 of them, each using 1 CPU core. The performance analysis plots suggest, however, that around 1000 cores were reserved for this workflow. With 64 CPU cores and 4 GPUs per node, you only get this if the node allocation corresponds to 1 GPU per task, i.e. reserving 16 nodes for 64 single-core tasks. I hope that the code isn't actually doing that and that just the plotting is off.
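That hypothesis can be checked with a quick back-of-the-envelope calculation (my own arithmetic, using the node shape stated above): if each of the 64 single-core tasks is sized as if it needed one GPU, the 4-GPUs-per-node limit, not the core count, becomes the binding constraint:

```python
import math

n_tasks = 64          # tasks in the largest stage
cores_per_task = 1    # each task uses a single CPU core
gpus_per_task = 1     # hypothesis: allocation sized as 1 GPU per task

CORES_PER_NODE = 64   # Crusher node shape as stated in the issue
GPUS_PER_NODE = 4

nodes_by_cores = math.ceil(n_tasks * cores_per_task / CORES_PER_NODE)  # 1 node
nodes_by_gpus = math.ceil(n_tasks * gpus_per_task / GPUS_PER_NODE)     # 16 nodes

nodes = max(nodes_by_cores, nodes_by_gpus)
reserved_cores = nodes * CORES_PER_NODE
print(nodes, reserved_cores)  # 16 nodes, 1024 cores
```

By core count alone, one node would suffice; the 1-GPU-per-task sizing forces 16 nodes and hence roughly the 1000 cores seen in the plots.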
The performance data is stored at
I have copied the performance plots into the same directory.
The versions of the RADICAL Cybertools packages are:
The code I am running lives at https://github.com/hjjvandam/DeepDriveMD-pipeline, in branch feature/nwchem. The job I am running is specified in https://github.com/hjjvandam/DeepDriveMD-pipeline/blob/feature/nwchem/test/bba/molecular_dynamics_workflow_nwchem_test/config.yaml. Please let me know if you need any further information.