
Resource configuration file for the Laplace HPC machine #3177

Closed: victor-malagon closed this issue 3 months ago

victor-malagon commented 4 months ago

Hello,

I am working on FACTS together with @AlexReedy, and I need a resource configuration file for the HPC system I am using (Laplace). We have a project deadline coming up soon by which we'll need FACTS up and running on our HPC, so we would be very grateful if you could consider our request as soon as possible.

Best regards,

Victor

andre-merzky commented 4 months ago

Thanks for opening this ticket. Can you please provide a link to the system documentation for Laplace?

AlexReedy commented 4 months ago

Hey @andre-merzky, thanks for making this a priority for us! We have an end-of-June deadline for this project. As far as the process goes, I can help @victor-malagon with the installation and setup of FACTS once the new config is in place.

shantenujha commented 4 months ago

Hi Alex, do you have a link or any details about Laplace?

victor-malagon commented 4 months ago

Hi @andre-merzky and @shantenujha,

Thanks for taking care of this. This is the only info I have right now; I can contact the HPC manager for more details or a relevant link if necessary.

The Laplace cluster is a High Performance Computing cluster consisting of 3716 Xeon processor cores. The operating system is Linux (Red Hat Enterprise Linux 7.9). The Laplace cluster is organized as follows:

Usually I run experiments on nodes 75-98.

Victor

andre-merzky commented 4 months ago

Thanks for the information, @victor-malagon. If there is no more information available, please allow me to ask a couple of questions:

The fact that the cluster is heterogeneous is difficult for RP to handle, so I suggest we create an individual resource description for each node type. How are the individual node types requested - via specific batch queues?

Thanks, Andre.
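
To make that suggestion concrete: RP resource configurations are JSON files with one entry per resource, and RP also reads user-level config files from `~/.radical/pilot/configs/`. The sketch below is illustrative only - the resource label, endpoints, queue names, and core counts are invented placeholders, the key layout varies between RP releases, and the authoritative version is whatever lands in the eventual pull request. Mirror one of the Slurm-based configs shipped with RP for the exact current schema.

```sh
# Rough sketch of a user-level RP resource config with one entry per
# homogeneous partition. ALL values below are hypothetical placeholders.
mkdir -p "$HOME/.radical/pilot/configs"
cat > "$HOME/.radical/pilot/configs/resource_nioz.json" <<'EOF'
{
    "laplace_short": {
        "description"     : "Laplace, short-job partition (hypothetical)",
        "schemas"         : ["ssh"],
        "ssh"             : {
            "job_manager_endpoint": "slurm+ssh://laplace.nioz.nl/",
            "filesystem_endpoint" : "sftp://laplace.nioz.nl/"
        },
        "resource_manager": "SLURM",
        "default_queue"   : "short",
        "cores_per_node"  : 24
    },
    "laplace_himem": {
        "description"     : "Laplace, high-memory partition (hypothetical)",
        "schemas"         : ["ssh"],
        "ssh"             : {
            "job_manager_endpoint": "slurm+ssh://laplace.nioz.nl/",
            "filesystem_endpoint" : "sftp://laplace.nioz.nl/"
        },
        "resource_manager": "SLURM",
        "default_queue"   : "himem",
        "cores_per_node"  : 48
    }
}
EOF
```

With a file like this, each partition would be addressable as its own resource label (e.g. nioz.laplace_short) in a pilot description.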

victor-malagon commented 4 months ago

Hi Andre. Let me check with the network administrator at my workplace; I have only basic knowledge of HPC systems and am not sure how to answer some of these questions. I'll get back to you as soon as possible.

Thanks, Victor

AlexReedy commented 4 months ago

Hi @shantenujha, unfortunately I do not, but I think @victor-malagon's conversation with the network admin can provide that.

Best, Alex

victor-malagon commented 4 months ago

Hi @andre-merzky,

This is the info I got from the Network Administrator:

> We have several MPI implementations. You choose one with the ‘module load’ command; ‘module list’ will give you a list of available modules.
>
> - The scheduler we use is Slurm 18.08.8.
> - We have several Python versions, selectable via Anaconda.
> - We use OpenMPI 3, but others are available.
> - Home directories are available on the compute nodes (same path).
> - We have several scratch/data directories available.
> - We have several partitions, depending on hardware specs like number of cores and memory size. All nodes in a partition have the same hardware specs.

Let me know if there are further questions.

Victor
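
As an aside for readers gathering the same information on their own cluster: most of the details a resource config needs can be read directly off the system with standard Slurm and environment-modules commands, run on a login node - nothing RP-specific is required.

```sh
module avail                # list available modules (MPI, Python, ...)
sinfo -o "%P %D %c %m %l"   # partition, node count, cores/node, memory (MB), time limit
scontrol show partition     # full partition definitions, incl. the default queue
```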

andre-merzky commented 4 months ago

Sorry for the late response; I was offline last week.

Thanks for the information - alas, the response is a bit too general. Is there a way to get in contact with the admins directly, or to find a cluster user guide online? If not, can you please try to inquire about the following details:

Thanks, Andre.

victor-malagon commented 4 months ago

Hi Andre,

You can contact the network administrator here: jan.derksen@nioz.nl

Let me know if they take a while to respond or if you need more info from me.

Cheers,

Victor

andre-merzky commented 4 months ago

Thanks @victor-malagon, will do.

andre-merzky commented 3 months ago

@victor-malagon : please have a look at this pull request: https://github.com/radical-cybertools/radical.pilot/pull/3194. Would you mind giving it a try? The only missing information is about scratch space - can you find out if $SCRATCH is set? Or is there another way to determine the location of the user's scratch space?
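
For the scratch question, a quick check on a Laplace login node should settle it; these are generic shell probes, nothing RP-specific:

```sh
echo "SCRATCH=${SCRATCH:-<not set>}"   # is $SCRATCH defined at all?
env | grep -i scratch                  # any other scratch-related variables?
df -h | grep -i scratch                # scratch file systems mounted somewhere?
```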

andre-merzky commented 3 months ago

@victor-malagon : Please let us know if we can help with testing.
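
For anyone picking up the testing: a minimal smoke test is to install RP from the PR branch and run one of the shipped examples against the new resource label. The label below (nioz.laplace) is a guess - substitute whatever label the pull request actually defines.

```sh
git clone https://github.com/radical-cybertools/radical.pilot.git
cd radical.pilot
git fetch origin pull/3194/head:laplace-config   # fetch the PR branch
git checkout laplace-config
pip install .
python examples/00_getting_started.py nioz.laplace   # hypothetical label
```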

andre-merzky commented 3 months ago

Closing until user feedback.

victor-malagon commented 2 months ago

Hi @andre-merzky, sorry for the silence - we were busy with deadlines here. I'll give it a try with @AlexReedy once we're all back in the office and let you know how it goes.

andre-merzky commented 2 months ago

Not a problem - we all know how that works :-) Please re-open the issue once you get a chance to look at it.