hightower8083 / synchrad

Synchrotron Radiation calculator via openCL
GNU General Public License v3.0

Problem trying to run synchrad across multiple nodes #30

Open delaossa opened 8 months ago

delaossa commented 8 months ago

Hello @hightower8083 and all,

I have been using Synchrad recently to calculate coherent radiation of a beam through an undulator. Thank you for the code!

For my study, it has become clear that I need more macro-particles to reach convergence of the results, but the simulation already takes about 25 hours on a 4 x GPU (A100) node. It'd be great to be able to run across multiple nodes to use more GPUs and save some time. However, my first try failed and I am not sure why.

In the submission script, I simply increased the number of requested nodes and adjusted the number of MPI processes accordingly. This is an example with 2 nodes:

#!/bin/bash -x
#SBATCH --job-name=synchrad
#SBATCH --partition=mpa
#SBATCH --nodes=2
#SBATCH --constraint="GPUx4&A100"
#SBATCH --time=48:00:00
#SBATCH --output=stdout
#SBATCH --error=stderr
#SBATCH --mail-type=END

# Activate environment
source $HOME/synchrad_env.sh
export PYOPENCL_CTX=':'

mpirun -n 8 python undulator_beam.py

The error message is not really helpful to me:

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 140773 RUNNING AT max-mpag009
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Killed (signal 9)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

Am I forgetting anything?

Thank you for your help!

hightower8083 commented 8 months ago

Hi Alberto @delaossa

Thanx for your interest in the code!

Let me make sure I understand the problem: from what you report, it runs with MPI on 4 GPUs -- is that on a local machine, or via the same slurm submission but on 1 node instead of 2? In your input script, are you setting the 'ctx' argument to 'mpi'? And why do you need PYOPENCL_CTX=':'?

Currently the MPI part of synchrad is not very developed and we need to rework it (there is #28, but we haven't finished it yet), so I'm not sure it works out of the box in the multi-node case. It really depends on how SLURM exposes the openCL platform on the cluster -- it could work if all GPUs appear in the same platform, but if each node has a separate platform with 4 devices we need to do some nesting.

Can you provide a bit more detail from the error and output logs?

I'm also a bit curious about your case -- 25 h x 4 x A100 seems big even for a coherent case. Physically, coherent calculations need one macro-particle per electron, which for real beams is typically too many, so for coherent calculations I usually look at features qualitatively and only take as many particles as needed to get low shot noise.
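
To make the shot-noise argument concrete, here is the standard textbook decomposition, written schematically for identical emitters; it is not taken from the synchrad source. For N electrons radiating with phases phi_j, the ensemble-averaged spectrum splits into an incoherent and a coherent part,

$$
\left\langle \frac{d^2 I_N}{d\omega\, d\Omega} \right\rangle
  = \frac{d^2 I_1}{d\omega\, d\Omega}
    \Bigl[\, N \;+\; N(N-1)\,\bigl|f(\omega)\bigr|^2 \Bigr],
\qquad
f(\omega) = \bigl\langle e^{\,i\phi_j} \bigr\rangle ,
$$

so when the N real electrons are represented by far fewer macro-particles, the sampled incoherent term -- the shot-noise floor -- is set by the number of macro-particles rather than by N, which is why the coherent features only converge once enough macro-particles are used.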

delaossa commented 8 months ago

Hi Igor,

Thank you for the fast response! Yes, I run on one node with 4 GPUs with no problems. I thought I needed to set PYOPENCL_CTX=':' to get all the GPUs running and avoid being asked which GPU to select, but now I understand that this is unnecessary if one uses mpirun and sets the 'ctx' argument to 'mpi'.
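
For context, a minimal sketch of what setting this up in the input script might look like. The grid values and key names below are placeholders, not taken from undulator_beam.py, and whether 'ctx' is passed inside the input dictionary or in some other way should be checked against synchrad/calc.py:

import numpy as np
from synchrad.calc import SynchRad

# Placeholder spectral grid: (k, theta, phi) ranges and resolutions.
calc_input = {
    'grid': [(1e3, 2e4),         # photon wavenumber range
             (0.0, 0.03),        # polar angle range [rad]
             (0.0, 2 * np.pi),   # azimuthal angle range [rad]
             (256, 32, 16)],     # (Nk, Ntheta, Nphi)
    'ctx': 'mpi',                # let each MPI rank pick up its own OpenCL device
}

calc = SynchRad(calc_input)
# tracks and dt come from the undulator tracking step (not shown):
# calc.calculate_spectrum(particleTracks=tracks, timeStep=ct.c * dt, ...)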

So, I first tried without PYOPENCL_CTX=':' for the 1-node, 4-GPU case and it works as well as before:

stdout

Running on 4 devices
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB

stderr

mpirun -n 4 python undulator_beam.py
100%|██████████| 250/250 [00:46<00:00,  5.33it/s]
100%|██████████| 250/250 [00:46<00:00,  5.35it/s]

with the 4 GPUs running at >99%.

Then I tried with 2 nodes and 8 GPUs; the previous error is gone, but there is something else:

stdout

Running on 8 devices
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  Starting without device:
  Starting without device:
  Starting without device:
Platform: NVIDIA Corporation
Compiler: OpenCL C 1.2
Separate it_range for each track will be used

stderr

mpirun -n 8 python undulator_beam.py
Traceback (most recent call last):
  File "undulator_beam.py", line 230, in <module>
    main()
  File "undulator_beam.py", line 217, in main
    calc.calculate_spectrum(particleTracks=tracks, timeStep=ct.c * dt,
  File "/gpfs/dust/maxwell/user/delaossa/software/synchrad/synchrad/calc.py", line 176, in calculate_spectrum
    self._init_raditaion(comp, nSnaps)
  File "/gpfs/dust/maxwell/user/delaossa/software/synchrad/synchrad/calc.py", line 478, in _init_raditaion
    self.Data['FormFactor'] = arrcl.to_device( self.queue,
AttributeError: 'SynchRad' object has no attribute 'queue'
Traceback (most recent call last):
  File "undulator_beam.py", line 230, in <module>
    main()
  File "undulator_beam.py", line 217, in main
    calc.calculate_spectrum(particleTracks=tracks, timeStep=ct.c * dt,
  File "/gpfs/dust/maxwell/user/delaossa/software/synchrad/synchrad/calc.py", line 176, in calculate_spectrum
    self._init_raditaion(comp, nSnaps)
  File "/gpfs/dust/maxwell/user/delaossa/software/synchrad/synchrad/calc.py", line 478, in _init_raditaion
    self.Data['FormFactor'] = arrcl.to_device( self.queue,
AttributeError: 'SynchRad' object has no attribute 'queue'
Traceback (most recent call last):
  File "undulator_beam.py", line 230, in <module>
    main()
  File "undulator_beam.py", line 217, in main
    calc.calculate_spectrum(particleTracks=tracks, timeStep=ct.c * dt,
  File "/gpfs/dust/maxwell/user/delaossa/software/synchrad/synchrad/calc.py", line 176, in calculate_spectrum
    self._init_raditaion(comp, nSnaps)
  File "/gpfs/dust/maxwell/user/delaossa/software/synchrad/synchrad/calc.py", line 478, in _init_raditaion
    self.Data['FormFactor'] = arrcl.to_device( self.queue,
AttributeError: 'SynchRad' object has no attribute 'queue'
100%|██████████| 125/125 [01:09<00:00,  1.79it/s]

and only one GPU does the job...

About my particular study: I track 1e6 particles through a 50-period undulator with 64 steps per oscillation.
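
As a back-of-envelope estimate of the data volume these numbers imply (assuming six double-precision coordinate arrays per track, as in the track-file layout described later in this thread):

n_tracks = 1_000_000            # macro-particles
n_steps  = 50 * 64              # 50 undulator periods x 64 steps per oscillation = 3200 samples
n_arrays = 6                    # x, y, z, ux, uy, uz per track
size_gb  = n_tracks * n_steps * n_arrays * 8 / 1e9
print(f"{size_gb:.0f} GB of track data")   # ~154 GB for a single copy of the tracks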

delaossa commented 8 months ago

Hello! I have tried @berceanu's branch https://github.com/hightower8083/synchrad/pull/28 and the situation improves:

stdout

Running on 8 devices
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
Platform: NVIDIA Corporation
Compiler: OpenCL C 1.2 
Separate it_range for each track will be used
Spectrum is saved to spec_data/spectrum_incoh.h5
Separate it_range for each track will be used
Creating context with args: {'answers': [0, 0]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Creating context with args: {'answers': [0, 1]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Creating context with args: {'answers': [0, 5]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Creating context with args: {'answers': [0, 6]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Creating context with args: {'answers': [0, 2]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Creating context with args: {'answers': [0, 3]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Creating context with args: {'answers': [0, 4]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Creating context with args: {'answers': [0, 7]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Spectrum is saved to spec_data/spectrum_coh.h5

no error messages whatsoever in stderr

mpirun -n 8 python undulator_beam.py
100%|██████████| 125/125 [01:09<00:00,  1.81it/s]
100%|██████████| 125/125 [01:08<00:00,  1.82it/s]

but only the GPUs on the first node are used. In stdout I see 8 lines like Creating context with args: {'answers': [0, 7]}, with the first index always 0 and the second going from 0 to 7. I don't know how this matches your expectations, but it seems to me that 8 MPI processes are created, yet they use only the 4 GPUs of the first node.

delaossa commented 8 months ago

Well, well, it's working great now with https://github.com/hightower8083/synchrad/pull/28. I just needed to add -ppn 4 to mpirun so it is clear that there are 4 processes per node.

mpirun -n 8 -ppn 4 python undulator_beam.py

Thanks Angel Ferran Pousa for spotting this detail. And thank you Igor @hightower8083 for the code and the support. I'd love to talk with you about the particular study that I am dealing with.

stdout

Creating context with args: {'answers': [0, 0]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Running on 8 devices
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
  ALL | GPU device: NVIDIA A100-SXM4-40GB
Platform: NVIDIA Corporation
Compiler: OpenCL C 1.2
Separate it_range for each track will be used

hightower8083 commented 8 months ago

Great that you've figured that out, Alberto @delaossa! Maybe the number of processes per node can also be fixed globally in the partition settings, so it will always work correctly. I am not familiar with the flags -btype flattop --hghg -- are these also necessary for it to work correctly? I'd be interested to see if this big calculation works out, and I could share tips on using the code if necessary =) Ping me in the fbpic slack and we can chat there.
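
A per-job alternative to changing the partition settings (a sketch, not tested here) is to fix the task layout in the submission script, so each node always gets one rank per GPU, and to keep -ppn 4 on mpirun in case the MPI launcher does not pick up the Slurm layout on its own:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4    # one MPI rank per GPU on a 4 x A100 node

mpirun -n 8 -ppn 4 python undulator_beam.py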

Andrei @berceanu, we should catch up and discuss completing #28. There are a few things to fix (interactive start and CPU handling), and let's merge it ASAP. Ping me in slack when you have time.

delaossa commented 8 months ago

Thanks! The -btype flattop --hghg flags are arguments for undulator_beam.py. I will delete them from above to avoid confusion.

Thanks for the offer, Igor: I'll try to catch you on slack in the next few days so we can discuss this calculation.

delaossa commented 7 months ago

Hello! I would like to follow up on this issue with an update.

Last time I reported that synchrad ran well across multiple nodes (on the DESY Maxwell cluster) when using the -ppn flag, e.g.

mpirun -n 8 -ppn 4 python undulator_beam.py

However, something that I didn't notice then became apparent when I increased the number of particles: the total memory allocation scales with the number of processes. So, although the processing time is reduced by that factor, the memory allocation increases by the same factor, which makes it easy to run out of memory for a high number of particles. For example, the simulation I was running couldn't go up to 2M particles.
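
A rough estimate of why this bites, building on the earlier ~150 GB figure for 1e6 tracks and assuming (as clarified below) that every rank holds its own full copy of the track list in double precision:

n_tracks       = 2_000_000
list_gb        = n_tracks * (50 * 64) * 6 * 8 / 1e9   # ~307 GB for one copy of the list
ranks_per_node = 4                                    # one rank per GPU
print(f"~{list_gb * ranks_per_node:.0f} GB of RAM per node if nothing is shared between ranks")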

hightower8083 commented 7 months ago

Hi Alberto @delaossa

Thanks for reporting this -- it's indeed unexpected. I assume you mean CPU RAM, not GPU memory? GPU memory consumption should be modest in any case, since tracks are sent one by one, so each card only needs to hold the field grid.

So the first question is: are you loading particles into synchrad via an h5 track file (e.g. created by tracksFromOPMD), or are you passing them as a list?

If it's the file method, that's curious, as it should only read the particles assigned to the local process: https://github.com/hightower8083/synchrad/blob/a128c41c596b67014661ea5fa9ccb0957a354744/synchrad/calc.py#L212-L216

If you are passing it a list, it might be a bit confusing: each process takes its own piece of the list for processing, but the whole list still needs to be allocated in every process. The list-input path is not really designed for MPI scaling, I guess, but it can probably be improved too.

Could you also attach the error message for the case that couldn't run?

Thanx!

delaossa commented 7 months ago

Hi Igor! As you guessed, I pass the tracks to Synchrad as a list. And yes, it is the CPU RAM that goes over the top. Thank you!

hightower8083 commented 7 months ago

OK, in this case I'd suggest making a file and passing it as input via file_tracks=.

The file format is not really documented, but basically it has two main groups, tracks and misc. The tracks group has one group per particle, each holding the standard set of records, i.e. tracks/particle_number/record, where particle_number is an integer and record is one of x, y, z, ux, uy, uz, w, and it_start. The coordinates are 1D arrays, w is a float giving the number of physical electrons, and it_start can be set to 0 if all tracks have the same time sampling.

There are some more parameters in the misc group, and I suggest you check how this is organized in one of the converters synchrad has, e.g. here: https://github.com/hightower8083/synchrad/blob/a128c41c596b67014661ea5fa9ccb0957a354744/synchrad/converters.py#L102-L127

I think you may skip the cdt_array, it_range and propagation_direction keys, as they are currently not used.
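
A minimal h5py sketch of that layout, covering only the tracks group; the keys expected under misc (and the exact dtypes) should be copied from the converter linked above rather than from this example:

import numpy as np
import h5py

n_steps = 3200   # samples per track; must match the actual tracking

with h5py.File('tracks.h5', 'w') as f:
    for ip in range(2):   # toy example with two tracks sharing the same time sampling
        g = f.create_group(f'tracks/{ip}')
        for key in ('x', 'y', 'z', 'ux', 'uy', 'uz'):
            g[key] = np.zeros(n_steps)   # 1D coordinate and momentum arrays
        g['w'] = 1.0                     # number of physical electrons in this macro-particle
        g['it_start'] = 0                # 0 if all tracks share the same time sampling
    # The misc group (total number of tracks, etc.) is omitted here on purpose:
    # see synchrad/converters.py for the key names it should contain.

Such a file could then be passed to the calculator via file_tracks='tracks.h5', as suggested above.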