delaossa opened this issue 8 months ago

Hello @hightower8083 and all,

I have been using Synchrad recently to calculate the coherent radiation of a beam through an undulator. Thank you for the code!

For my study, it has become clear that I need more macro-particles to reach convergence of the results, but the simulation already takes about 25 hours on a 4 x GPU (A100) node. It would be great to be able to run across multiple nodes to use more GPUs and save some time. However, I failed on my first try and I am not sure why.

In the submission script, I simply increased the number of requested nodes and adjusted the number of MPI processes to use. This is an example with 2 nodes:

The error message is not really helpful to me:

Am I forgetting anything?

Thank you for your help!
Hi Alberto @delaossa
Thanx for your interest in the code!
Let me better understand the problem. From what you report, it follows that it runs with MPI on 4 GPUs -- is that on a local machine, or via the same SLURM submission but on 1 node instead of 2? In your input script, are you setting the 'ctx' argument to 'mpi'? And why do you need PYOPENCL_CTX=':'?
Currently the MPI part of synchrad is not very developed and we need to rework it (there is #28, but we didn't get it finished yet), and I'm not sure it works out of the box in the multi-node case. It really depends on how SLURM handles the OpenCL platform on the cluster -- it could work if it exposes all GPUs in the same platform, but if each node has a separate platform with 4 devices we need to do some nesting.
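One quick way to check this is to list what OpenCL exposes with pyopencl directly -- just a small diagnostic snippet (not part of synchrad), to be run on each node:

```python
import pyopencl as cl

# Print every OpenCL platform and the GPU devices it exposes on this node,
# to see how the GPUs are grouped under the platforms.
for i, platform in enumerate(cl.get_platforms()):
    print(f"platform {i}: {platform.name}")
    for j, dev in enumerate(platform.get_devices(device_type=cl.device_type.GPU)):
        print(f"  device {j}: {dev.name}")
```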
Can you provide a bit more detail from the error and output logs?
I'm also a bit curious about your case -- 25 h x 4 x A100 seems big even for a coherent case. Physically, for coherent calculations you need one macro-particle per electron, and for real beams this is typically too much, so for coherent calculations I usually look at the features qualitatively and only take as many particles as needed to get low shot noise.
Hi Igor,
Thank you for the fast response!
Yes, I run it on one node with 4 GPUs with no problems.
I thought that I needed to set PYOPENCL_CTX=':' to get all the GPUs running and to avoid being asked which GPU to select. But now I understand that this is unnecessary if one uses mpirun and sets the 'ctx' argument to 'mpi'.
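For reference, this is roughly how the calculator is now set up in my input script -- a minimal sketch where the grid layout and values are placeholders for my actual setup; the relevant part here is only the 'ctx': 'mpi' entry:

```python
import numpy as np
from synchrad.calc import SynchRad

calc_input = {
    # placeholder spectral/angular grid, not my actual values
    'grid': [(1e3, 1e5),        # frequency axis (min, max)
             (0.0, 3e-4),       # polar angle (min, max)
             (0.0, 2 * np.pi),  # azimuthal angle (min, max)
             (256, 32, 32)],    # number of points along each axis
    'ctx': 'mpi',               # one device per MPI rank, no interactive prompt
}
calc = SynchRad(calc_input)
```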
So, I first tried without PYOPENCL_CTX=':' for the 1-node, 4-GPU case, and it works as well as before:
stdout
Running on 4 devices
ALL | GPU device: NVIDIA A100-SXM4-40GB
ALL | GPU device: NVIDIA A100-SXM4-40GB
ALL | GPU device: NVIDIA A100-SXM4-40GB
ALL | GPU device: NVIDIA A100-SXM4-40GB
stderr
mpirun -n 4 python undulator_beam.py
100%|██████████| 250/250 [00:46<00:00, 5.33it/s]
100%|██████████| 250/250 [00:46<00:00, 5.35it/s]
with the 4 GPUs running at >99%.
Then I tried with 2 nodes (8 GPUs), and while the previous error is gone, there is something else:
stdout
Running on 8 devices
ALL | GPU device: NVIDIA A100-SXM4-40GB
ALL | GPU device: NVIDIA A100-SXM4-40GB
ALL | GPU device: NVIDIA A100-SXM4-40GB
ALL | GPU device: NVIDIA A100-SXM4-40GB
ALL | GPU device: NVIDIA A100-SXM4-40GB
Starting without device:
Starting without device:
Starting without device:
Platform: NVIDIA Corporation
Compiler: OpenCL C 1.2
Separate it_range for each track will be used
stderr
mpirun -n 8 python undulator_beam.py
Traceback (most recent call last):
File "undulator_beam.py", line 230, in <module>
main()
File "undulator_beam.py", line 217, in main
calc.calculate_spectrum(particleTracks=tracks, timeStep=ct.c * dt,
File "/gpfs/dust/maxwell/user/delaossa/software/synchrad/synchrad/calc.py", line 176, in calculate_spectrum
self._init_raditaion(comp, nSnaps)
File "/gpfs/dust/maxwell/user/delaossa/software/synchrad/synchrad/calc.py", line 478, in _init_raditaion
self.Data['FormFactor'] = arrcl.to_device( self.queue,
AttributeError: 'SynchRad' object has no attribute 'queue'
Traceback (most recent call last):
File "undulator_beam.py", line 230, in <module>
main()
File "undulator_beam.py", line 217, in main
calc.calculate_spectrum(particleTracks=tracks, timeStep=ct.c * dt,
File "/gpfs/dust/maxwell/user/delaossa/software/synchrad/synchrad/calc.py", line 176, in calculate_spectrum
self._init_raditaion(comp, nSnaps)
File "/gpfs/dust/maxwell/user/delaossa/software/synchrad/synchrad/calc.py", line 478, in _init_raditaion
self.Data['FormFactor'] = arrcl.to_device( self.queue,
AttributeError: 'SynchRad' object has no attribute 'queue'
Traceback (most recent call last):
File "undulator_beam.py", line 230, in <module>
main()
File "undulator_beam.py", line 217, in main
calc.calculate_spectrum(particleTracks=tracks, timeStep=ct.c * dt,
File "/gpfs/dust/maxwell/user/delaossa/software/synchrad/synchrad/calc.py", line 176, in calculate_spectrum
self._init_raditaion(comp, nSnaps)
File "/gpfs/dust/maxwell/user/delaossa/software/synchrad/synchrad/calc.py", line 478, in _init_raditaion
self.Data['FormFactor'] = arrcl.to_device( self.queue,
AttributeError: 'SynchRad' object has no attribute 'queue'
100%|██████████| 125/125 [01:09<00:00, 1.79it/s]
and only one GPU does the job...
About my particular study: I compute the trajectories of 1e6 particles through a 50-period undulator with 64 steps per oscillation.
Hello! I have tried @berceanu's branch https://github.com/hightower8083/synchrad/pull/28 and the situation improves:
stdout
Running on 8 devices
ALL | GPU device: NVIDIA A100-SXM4-40GB
ALL | GPU device: NVIDIA A100-SXM4-40GB
ALL | GPU device: NVIDIA A100-SXM4-40GB
ALL | GPU device: NVIDIA A100-SXM4-40GB
ALL | GPU device: NVIDIA A100-SXM4-40GB
ALL | GPU device: NVIDIA A100-SXM4-40GB
ALL | GPU device: NVIDIA A100-SXM4-40GB
ALL | GPU device: NVIDIA A100-SXM4-40GB
Platform: NVIDIA Corporation
Compiler: OpenCL C 1.2
Separate it_range for each track will be used
Spectrum is saved to spec_data/spectrum_incoh.h5
Separate it_range for each track will be used
Creating context with args: {'answers': [0, 0]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Creating context with args: {'answers': [0, 1]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Creating context with args: {'answers': [0, 5]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Creating context with args: {'answers': [0, 6]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Creating context with args: {'answers': [0, 2]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Creating context with args: {'answers': [0, 3]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Creating context with args: {'answers': [0, 4]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Creating context with args: {'answers': [0, 7]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Spectrum is saved to spec_data/spectrum_coh.h5
No error messages whatsoever in stderr:
mpirun -n 8 python undulator_beam.py
100%|██████████| 125/125 [01:09<00:00, 1.81it/s]
100%|██████████| 125/125 [01:08<00:00, 1.82it/s]
but only the GPUs on the first node are used.
In stdout I see 8 lines like Creating context with args: {'answers': [0, 7]}, with the first index always 0 and the second going from 0 to 7.
I don't know how this matches your expectations, but it seems to me that 8 MPI processes are created but they are using only the 4 GPUs of the first node.
Well, well, it's working great now with this https://github.com/hightower8083/synchrad/pull/28. I just needed to add -ppn 4 to mpirun so that it is explicit that there are 4 processes per node:
mpirun -n 8 -ppn 4 python undulator_beam.py
Thanks to Angel Ferran Pousa for spotting this detail. And thank you, Igor @hightower8083, for the code and the support. I'd love to talk with you about the particular study that I am dealing with.
stdout
Creating context with args: {'answers': [0, 0]}
Context created successfully on device: NVIDIA A100-SXM4-40GB
Running on 8 devices
ALL | GPU device: NVIDIA A100-SXM4-40GB
ALL | GPU device: NVIDIA A100-SXM4-40GB
ALL | GPU device: NVIDIA A100-SXM4-40GB
ALL | GPU device: NVIDIA A100-SXM4-40GB
ALL | GPU device: NVIDIA A100-SXM4-40GB
ALL | GPU device: NVIDIA A100-SXM4-40GB
ALL | GPU device: NVIDIA A100-SXM4-40GB
ALL | GPU device: NVIDIA A100-SXM4-40GB
Platform: NVIDIA Corporation
Compiler: OpenCL C 1.2
Separate it_range for each track will be used
Great that you've figured that out, Alberto @delaossa!
Maybe the number of processes per node can also be fixed globally in the partition settings, so it will always work correctly.
I am not familiar with the flags -btype flattop --hghg -- are these also necessary for it to work correctly?
I'd be interested to see if this big calculation works out, and I could share tips on using the code if necessary =) Ping me in the fbpic Slack and we can chat there.
Andrei @berceanu, we should catch up and discuss the completion of #28. There are a few things to fix (the interactive start and CPU handling), and let's merge it asap. Ping me in Slack when you have time.
Thanks!
The -btype flattop --hghg flags are arguments for undulator_beam.py. I will delete them from above to avoid confusion.
Thanks for the offer, Igor: I'll try to catch you on Slack these days so we can discuss this calculation.
Hello! I would like to follow up on this issue with an update.
Last time I reported that synchrad ran well across multiple nodes (on the DESY Maxwell cluster) when using the -ppn flag, e.g.:
mpirun -n 8 -ppn 4 python undulator_beam.py
However, something that I didn't notice then became apparent when I increased the number of particles: the total memory allocation scales with the number of processes. So, although the processing time is reduced by this factor, the memory allocation increases by the same factor, which makes it easy to run out of memory for a high number of particles. For example, the simulation that I was running couldn't go up to 2M particles.
Hi Alberto @delaossa
Thanks for reporting this -- it's indeed unexpected. I assume you mean CPU RAM, not GPU memory? GPU memory consumption should be modest in any case, as the code sends tracks one by one, so each card only needs to hold the field grid.
So the first question is: are you loading particles into synchrad via an h5 track file (e.g. created by tracksFromOPMD), or are you giving them as a list?
If it's the file method, that's curious, as it should only read the particles assigned to the local process: https://github.com/hightower8083/synchrad/blob/a128c41c596b67014661ea5fa9ccb0957a354744/synchrad/calc.py#L212-L216
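Conceptually, the file method amounts to something like this -- not the actual synchrad code, just a sketch of the idea, and the round-robin assignment of tracks to ranks is my simplification:

```python
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD

# Each rank opens the same track file but only loads its own share of tracks,
# so the CPU RAM per rank stays roughly (total tracks) / (number of ranks).
with h5py.File('tracks.h5', 'r') as f:
    n_total = len(f['tracks'])
    my_tracks = [
        {key: f[f'tracks/{ip}/{key}'][()]
         for key in ('x', 'y', 'z', 'ux', 'uy', 'uz', 'w', 'it_start')}
        for ip in range(comm.rank, n_total, comm.size)
    ]

print(f"rank {comm.rank}: loaded {len(my_tracks)} of {n_total} tracks")
```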
If you are giving it a list, it might be a bit confusing, since each process will take its piece of the list for processing but will still need the whole list allocated. This list-input way is not really made for MPI scaling, I guess, but it can probably be improved too.
Could you also append an error message for the case which couldn't run?
Thanx!
Hi Igor! As you guessed, I pass the tracks as a list to Synchrad. And yes, it is the CPU RAM that is going over the top. Thank you!
OK, in this case I'd suggest making a file and using it as input via file_tracks=.
The file configuration is not really documented, but basically it has two main groups, tracks and misc. tracks has a group for each particle, and each of these contains the standard set of coordinates, i.e. tracks/particle_number/record, where particle_number is an integer and record is one of x, y, z, ux, uy, uz, w and it_start. The coordinates are 1D arrays, w is a float giving the number of physical electrons, and it_start can be set to 0 if the tracks all have the same time sampling.
There are some more parameters in the misc group, and I suggest you check how this is organized in one of the converters synchrad has, e.g. here:
https://github.com/hightower8083/synchrad/blob/a128c41c596b67014661ea5fa9ccb0957a354744/synchrad/converters.py#L102-L127
I think you may skip the cdt_array, it_range and propagation_direction keys, as they are currently not used.
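As an illustration, a minimal sketch of writing such a file with h5py -- the dummy tracks and the misc key name here are my placeholders, so check the converter linked above for the exact layout:

```python
import h5py
import numpy as np

# a couple of dummy tracks just to show the layout; replace with real data
nt = 64
tracks = [
    {'x': np.zeros(nt), 'y': np.zeros(nt), 'z': np.linspace(0, 1, nt),
     'ux': np.zeros(nt), 'uy': np.zeros(nt), 'uz': np.full(nt, 100.0),
     'w': 1.0}
    for _ in range(2)
]

with h5py.File('tracks.h5', 'w') as f:
    for ip, track in enumerate(tracks):
        for key in ('x', 'y', 'z', 'ux', 'uy', 'uz'):
            f[f'tracks/{ip}/{key}'] = track[key]
        f[f'tracks/{ip}/w'] = track['w']    # number of physical electrons
        f[f'tracks/{ip}/it_start'] = 0      # 0 if all tracks share the time sampling
    # the key name under misc is a guess -- see the linked converter
    f['misc/N_particles'] = len(tracks)
```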