prism-em / prismatic

C++/CUDA package for parallelized simulation of image formation in Scanning Transmission Electron Microscopy (STEM) using the PRISM and multislice algorithms
GNU General Public License v3.0

Maximizing GPU utilization #84

Open thomasaarholt opened 4 years ago

thomasaarholt commented 4 years ago

How do users monitor GPU utilization, and which parameters can be changed to maximize it? I am lucky enough to run on a system with 4 RTX 2080 Tis, but I don't know whether they are being fully utilized. In fact, nvidia-smi, the tool that reports on GPU state, suggests under "GPU-Util" that the GPUs are not very busy at all.

Here's a snapshot during a small simulation that takes about 5 minutes:

Thu Aug 27 18:13:28 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.152.00   Driver Version: 418.152.00   CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:18:00.0 Off |                  N/A |
| 31%   42C    P2    63W / 250W |    231MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:3B:00.0 Off |                  N/A |
| 31%   42C    P2    53W / 250W |    231MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:86:00.0 Off |                  N/A |
| 30%   41C    P2    71W / 250W |    231MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:AF:00.0 Off |                  N/A |
| 27%   37C    P2    55W / 250W |    231MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     66736      C   /home/thomasaar/.conda/envs/py3/bin/python   221MiB |
|    1     66736      C   /home/thomasaar/.conda/envs/py3/bin/python   221MiB |
|    2     66736      C   /home/thomasaar/.conda/envs/py3/bin/python   221MiB |
|    3     66736      C   /home/thomasaar/.conda/envs/py3/bin/python   221MiB |
+-----------------------------------------------------------------------------+

Is there anything I should be thinking about to improve performance?

matkraj commented 4 years ago

I am not sure about recent updates, but in the past I have always improved performance massively by switching CPU processing off completely. I was also only able to do this when running the console version.

lerandc commented 4 years ago

There are a lot of ways to tune GPU utilization, but it depends a little on what exactly you are running. I typically monitor nvidia-smi in the same way you did (often with something like watch "free -m && nvidia-smi" so I can track both GPU and CPU RAM usage in real time).

In general, we have the following control options to tune performance with Prismatic (all options can be seen with prismatic --help in the CLI version):

--num-threads (-j) value : number of CPU threads to use
--num-streams (-S) value : number of CUDA streams to create per GPU
--num-gpus (-g) value : number of GPUs to use. A runtime check determines how many are actually available, and the minimum of the two numbers is used.
--batch-size (-b) value : number of probes/beams to propagate simultaneously for both CPU and GPU workers.
--batch-size-cpu (-bc) value : number of probes/beams to propagate simultaneously for CPU workers.
--batch-size-gpu (-bg) value : number of probes/beams to propagate simultaneously for GPU workers.
--also-do-cpu-work (-C) bool : boolean value determining whether or not to also create CPU workers in addition to GPU ones.
--streaming-mode 0/1 : boolean value to force code to use (true) or not use (false) streaming versions of GPU codes. The default behavior is to estimate the needed memory from input parameters and choose automatically.

If you have enough memory on each GPU to hold the entire S-matrix (for PRISM) or the potential array (for multislice), the first option I would change is to ensure that --streaming-mode is set to false (0), which puts Prismatic in single-transfer mode, meaning that all the data needed for the GPU calculation is transferred to the device at once. This is probably already happening in your case, since you mentioned that your simulation is small.

The next thing I would recommend is to set -C to false as well; as @matkraj mentioned, the CPU workers can be very slow and aren't really worth it.

The last settings to change are --num-gpus, --num-streams, and --batch-size (in order of significance). If Prismatic sees all the devices properly, it should split the load between the GPUs relatively evenly; this is controlled with --num-gpus. Parallelizing across devices works better than parallelizing within a single device, because a CUDA compute GPU either needs to establish virtualized contexts to run concurrent kernels "perfectly" in parallel, or it needs to switch contexts internally and evaluate kernel calls serially.

--num-streams controls the parallelization within a single device (the GPU equivalent of --num-threads for the CPU) by creating new workers. This costs extra GPU memory (scaling linearly with the number of streams), and in my experience it stops helping after about 4-5 streams (probably because of the context-switching issues, but also because memory copies between host and device are slow compared to some of the compute steps).

Batch size (for both CPU and GPU) controls how many probes a worker thread or stream evaluates before asking for more memory to be transferred from the host. If you have really fast propagation this could be limiting, but in the past I have mostly adjusted it to make simulations fit better in GPU RAM.
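Putting those recommendations together, here is a minimal sketch (Python driving the CLI via subprocess) of a GPU-only, single-transfer run; the input file and the specific values are placeholders rather than recommendations:

```python
# Hypothetical tuning run: GPU-only workers, single-transfer mode, 3 streams
# per GPU, a GPU batch size of 16. Values are placeholders; check
# `prismatic --help` for the authoritative flag list.
import subprocess

cmd = [
    "prismatic",
    "-i", "n10_Prismatic_RT.xyz",   # input cell (placeholder)
    "-g", "4",                      # use all four GPUs
    "-S", "3",                      # CUDA streams per GPU
    "-bg", "16",                    # probes per GPU batch
    "-C", "false",                  # no CPU workers
    "--streaming-mode", "0",        # single-transfer mode
]
subprocess.run(cmd, check=True)
```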

All of this discussion is within the context of a single frozen phonon. If you need to run a ton of frozen phonons and have multiple GPU devices, it is probably also faster to run each simulation on a single device and then average the results in post. This is most easily achieved by setting the number/ID of visible CUDA devices (done through environment variables); when using pyprismatic, the easiest workaround is probably to use a parallel library like joblib and create a wrapper function which spawns the simulation processes under different environments, as sketched below.
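A rough sketch of that workaround, dispatching one independent prismatic process per GPU via CUDA_VISIBLE_DEVICES (the -o output flag and the filenames are assumptions for illustration only):

```python
# Hypothetical per-GPU dispatch of independent prismatic runs, with the
# results averaged afterwards in post-processing. The "-o" output flag is
# an assumption; check `prismatic --help` for the real option names.
import os
import subprocess
from joblib import Parallel, delayed

def run_on_gpu(gpu_id, input_file):
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)       # restrict this process to one device
    out = f"output_gpu{gpu_id}.h5"
    subprocess.run(
        ["prismatic", "-i", input_file, "-g", "1", "-C", "false", "-o", out],
        env=env, check=True,
    )
    return out

outputs = Parallel(n_jobs=4)(
    delayed(run_on_gpu)(gpu, "n10_Prismatic_RT.xyz") for gpu in range(4)
)
# ...then average the per-device results in `outputs`.
```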

lerandc commented 4 years ago

That's a big dump of info; I hope it helps clarify the options involved in configuring the simulation with respect to hardware optimization, though most of this is automatically configured to generally decent settings at the start of the simulation anyway.

ericpre commented 4 years ago

@thomasaarholt, your snapshot shows that all GPUs are being used at 30%; is that the case most of the time, or just at this specific moment? If the former, this sounds fairly surprising, but without any details on the simulation parameters it is impossible to figure out what the cause could be... I usually do a quick convergence series over the parameters mentioned by @lerandc with a few probes in the case of multislice (so it doesn't take long), and you can get a significant improvement over the automatically determined parameters. As already clearly explained by @lerandc, this usually comes down to saturating the GPU memory and avoiding streaming.

I haven't run many simulations saving the 4D output, but I noticed that those simulations were significantly slower (5-10x), and a quick profiling showed that the GPUs were idle most of the time, which is not the case at all without the 4D output... @lerandc, could it be that there is a lock somewhere which is not released when writing the 4D output and would therefore hold up the computation? I should have a minimal example reproducing this issue; I can dig it out of my data if this is useful.

thomasaarholt commented 4 years ago

@thomasaarholt, your snapshot shows that all GPUs are being used at 30%; is that the case most of the time, or just at this specific moment? If the former, this sounds fairly surprising, but without any details on the simulation parameters it is impossible to figure out what the cause could be...

I believe the 30% is referring to the GPU fan speed...? The utilization is reading as 0%. There's a decent chance this is related to the 4D slowdown that you observe. I was including 4D output in this calculation.

ericpre commented 4 years ago

I believe the 30% is referring to the GPU fan speed...? The utilization is reading as 0%.

Indeed! 🤦‍♂️, I guess I have been spoiled by using https://github.com/wookayin/gpustat 😅

There's a decent chance this is related to the 4D slowdown that you observe. I was including 4D output in this calculation.

That should be very quick to check and confirm.

thomasaarholt commented 4 years ago

Confirmed 4D slowdown. Timing in a .sh script with time:

prismatic -i n10_Prismatic_RT.xyz -r 0.05 takes 15 seconds for full execution.
prismatic -i n10_Prismatic_RT.xyz -r 0.05 -4D true is much slower.

The slowdown is during the Computing Probe Position #.../145161 part. I stopped it after it had completed 30% of the probe positions, which took 8 min; that translates to the 3D-only run being roughly 100x faster than the 4D one.

Model: n10_Prismatic_RT.zip

This slowdown also reduces my GPU utilization to 0. It explains the low utilization observed in my first post, and reduces my confusion significantly. Thanks @ericpre for introducing me to gpustat! gpustat -cp --watch is way nicer than nvidia-smi with watch.

lerandc commented 4 years ago

Ah yes, the 4D write is definitely extremely slow, for a couple of reasons. Currently, Prismatic never holds the full 4D output in memory at any single point in time, mostly because the 4D arrays can be huge, of course! If we were to run a simulation with as many probes as you have there, with a final output resolution of 256x256, the array would occupy almost 40 gigabytes of RAM, probably much more than everything else needed in the simulation!
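For reference, the back-of-the-envelope arithmetic behind that number (the float32 dtype is my assumption):

```python
# 145161 probe positions (a 381 x 381 scan, as in the run above) with a
# 256 x 256 output per probe, assuming 4-byte float32 values.
n_probes = 381 * 381            # = 145161
detector_pixels = 256 * 256
total_bytes = n_probes * detector_pixels * 4
print(f"{total_bytes / 1e9:.1f} GB")   # ~38 GB
```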

Instead, what we do is save the 4D output for each probe as soon as the propagation finishes. Back before v1.2, this was just a massive dump of .mrc files for each frozen phonon, which, while extremely cumbersome (especially considering that FP averaging would be a post-processing step), is pretty fast. When we moved to HDF5 we kept the same idea of saving at the end of each calculation, but now there is a shared output resource that all calculation threads must access, so the write process for a 4D simulation looks like this (as of v1.2.1; slightly different in my development branch):

Per probe (handled by a worker thread):
1) calculate the output
2) integrate it into the standard non-4D outputs
3) create a new array and copy a cropped (and FFT-shifted) 4D output into it (cropped to at least the antialiasing aperture to prevent 4x unnecessary file bloat)
4) read in the current state of the (Rx, Ry) CBED pattern (HDF5 files read out zeros if uninitialized)
5) restride the CBED from (Qy, Qx) to (Qx, Qy)
6) average the restrided CBED against the current state
7) write the result of step 6 to file

Steps 3 to 7 are all accomplished with CPU resources (either a worker thread or the host thread of a GPU stream), and steps 4 to 7 are mutex-locked such that only a single worker thread can access the HDF5 file at once. That is to say, only a single CBED is ever written to disk at a time, and this happens serially across all worker threads.
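To make the serialization concrete, here is a rough Python/h5py analogue of that pattern (the real implementation is C++ against the HDF5 C API; the dataset name, shapes, and averaging weight here are assumptions, purely for illustration):

```python
# Illustrative only: one shared lock serializes steps 4-7 across all workers.
import threading
import h5py

output_lock = threading.Lock()   # analogue of the mutex shared by all workers

def write_probe_cbed(h5_path, rx, ry, cbed_qx_qy, weight):
    """cbed_qx_qy is assumed already cropped, FFT-shifted and restrided (steps 3-5);
    the frozen-phonon averaging weight is also an assumption."""
    with output_lock:                                    # steps 4-7 run one thread at a time
        with h5py.File(h5_path, "a") as f:               # per-probe open/close overhead
            dset = f.require_dataset("CBED", shape=(381, 381, 128, 128), dtype="f4")
            current = dset[rx, ry, :, :]                 # step 4: read current state (zeros if new)
            dset[rx, ry, :, :] = current + weight * cbed_qx_qy   # steps 6-7: average, write back
```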

Steps 2, 3, 5, and 6 are fast in all scenarios. Steps 4 and 7 are not necessarily fast; they require some time, especially if the output array is large (and especially because the dataset access is created anew each time). Step 1 is slow in the grand scheme of things for high-resolution, highly accurate simulations with large cells, so in the end we hope to have a time ranking like this:

1 >> 4, 7 >> 2, 3, 5, 6

The desired behavior is that when the threads finish their calculations, you get some sort of race for the HDF5 resource, assuming they all get there at about the same time. Each thread must wait its turn, so the threads end up offset from each other in real time with respect to calculation progress, and the next time they try to output data they no longer have to wait to access the IO resources.

However, in a case like your experiment, where it seems we have the following order (based on the impressive 15 s run time for the 3D case):

4, 7 >> 1 >> 2, 3, 5, 6

this offsetting scheme is ruined and we get a huge worker-thread traffic jam, like you saw.

I think this could probably be "easily" solved with parallel HDF5; after all, the probes are inherently thread-safe operations since they operate on separate parts of real space, so the file IO could be truly parallel. I haven't been able to invest time into figuring it out, though. The documentation for it is also hard to parse, and there is not much (recent) discussion of implementing such features on either the HDF5 forums or forums like Stack Overflow. I'm pretty sure h5py supports parallel IO, so it might be worth investigating their implementation at some point.
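For what it's worth, the h5py side of parallel HDF5 looks roughly like this (it requires h5py built against parallel HDF5 plus mpi4py; this is just to illustrate the idea, not anything Prismatic currently does, and the shapes and filename are placeholders):

```python
# Each MPI rank writes a disjoint block of probe positions; no lock is needed
# because parallel HDF5 coordinates the access.
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
with h5py.File("cbed_parallel.h5", "w", driver="mpio", comm=comm) as f:
    # dataset creation is collective: every rank executes this line
    dset = f.create_dataset("CBED", shape=(381, 381, 128, 128), dtype="f4")
    rows_per_rank = 381 // comm.size
    start = comm.rank * rows_per_rank
    dset[start:start + rows_per_rank] = 0.0   # placeholder for this rank's probes
```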

ericpre commented 4 years ago

Thanks @lerandc for the information. Even if the dataset is very large, it sounds surprising that the bottleneck is reading/writing the dataset, because in @thomasaarholt's case 8 min sounds far too long! This could be checked easily by cropping the 4D dataset at 0.1 mrad, for example, so it is still simulating the 4D dataset but its size is very small. @thomasaarholt, can you check this?

thomasaarholt commented 4 years ago

Yep (take a look at the calls below to make sure I interpreted correctly what you asked me to do). The following took 1 min 6 sec and 1 min 5 sec, respectively. In the second one I crop the output to 0.1 mrad with -4DA 0.1 and -4DC true.

prismatic -i cells/n10_Prismatic_RT.xyz -r 0.1 -4D true
prismatic -i cells/n10_Prismatic_RT.xyz -r 0.1 -4D true -4DC true -4DA 0.1

lerandc commented 4 years ago

I haven't done a deep profile of how much time is spent where, but I suspect that in this case it's not so dependent on the size of the array but rather on the overhead of opening, accessing, and selecting components of the HDF5 file stream. The steps involved in the 1.2.1 implementation include:

Acquire the write lock, then:
1) open the group where the dataset resides
2) open the dataset
3) grab the memory space and file space
4) select the relevant region of the memory space's hyperslab
5) perform the IO operations
6) close the file space and memory space
7) flush and close the dataset
8) flush and close the data group
9) flush the output file

Finally, release the write lock.

To be honest, I don't remember why all of these steps are necessary; I forget the testing that went on when I implemented this last year. Edit: or even whether all of them are necessary.

The results @thomasaarholt just posted seem to support that it is overhead-limited in this scenario.

lerandc commented 4 years ago

The model above is mostly what my dev branch uses too, but it could perhaps be improved a good bit by keeping the dataset and memory spaces themselves as persistent objects in the parameter class, instead of just the file stream. This is formally "less safe", but probably safe enough for the calculation conditions we have.
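In h5py terms the difference would look roughly like this (a sketch only; the real code is C++, and the file name, dataset name, and shapes are assumptions):

```python
# Keep the file and dataset handles alive for the whole run instead of
# re-opening group/dataset/spaces on every probe.
import h5py

# Opened once at the start of the simulation and kept (e.g. on the parameter object):
f = h5py.File("output_4D_persistent.h5", "a")
cbed_dset = f.require_dataset("CBED", shape=(381, 381, 128, 128), dtype="f4")

def write_cbed(rx, ry, cbed):
    # Per-probe cost is just the hyperslab write; no reopen, no per-probe flush.
    cbed_dset[rx, ry, :, :] = cbed

# ...and only at the very end of the simulation:
# f.flush(); f.close()
```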

ericpre commented 4 years ago

Yep (take a look at the calls below to make sure I interpreted correctly what you asked me to do). The following took 1 min 6 sec and 1 min 5 sec, respectively. In the second one I crop the output to 0.1 mrad with -4DA 0.1 and -4DC true.

prismatic -i cells/n10_Prismatic_RT.xyz -r 0.1 -4D true
prismatic -i cells/n10_Prismatic_RT.xyz -r 0.1 -4D true -4DC true -4DA 0.1

Not sure which one you are using and whether -4DA 0.1 is any different from the default non-cropped value, because I suspect 0.1 may be parsed as 100 mrad. A quick check of the file size should tell whether the data was actually cropped or not. Anyway, to make sure we are all talking about the same thing, I have reproduced this with the https://github.com/prism-em/prismatic/tree/dev branch and the results are similar:

3D output: 34 s
4D output: 3 min 37 s
4D output crop: 3 min 41 s

I have used multislice (to avoid the calculation of the S-matrix on the CPU) and disabled the CPU workers. test_4D_dev0.zip

I haven't done a deep profile of how much time is spent where, but I suspect that in this case it's not so dependent on the size of the array but rather on the overhead of opening, accessing, and selecting components of the HDF5 file stream.

In the example above, the timing difference between the 3D and 4D output per pixel is about 5 ms ((223-34)/(191*191)), which could indeed be related to the overhead, even if this is still quite high! There is actually no need to write to file at the end of each probe; it could be done in batches, since it should be possible to fit many probes into memory.
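A sketch of that batching idea (the file name, dataset name, shapes, and the row-at-a-time grouping are assumptions):

```python
# Buffer a full scan row of CBEDs in host memory, then write it with a single
# hyperslab call instead of one HDF5 write per probe.
import numpy as np
import h5py

with h5py.File("output_4D_batched.h5", "w") as f:    # demo file, not prismatic's output
    dset = f.create_dataset("CBED", shape=(191, 191, 128, 128), dtype="f4")
    row = np.zeros((191, 128, 128), dtype=np.float32)   # one scan row (191 probes, as above)
    # ...fill `row` as the probes in scan row 0 finish, then:
    dset[0, :, :, :] = row        # one write amortizes the per-access overhead
```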

I think this could probably be "easily" solved with parallel HDF5; after all, the probes are inherently thread-safe operations since they operate on separate parts of real space, so the file IO could be truly parallel. I haven't been able to invest time into figuring it out, though. The documentation for it is also hard to parse, and there is not much (recent) discussion of implementing such features on either the HDF5 forums or forums like Stack Overflow. I'm pretty sure h5py supports parallel IO, so it might be worth investigating their implementation at some point.

Maybe this library could be useful: https://github.com/BlueBrain/HighFive

thomasaarholt commented 3 years ago

@lerandc Have you had a chance to look at this?