lyogavin / airllm

AirLLM 70B inference with single 4GB GPU
Apache License 2.0

More clever batching of layers #44

Open priamai opened 11 months ago

priamai commented 11 months ago

Hello, this is an awesome project; I replicated it on Modal Labs on a small T4 GPU.

The problem I see now is that by loading one layer at a time, you are not maximizing GPU VRAM usage. For instance, in this case it used only 1.6 GB of VRAM, which I guess is the size of one layer.

[screenshot: GPU VRAM usage, ~1.6 GB in use]

Would it be possible instead to load N layers with a configuration parameter?

Code example here: https://gist.github.com/priamai/61aa332c42b89f518dcf134c38dd593d
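
Something like this hypothetical interface is what I have in mind. To be clear, layer_batch_size is a made-up parameter, not part of the current airllm API, and the class/model id just follow the README example:

    # Hypothetical usage sketch: layer_batch_size does not exist in airllm today,
    # it only illustrates the kind of knob I am proposing.
    from airllm import AirLLMLlama2

    model = AirLLMLlama2(
        "garage-bAInd/Platypus2-70B-instruct",
        layer_batch_size=4,  # hypothetical: keep 4 decoder layers resident in VRAM at once
    )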

lyogavin commented 11 months ago

I'll try, but my understanding is that the bottleneck is not there.

The current bottleneck is loading the model from disk -> GPU memory. Batching more layers most likely won't help.

githubpradeep commented 11 months ago

@lyogavin I tried this out today and have a suggestion. What I noticed is that the GPU is not fully utilized in this case. For example: [screenshot, 2023-11-30, showing low GPU utilization]

Can you load multiple layers into CPU and GPU to maximize the utilization? What I feel is that loading from disk is kind of slow, so it would be better to load multiple layers in one read call.
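
Roughly what I mean, as a sketch only (load_layer_from_disk is a placeholder, not airllm's actual reader): read the next layers from disk on a background thread while the already-read ones are moved to the GPU.

    import threading
    from queue import Queue

    # Sketch of a disk -> CPU prefetcher. load_layer_from_disk(i) stands in for
    # however the per-layer shard is actually read from disk.
    def prefetch_layers(load_layer_from_disk, num_layers, prefetch_depth=2):
        q = Queue(maxsize=prefetch_depth)

        def producer():
            for i in range(num_layers):
                q.put(load_layer_from_disk(i))  # CPU/RAM only, no GPU work here
            q.put(None)  # sentinel: no more layers

        threading.Thread(target=producer, daemon=True).start()

        while (state_dict := q.get()) is not None:
            # Move the already-read layer to the GPU while the producer reads ahead.
            yield {k: v.to("cuda", non_blocking=True) for k, v in state_dict.items()}

That way the disk read for layer N+1 overlaps with the transfer/compute of layer N instead of being serialized behind it.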

aisu-programming commented 11 months ago

Interested too.

volkerjaenisch commented 11 months ago

@githubpradeep I experience the same effect. GPU utilization is around 20%, and the GPU memory is not used completely: I have a 4090 with 24 GB and only about 2 GB are used. So computing more layers on the GPU would be possible (given enough GPU RAM). This would also reduce the number and size of disk accesses.

returning kvcache size: torch.Size([1, 8, 25, 128])
total disk loading time: 5.5500
total gpu loading time: 25.3290
total compression overhead time: 3.1772

The GPU loading time of 25 sec is IMHO quite long compared to the disk loading time. I have really fast SSDs, but the internal memory bandwidth/clock speed should be even faster. I will dig into this - it looks like a bug to me.

Cheers, Volker

priamai commented 11 months ago

Indeed, and in fact you could load layers from disk in batches; I think both .bin and .gguf should support this kind of IO. Also, RAM is comparatively cheap, so I would rather use a larger-RAM instance, load as many layers from disk as possible, and then shuffle them to the GPU as fast as I can without hitting the disk too much.
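
A rough sketch of that RAM staging idea (names are illustrative, not the project's code); pinning the host memory is what would make the later host-to-device copies fast:

    # Illustrative only: stage several layers' tensors in pinned host RAM first,
    # then ship them to the GPU with fast, asynchronous host-to-device copies.
    def stage_layers_in_ram(layer_state_dicts):
        return [{k: v.pin_memory() for k, v in sd.items()} for sd in layer_state_dicts]

    def move_staged_to_gpu(staged, device="cuda"):
        return [{k: v.to(device, non_blocking=True) for k, v in sd.items()} for sd in staged]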

volkerjaenisch commented 11 months ago

I fixed the GPU-loading timing calculation to synchronize with CUDA before stopping the clock:

                t = time.process_time()
                self.move_layer_to_device(state_dict)
                # wait for the asynchronous copies to the GPU to finish before stopping the clock
                torch.cuda.synchronize()
                elapsed_time = time.process_time() - t
                # profile
                if self.profiling_mode:
                    total_gpu_loading_time.append(elapsed_time)

This corrects the GPU-loading measurement from 25 sec to 20 sec.
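
As an alternative (just a sketch, dropped into the same place as the snippet above), CUDA events would measure the transfer on the GPU timeline directly instead of via process_time():

    import torch

    # Sketch: measure the copy on the GPU timeline itself.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    self.move_layer_to_device(state_dict)     # same call as in the snippet above
    end.record()

    torch.cuda.synchronize()                  # wait until both events have happened
    gpu_loading_ms = start.elapsed_time(end)  # elapsed milliseconds between the events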

But the loading time to the GPU still seems way too long. I checked the bandwidth of my GPU to make sure there is no problem there, and the bandwidth is OK:

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: NVIDIA GeForce RTX 4090
 Shmoo Mode

.................................................................................
 Host to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)    Bandwidth(GB/s)
   1000             0.4
   2000             0.8
   3000             1.2
   4000             1.7
...
   15000            5.2
   16000            5.3
   17000            5.4
...
   40000000         13.3
...
   60000000         13.3
   64000000         13.3
   68000000         13.3

This seems to level out at 13.3GB/sec.

OK, I patched the benchmark to allow for bigger chunks:

   832000000        13.2
   1232000000       13.2
   1632000000       13.2

So moving 80 layers of 1.6 GB each to the GPU at 13.2 GB/s should take 80*1.6/13.2 = 9.7 sec, which is half of the time currently spent. The problem is IMHO the zoo of small tensors, some only 64 elements long.
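
One way around the many-small-tensors problem, purely as a sketch (and assuming a layer's tensors share one dtype, which may not hold here), would be to pack them into a single pinned buffer and issue one large copy:

    import torch

    # Sketch: pack a layer's many small tensors into one pinned host buffer,
    # copy it in a single large transfer, then carve device views back out.
    def copy_layer_coalesced(state_dict, device="cuda"):
        host_buf = torch.cat([v.reshape(-1) for v in state_dict.values()]).pin_memory()
        dev_buf = host_buf.to(device, non_blocking=True)  # one big H2D transfer

        out, offset = {}, 0
        for k, v in state_dict.items():
            n = v.numel()
            out[k] = dev_buf[offset:offset + n].view(v.shape)
            offset += n
        return out

Whether the CPU-side packing eats the gain again would have to be measured, of course.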

Optimizing the loading of a single layer into the GPU may save about 10 seconds per step. Each step takes about 60 seconds on the 4090, so this would be roughly a 17% gain. Not so promising.

So I think it is a really good idea to keep more layers in the GPU (let's denote the number layer_batch_size). The loading time will not increase, since the same amount of data has to be transferred; in fact it should decrease, since fewer intermediate results have to be transferred.

But at least in the case of the 4090 there is 80% unused GPU time and 20 GB of unused GPU RAM. This would allow for a layer_batch_size of 10 and could bring a performance gain of up to a factor of 5.
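
In pseudo-form, the batched loop I am thinking of would look roughly like this (load_layer and run_layer are placeholders, not airllm's actual functions):

    # Illustrative sketch of a batched layer loop: keep layer_batch_size layers
    # resident on the GPU, run them all on the hidden states, then free them.
    def run_layers_batched(load_layer, run_layer, num_layers, hidden, layer_batch_size=10):
        for start in range(0, num_layers, layer_batch_size):
            stop = min(start + layer_batch_size, num_layers)
            batch = [load_layer(i) for i in range(start, stop)]  # disk -> GPU once per chunk
            for layer in batch:
                hidden = run_layer(layer, hidden)  # intermediate activations stay on the GPU
            del batch  # release the chunk's VRAM before loading the next one
        return hidden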

Cheers, Volker

lyogavin commented 11 months ago

torch.cuda.synchronize()

Great job. Yes. I'll fix the profiling and look into a few possible improvements.

lyogavin commented 11 months ago

@lyogavin I tried this out today and have a suggestion. What I noticed is that the GPU is not fully utilized in this case. For example: [screenshot, 2023-11-30, showing low GPU utilization]

Can you load multiple layers into CPU and GPU to maximize the utilization? What I feel is that loading from disk is kind of slow, so it would be better to load multiple layers in one read call.

Thanks for the suggestions. Yes, GPU utilization may currently not be high, but my guess is that the main reason is not that we only load a single layer. It is either (1) a bottleneck at disk loading time, so the GPU is mostly idle waiting for data, or (2) there simply isn't enough compute to parallelize. I'll have to do more profiling to confirm which one is the root cause before fixing it.
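
To confirm, I would time the three phases separately, roughly like this (read_layer, to_gpu and forward_layer are placeholder names, not the actual airllm code):

    import time
    import torch

    # Sketch: time disk reads, host-to-device copies, and compute separately per step.
    def profile_step(read_layer, to_gpu, forward_layer, num_layers, hidden):
        disk_s = copy_s = compute_s = 0.0
        for i in range(num_layers):
            t0 = time.perf_counter()
            state_dict = read_layer(i)
            disk_s += time.perf_counter() - t0

            t0 = time.perf_counter()
            layer = to_gpu(state_dict)
            torch.cuda.synchronize()
            copy_s += time.perf_counter() - t0

            t0 = time.perf_counter()
            hidden = forward_layer(layer, hidden)
            torch.cuda.synchronize()
            compute_s += time.perf_counter() - t0

        print(f"disk {disk_s:.2f}s  h2d {copy_s:.2f}s  compute {compute_s:.2f}s")
        return hidden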

priamai commented 11 months ago

Also take a look at this recent blog post from PyTorch with optimization strategies.

Torch.compile allows us to capture a larger region into a single compiled region, and particularly when run with mode=”reduce-overhead”, is very effective at reducing CPU overhead. Here, we also specify fullgraph=True, which validates that there are no “graph breaks” in your model (i.e. portions that torch.compile cannot compile). In other words, it ensures that torch.compile is running to its fullest potential.
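
For reference, the pattern the blog describes is roughly the following (decode_one_token here is just a placeholder for the model's per-token forward step):

    import torch

    def decode_one_token(model, x):
        # placeholder for the model's per-token forward step
        return model(x)

    # Compile the decoding step once, trading compile time up front
    # for lower per-token CPU overhead.
    compiled_decode = torch.compile(
        decode_one_token,
        mode="reduce-overhead",  # uses CUDA graphs to cut Python/launch overhead
        fullgraph=True,          # fail loudly if torch.compile hits a graph break
    )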

lyogavin commented 11 months ago

Also take a look at this recent blog post from PyTorch with optimization strategies.

Torch.compile allows us to capture a larger region into a single compiled region, and particularly when run with mode=”reduce-overhead”, is very effective at reducing CPU overhead. Here, we also specify fullgraph=True, which validates that there are no “graph breaks” in your model (i.e. portions that torch.compile cannot compile). In other words, it ensures that torch.compile is running to its fullest potential.

Thanks, I will take a look when I have time.

volkerjaenisch commented 11 months ago

Also take a look at this recent blog post from PyTorch with optimization strategies.

I read this really interesting article, but IMHO these strategies will not help much for the current project.

Some of them, e.g. the floating-point-to-int quantization, are already included. And I am quite shocked that the blog states "without losing accuracy", which is IMHO absolute bull*.

The new compiler/optimizer and the clever idea of speculative decoding will not be of great help, since the NN here is in an ongoing state of change. Most higher optimization techniques only pay off if the thing being optimized lives much longer than the time spent optimizing it, and that is IMHO not the case here.

The only thing I can think of is the discussion on KV caching, but I am not familiar with that and hope someone else can shed light on it. @priamai: I cannot tell from your code whether the kv_cache_list lives in GPU or CPU memory?
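
A quick check like this would settle it, assuming kv_cache_list holds tensors or (key, value) tensor pairs:

    import torch

    # Quick check: print where the cached tensors actually live (cuda:0 vs cpu).
    for i, entry in enumerate(kv_cache_list):
        tensors = entry if isinstance(entry, (tuple, list)) else (entry,)
        print(i, [t.device for t in tensors if torch.is_tensor(t)])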

Please correct me if I am talking nonsense, since I am new to AI. But I have 40 years of experience optimizing math and code, and at the end of the day AI is nothing other than math and code.

Cheers, Volker