intelligent-machine-learning / glake

GLake: optimizing GPU memory management and IO transmission.
Apache License 2.0
376 stars 33 forks

Is there any speed test against Pytorch Memory and GMLake? #10

Closed jimmieliu closed 10 months ago

jimmieliu commented 11 months ago

Great work, guys! I wonder if there is any speed test comparing P-M (the PyTorch memory allocator) and G-M (GMLake) under use cases like LLM training with LR (LoRA + recomputation), without heavy CPU-GPU memory transfers.

My concern is that the proposed G-M technique might slow down training, and I cannot find any evidence that it does not, since your work has not been published yet.
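For concreteness, the kind of comparison I have in mind is a simple per-step timing harness run once under stock PyTorch (P-M) and once under a GMLake-enabled build (G-M). The sketch below is hypothetical and allocator-agnostic; the stand-in MLP and batch generator are placeholders for a real LoRA + recomputation training loop.

```python
# Hypothetical timing harness: run once under stock PyTorch (P-M) and once
# under a GMLake-enabled build (G-M), then compare the reported step times.
import time
import torch

def time_training_steps(model, optimizer, make_batch, num_steps=50, warmup=10):
    """Return per-step wall-clock times (s) for num_steps optimizer steps,
    taken after `warmup` untimed steps."""
    times = []
    for step in range(warmup + num_steps):
        inputs, targets = make_batch()
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        torch.cuda.synchronize()
        if step >= warmup:
            times.append(time.perf_counter() - t0)
    return times

if __name__ == "__main__":
    device = "cuda"
    # Stand-in MLP; a real comparison would use a LoRA + recomputation loop.
    model = torch.nn.Sequential(
        torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
    ).to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    make_batch = lambda: (
        torch.randn(32, 4096, device=device),
        torch.randint(0, 1024, (32,), device=device),
    )
    times = time_training_steps(model, optimizer, make_batch)
    print(f"mean step time: {sum(times) / len(times) * 1e3:.2f} ms")
```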

ruizhang1230 commented 11 months ago

First of all, thank you for your interest in our work. In fact, we conducted speed tests and found that our performance matches P-M. In our tests, except for the memory-efficient strategy, which includes ZeRO-offload test cases, all other test cases use LoRA + recomputation without ZeRO offload.

Our tests are designed to match the requirements of large-model training configurations; more detailed explanations will be provided in our paper. You can also run the examples from the Docker image we released, which use the OPT-1.3B model, and verify that our method introduces no performance degradation. A slight slowdown occurs only in the initial steps, when stitching operations are performed; once all possible block combinations have been stitched, performance is consistent with PyTorch.
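One way to observe this warm-up behavior yourself is to log per-step latency alongside allocated and reserved GPU memory using the standard torch.cuda reporting calls. The sketch below is only illustrative; `train_step` is a placeholder for one real training step (forward, backward, optimizer update), not part of GLake.

```python
# Minimal sketch: log per-step latency together with allocated/reserved GPU
# memory, and check that step time settles after the first few steps.
import time
import torch

def log_steps(train_step, num_steps=20):
    for step in range(num_steps):
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        train_step()  # placeholder: one forward/backward/optimizer step
        torch.cuda.synchronize()
        elapsed_ms = (time.perf_counter() - t0) * 1e3
        allocated_mib = torch.cuda.memory_allocated() / 2**20
        reserved_mib = torch.cuda.memory_reserved() / 2**20
        print(f"step {step:3d}: {elapsed_ms:7.2f} ms | "
              f"allocated {allocated_mib:8.1f} MiB | "
              f"reserved {reserved_mib:8.1f} MiB")
```

If the behavior described above holds, the first few steps will report higher latencies while stitching takes place, and the later steps should be comparable to stock PyTorch.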

For a more detailed explanation, please stay tuned for our upcoming paper.