jiaweizzhao / GaLore

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

A few questions regarding the results and methodology. #28

Open · roymiles opened this issue 3 months ago

roymiles commented 3 months ago

Hi, thanks for releasing this work! It has all been very interesting to read. However, I do have a few questions regarding your results and methodology.

  1. For Table 4, it seems that you train with a batch size of 16 but report the memory of these runs divided by 16. There would also be a memory overhead from the model weights that is greater than the memory reported in the table. Is this way of reporting memory commonly done, since it does not capture the entire picture? The memory is also reported as the same across all sub-tasks, yet for some tasks you use a different batch size (e.g. 32 for CoLA).

  2. When you report the memory, do you include the overhead of allocating memory for the SVD? SVD can have a large memory overhead in practice, especially considering it is only implemented in 32-bit.

  3. Figure 1 shows the impressive result of reducing the memory cost of training LLaMA 7B to within the budget of an RTX 4090. I have noticed that you also use an adaptive low-memory optimisation method (AdaLOMO). I am curious how much of the memory improvement comes from the gradient low-rank projection and how much comes just from AdaLOMO. https://github.com/OpenLMLab/LOMO

  4. What do you mean by "token batch size"? Is this just the number of tokens for a single iteration?

  5. The RoBERTa-base fine-tuning results also seem to be very different from the results reported in the original LoRA paper.

Thanks again!

jiaweizzhao commented 3 months ago

Thanks for your interest. Here are the answers to your questions:

  1. The memory reported in Table 4 follows the same standard as the memory estimate in Table 2: the total memory of the parameters and optimizer states in BF16 format. We report only this total on purpose, to make it independent of the choice of batch size (a rough back-of-the-envelope version of this estimate is sketched after this list).

  2. The memory in Tables 2 and 4 does not include the SVD overhead. Since the SVD can be computed either on CPU or GPU, its memory overhead is not included in the memory estimate. In practice, we use torch SVD, which adds roughly 1GB of memory overhead for the 7B model, but total training still stays under the memory budget of an RTX 4090 (24GB). We will report more actual memory usage in the future (see the SVD sketch below).

  3. AdaLOMO primarily reduces memory consumption by combining per-layer weight updates with Adafactor. It benefits more when the underlying tasks and models are Adafactor-friendly; however, Adafactor (no beta1) sometimes causes worse performance and instability during training (a sketch of the per-layer update idea is given below).

  4. Yes, the token batch size is the number of tokens processed in a single iteration (a small worked example is given below).

  5. The difference in the RoBERTa-base results might be due to different configurations. We are conducting more fine-tuning experiments, which will be released in our next version.
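
For reference, here is a rough back-of-the-envelope version of the memory estimate described in answer 1. This is my own sketch of the "parameters + optimizer states in BF16" accounting, not the paper's script, and the low-rank state fraction used at the end is purely illustrative.

```python
BYTES_PER_BF16 = 2  # weights and optimizer states are both counted in BF16

def adam_memory_gb(n_params: float) -> float:
    # Adam keeps two states (first and second moments) per parameter.
    return (n_params + 2 * n_params) * BYTES_PER_BF16 / 1024**3

def galore_memory_gb(n_params: float, n_lowrank_entries: float) -> float:
    # GaLore keeps the Adam states only in the projected low-rank space;
    # n_lowrank_entries depends on the rank r chosen per layer (assumption).
    return (n_params + 2 * n_lowrank_entries) * BYTES_PER_BF16 / 1024**3

# Illustrative numbers only: 7B parameters, low-rank states ~1/8 of full size.
print(f"Adam baseline: {adam_memory_gb(7e9):.1f} GB")
print(f"GaLore:        {galore_memory_gb(7e9, 7e9 / 8):.1f} GB")
```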
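
On the SVD overhead in answer 2: torch's SVD only runs in 32-bit, so the BF16 gradient has to be upcast before the projection matrix is computed, which is where the transient overhead comes from. The sketch below is a simplified version of that projection step (the actual GaLore projector also chooses between left and right singular vectors depending on the matrix shape); `compute_projection` is a hypothetical helper name.

```python
import torch

def compute_projection(grad: torch.Tensor, rank: int) -> torch.Tensor:
    # torch.linalg.svd does not support BF16, so the gradient is upcast to FP32
    # for the decomposition; this temporary FP32 copy is the main SVD overhead.
    # Running it on CPU (grad.cpu()) trades speed for GPU memory.
    U, S, Vh = torch.linalg.svd(grad.float(), full_matrices=False)
    # Keep the top-`rank` left singular vectors as the projection matrix.
    return U[:, :rank].to(grad.dtype)

# Usage: project a full-rank gradient into the low-rank subspace.
g = torch.randn(4096, 4096, dtype=torch.bfloat16)
P = compute_projection(g, rank=128)   # [4096, 128]
low_rank_grad = P.T @ g               # [128, 4096], what the optimizer states track
```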
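
On answer 3: the "per-layer weight update" trick that LOMO/AdaLOMO rely on fuses the optimizer step into the backward pass, so gradients for all layers never have to coexist in memory. Below is a minimal sketch of that idea using plain SGD (AdaLOMO additionally uses Adafactor-style factored second moments); this is my own illustration, not the LOMO implementation, and it assumes PyTorch >= 2.1 for `register_post_accumulate_grad_hook`.

```python
import torch

def attach_fused_sgd(model: torch.nn.Module, lr: float = 1e-3) -> None:
    """Apply a plain SGD step inside the backward pass, one parameter at a time."""

    def hook(param: torch.Tensor) -> None:
        # Called right after this parameter's gradient is accumulated:
        # update in place, then free the gradient immediately so full-model
        # gradients are never held in memory at the same time.
        with torch.no_grad():
            param.add_(param.grad, alpha=-lr)
        param.grad = None

    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(hook)
```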
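
And for answer 4, a small worked example of the token batch size arithmetic; all the numbers below are placeholders picked for illustration, not values from the paper.

```python
# token batch size = sequences per optimizer step x tokens per sequence
per_device_batch = 16   # sequences per device per micro-step (assumed)
seq_len = 256           # tokens per sequence (assumed)
grad_accum_steps = 2    # gradient accumulation steps (assumed)
num_devices = 1         # data-parallel workers (assumed)

token_batch_size = per_device_batch * seq_len * grad_accum_steps * num_devices
print(token_batch_size)  # 8192 tokens per iteration
```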