jiaweizzhao / GaLore

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

How can I do continued pre-training using this? #21

Open Aloukik21 opened 3 months ago

jiaweizzhao commented 3 months ago

For now, you can specify your checkpoint path using args.continue_from in torchrun_main.py
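
For example, a minimal sketch (not the exact command): the pre-training runs in the README pass GaLore options to torchrun_main.py, so a continued run might look roughly like the command below. The checkpoint path is a placeholder, and the flag names (--continue_from, --optimizer, --rank, --update_proj_gap, --galore_scale) should be double-checked against the argument parser in torchrun_main.py.

# hypothetical continued pre-training run; verify flag names against torchrun_main.py
torchrun --standalone --nproc_per_node 1 torchrun_main.py \
    --model_config configs/llama_60m.json \
    --continue_from path/to/your/checkpoint \
    --optimizer galore_adamw \
    --rank 128 \
    --update_proj_gap 200 \
    --galore_scale 0.25 \
    --lr 0.01 \
    --batch_size 256 \
    --total_batch_size 512 \
    --num_training_steps 10000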

jjhoow commented 3 months ago

Piggybacking on this question: can I use galore_adamw8bit_per_layer to train specific layers while freezing the others? If so, could I use LLaMA-Pro (https://github.com/TencentARC/LLaMA-Pro) to add layers to a model like Mistral and then train only those new layers with GaLore?

I thought of something like this:

for layer_num, layer in enumerate(model.model.layers):
    for p in layer.parameters():
        # keep only layers 32-39 trainable, freeze everything else
        p.requires_grad_(32 <= layer_num <= 39)

trainable_params = [p for p in model.parameters() if p.requires_grad]

jiaweizzhao commented 3 months ago

@jjhoow per-layer GaLore should achieve this out of the box, since it assigns an optimizer to a parameter only if param.requires_grad is True, see here: https://github.com/jiaweizzhao/GaLore?tab=readme-ov-file#save-weight-gradient-memory-using-per-layer-weight-updates
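
For completeness, here is a minimal sketch of how the freezing loop above could be combined with the per-layer update pattern from that README section. It assumes model is an already loaded causal LM (e.g. a depth-expanded Mistral), PyTorch >= 2.1 for register_post_accumulate_grad_hook, and the layer range 32-39 from the snippet above; the GaLore hyperparameters (rank, update_proj_gap, scale, lr) are only illustrative.

import bitsandbytes as bnb
from galore_torch import GaLoreAdamW8bit

# model: already loaded HF-style causal LM (assumed to exist in your training script)
# keep only the expanded blocks trainable, freeze the rest
for layer_num, layer in enumerate(model.model.layers):
    for p in layer.parameters():
        p.requires_grad_(32 <= layer_num <= 39)

# GaLore projects 2D weight matrices; other trainable params fall back to plain 8-bit Adam
galore_params = [p for p in model.parameters() if p.requires_grad and p.dim() == 2]
id_galore_params = {id(p) for p in galore_params}

# one optimizer per trainable parameter; frozen parameters never get one
optimizer_dict = {}
for p in model.parameters():
    if not p.requires_grad:
        continue
    if id(p) in id_galore_params:
        optimizer_dict[p] = GaLoreAdamW8bit(
            [{"params": [p], "rank": 128, "update_proj_gap": 200,
              "scale": 0.25, "proj_type": "std"}],
            lr=0.01,
        )
    else:
        optimizer_dict[p] = bnb.optim.Adam8bit([p], lr=0.01)

# per-layer weight updates: step and reset each optimizer as soon as its gradient is ready
def optimizer_hook(p):
    if p.grad is None:
        return
    optimizer_dict[p].step()
    optimizer_dict[p].zero_grad()

for p in model.parameters():
    if p.requires_grad:
        p.register_post_accumulate_grad_hook(optimizer_hook)

Since frozen parameters never receive a hook or an optimizer, only the chosen layers are updated.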

jjhoow commented 3 months ago

@jiaweizzhao I ran the test earlier with the code above using galore_adamw8bit_per_layer. I still need to get the hyperparameters right, because I noticed the training curve swinging like a roller coaster.