huggingface / peft

🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
https://huggingface.co/docs/peft
Apache License 2.0

Is it possible to support the new Bone method? #2138

Closed JL-er closed 1 month ago

JL-er commented 1 month ago

Feature request

In the paper (https://arxiv.org/pdf/2409.15371) we introduce a brand-new PEFT (Parameter-Efficient Fine-Tuning) method: Bone (Block Affine). It uses a completely new structure that differs from the LoRA family, and in terms of performance it has already surpassed PiSSA.

Therefore, I hope that Bone can be added to the PEFT repository. Would it be possible for the PEFT maintainers to help with the adaptation, or to provide a simple template so that I can adapt it myself? (You can learn more about the structure of Bone in the Bone repository.)

Motivation

We hope that the PEFT repository will add Bone.

Your contribution

I have written the paper and code for Bone, and will subsequently verify and test it on PEFT.

BenjaminBossan commented 1 month ago

Yes, it's possible to add new methods such as the one you propose.

From reading the paper, Bone would be a completely new method rather than a modification of an existing one (e.g. PiSSA is a modification of LoRA). As a new method, it requires a little extra work. It's probably easiest to work from an example: we recently added HRA as a new method, and you can check the pull request here: #1864. For Bone, something similar would be required.
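For orientation, the user-facing API that a new method ends up with looks roughly like the sketch below, shown here with HRA since it already exists in PEFT; a merged Bone method would expose an analogous config class (the model checkpoint is only an example):

```python
# Rough sketch of the user-facing API a new PEFT method exposes, using HRA
# (already in PEFT) as the reference; Bone would get an analogous BoneConfig.
# The checkpoint name is only an example.
from transformers import AutoModelForCausalLM
from peft import HRAConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
config = HRAConfig(r=8, target_modules=["q_proj", "v_proj"])
peft_model = get_peft_model(base_model, config)
peft_model.print_trainable_parameters()  # only the adapter weights are trainable
```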

To get you started, you could create a draft PR without any examples or documentation. Start with a simple test, e.g. by extending the test matrix here in the same way HRA did.
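If it helps, the custom-model test matrix is essentially a list of (test name, model, config class, config kwargs) entries, so the Bone additions could look roughly like the sketch below; BoneConfig is the class your PR would introduce, and its exact hyperparameters are still open:

```python
# Hypothetical additions to the TEST_CASES matrix in tests/test_custom_models.py,
# modeled on the HRA entries from #1864. BoneConfig only exists once the Bone PR
# adds it, and the kwargs shown for it are placeholders.
from peft import LoraConfig  # an existing entry, shown for comparison
from peft import BoneConfig  # placeholder: provided by the Bone PR

TEST_CASES = [
    ("Vanilla MLP 1 LoRA", "MLP", LoraConfig, {"target_modules": "lin0"}),
    # new Bone cases follow the same (name, model, config class, kwargs) shape:
    ("Vanilla MLP 1 Bone", "MLP", BoneConfig, {"target_modules": "lin0"}),
    ("Vanilla MLP 2 Bone", "MLP", BoneConfig, {"target_modules": ["lin0", "lin1"]}),
]
```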

From this draft, we can iterate until we finalize the contribution. Does that work for you?

Also, just some feedback on the paper (of course, take it or leave it, I'm not an ML researcher):

JL-er commented 1 month ago

Thank you very much for your reply and feedback.

JL-er commented 1 month ago

I retrained LoRA and tested it on MATH. Here are the results:

[screenshot: LoRA evaluation results]

The test script can be found in the Bone repository, and Bone was tested with the same parameters (the Bone method was adapted to PEFT in the PEFT-Bone repository). Here are the results:

[screenshots: Bone evaluation results]

BenjaminBossan commented 1 month ago

> I retrained LoRA and tested it on MATH. Here are the results:

Thanks for re-testing. So the results are not much different from what you reported in the paper. I checked the training script; I assume it's this one and that you used the default arguments except for the ones defined here, is that right?

In that case, the learning rate would be 5e-5, right? Maybe you could try a higher one for LoRA, like 5e-4 or even higher. I also see lora_alpha=32 for a rank of 8; could you try 16 instead?
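Concretely, the change I have in mind is just something like the following, with everything else left as in your script (the target modules below are only an example):

```python
# Suggested LoRA settings to re-try: a higher learning rate and lora_alpha = 2 * r.
# Target modules are only an example; keep whatever the original script uses.
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,  # instead of 32
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
learning_rate = 5e-4  # instead of 5e-5; possibly even higher
```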

JL-er commented 1 month ago

Now you can check the settings in hf-ft/scripts within the Bone repository. The parameters for Bone, PiSSA, and LoRA are all the same apart from their respective structural differences, consistent with the experimental settings in the paper. The LoRA settings are based on those in the PiSSA paper and have not been deliberately tuned to their best configuration (I believe we don't need to worry about whether the LoRA settings are optimal, because OLoRA and PiSSA already significantly outperform LoRA, so ensuring that PiSSA and OLoRA are set up correctly is sufficient).

LoRA's performance on MATH is indeed surprisingly poor, but it still performs normally on GSM8K (retrained three times and tested three times). The MATH tests were conducted with the OpenCompass repository.

I am very confident about Bone's performance. However, during computation the intermediate values cause high memory usage, so checkpointing is currently needed to address this. That makes Bone slower than the LoRA family, but if checkpointing is also required when training with the LoRA family, there isn't much difference in speed between them when training a 12B model.
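For example, the checkpointing can be enabled in the standard way on the base model (the exact setup in the Bone scripts may differ slightly, and the checkpoint name is only an example):

```python
# One standard way to enable gradient/activation checkpointing when training a
# PEFT adapter; the exact setup used in the Bone scripts may differ.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model.gradient_checkpointing_enable()  # recompute activations in the backward pass
model.enable_input_require_grads()     # needed so gradients flow to the adapter inputs
# then wrap with get_peft_model(model, adapter_config) and train as usual
```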