huggingface / peft

🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
https://huggingface.co/docs/peft
Apache License 2.0

Integrating X-LoRA #1472

Closed: EricLBuehler closed this issue 8 months ago

EricLBuehler commented 9 months ago

Hello all, and thank you for your great work!

Earlier this week, we announced X-LoRA, a flexible MoE approach for LoRA adapters. We implement deep layer- and token-wise scalings for multiple LoRA adapters and provide an implementation (https://github.com/EricLBuehler/xlora) that can be applied straightforwardly to any model to which peft LoRA adapters can be applied. This makes it possible to orchestrate adapters at a much finer level, that is, to achieve new combinations of adapter layers, resulting in novel per-token, deep layer-wise combinations of parameters to solve specific tasks. Sample weights are provided at https://huggingface.co/lamm-mit/x-lora, with protein-science examples in the paper.
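
For readers who want the core idea in code, here is a rough conceptual sketch of mixing several LoRA adapters with per-token, per-layer scalings (all names are illustrative; this is not the actual xlora implementation):

    import torch
    import torch.nn as nn

    class MixedLoraLayer(nn.Module):
        """Illustrative only: a frozen base linear layer plus several LoRA-style
        adapters whose outputs are mixed with learned per-token scalings."""

        def __init__(self, base: nn.Linear, adapters: list[nn.Module]):
            super().__init__()
            self.base = base
            self.adapters = nn.ModuleList(adapters)  # each maps in_features -> out_features

        def forward(self, x: torch.Tensor, scalings: torch.Tensor) -> torch.Tensor:
            # x:        (batch, seq_len, in_features)
            # scalings: (batch, seq_len, num_adapters), e.g. the softmax output of
            #           a small classifier run over the hidden states
            out = self.base(x)
            for i, adapter in enumerate(self.adapters):
                out = out + scalings[..., i : i + 1] * adapter(x)
            return out

The scalings are predicted separately for every token and every layer, which is what allows new combinations of the adapters to emerge.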

Would you be interested in integrating X-LoRA into peft? I would be happy to work on this if there is interest from you and the community.

BenjaminBossan commented 9 months ago

Thanks for bringing this to our attention, this could be an interesting addition. I only had a quick glance at the repo, but it looks quite solid to me.

I think something we should pay attention to if we want to integrate it into PEFT is the API. Ideally, we would like to be able to instantiate an X-LoRA model (or load a trained one) just from a config, using the usual PEFT/transformers methods. That way, we could ensure that users can make use of X-LoRA without running custom code. I'm not sure how feasible that is, but it would be great to have.
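
For illustration, the kind of flow I have in mind might look roughly like this (purely a sketch; the checkpoint path is a placeholder and XLoraConfig is a hypothetical class name, not something that exists in PEFT today):

    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained("base-model-id")

    # Hypothetical: create a fresh X-LoRA model directly from a config, e.g.
    # model = get_peft_model(base, XLoraConfig(...))

    # ... or restore a trained one with the standard PEFT entry point, with the
    # X-LoRA config and classifier stored alongside the adapters.
    model = PeftModel.from_pretrained(base, "path/to/xlora-checkpoint")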

EricLBuehler commented 9 months ago

I think this would be feasible. The primary entry point for converting a model to X-LoRA is the function add_xlora_to_model. It already takes a PreTrainedModel, and we have a config class that could easily be ported. Specifically, would it make sense to add a new model type class and port the mechanics from add_xlora_to_model into the PEFT mapping?
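
For reference, usage of the current entry point looks roughly like this (the argument list is abridged and illustrative rather than the exact xlora signature):

    import torch
    import xlora
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained("base-model-id")

    # Abridged/illustrative arguments; adapter names map to saved LoRA checkpoints.
    xlora_model = xlora.add_xlora_to_model(
        model=model,
        xlora_config=xlora.xLoRAConfig(
            hidden_size=model.config.hidden_size,
            device=torch.device("cuda"),
            adapters={"adapter_1": "path/to/adapter_1", "adapter_2": "path/to/adapter_2"},
        ),
    )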

BenjaminBossan commented 9 months ago

> Would it make sense to add a new model type class and port the mechanics from add_xlora_to_model into the PEFT mapping?

Do you mean a new model class on the level of PeftModel, or something on the level of LoraModel? If the latter, then yes, I think that would be a good idea here. Maybe you could just start with a quick draft PR and we can take it from there? It's often easier to discuss once we can see the code.

EricLBuehler commented 9 months ago

I have an implementation on my fork that is nearly ready for a draft PR. However, after looking at how PeftModel initializes its base model, I think a modification to the instantiation call for the PeftModel base model may be necessary for xLoRAModel to sit on the same level as LoraModel.

This is because in the xlora library we create a PeftModel which needs to be modified based on the xLoRAConfig (loading adapters, injecting methods and hooks, etc.). For example, we override the save_pretrained method to save adapters and the classifier in specific locations.

Perhaps one solution is to pass a reference to the PeftModel when creating the xLoRAModel? This way, the necessary methods can be overridden, and the xLoRAModel would be on the same level as LoraModel. Do you think that this method of passing a reference would be a good idea, or is there another way?
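
To make the idea concrete, here is a rough sketch of what I mean (names are placeholders, not actual code):

    from peft import PeftModel

    class XLoraModelSketch:
        """Placeholder sketch of the 'pass a reference' idea."""

        def __init__(self, peft_model: PeftModel, xlora_config):
            # Keep a reference to the already-built PeftModel so adapters can be
            # loaded into it and methods such as save_pretrained overridden.
            self.peft_model = peft_model
            self.config = xlora_config

        def save_pretrained(self, save_directory: str, **kwargs):
            # Save the adapters and the classifier in specific locations, in
            # addition to what PeftModel.save_pretrained normally writes.
            self.peft_model.save_pretrained(save_directory, **kwargs)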

BenjaminBossan commented 9 months ago

It would be much easier to discuss possible solutions once I can see what the code looks like :)

EricLBuehler commented 9 months ago

Okay, sounds good! I will submit a draft PR after I finish some small tests.

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

A11en0 commented 4 months ago

The inference process appears to be very slow when I use four LoRA experts. In fact, it is significantly slower than four times the single-LoRA inference time. Could you please explain why? Thanks very much!

BenjaminBossan commented 4 months ago

@A11en0 Thanks for trying this out so quickly after it was merged :)

Do you have some code to replicate the issue? What times did you measure?

A11en0 commented 4 months ago

When using X-LoRA with four experts, processing 2000 samples, each with an average length of about 500, takes approximately 10 hours.

A11en0 commented 4 months ago

Here are some key pieces of code:

Load the model:

    from transformers import AutoModelForCausalLM, GenerationConfig
    from peft import PeftModel

    ...
    model = AutoModelForCausalLM.from_pretrained(model_weights).to("cuda")

    model.config.use_cache = False  # required for xlora
    if lora_weights:
        # Load the X-LoRA (or plain LoRA) checkpoint on top of the base model.
        model = PeftModel.from_pretrained(model, lora_weights)
    ...

Inference:

    model.eval()
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].to("cuda")

    generation_config = GenerationConfig(
        temperature=args["temperature"],
        top_p=args["top_p"],
        top_k=args["top_k"],
        num_beams=args["num_beams"],
        max_new_tokens=args["max_new_tokens"],
        do_sample=True,
    )

    generation_output = model.generate(
        input_ids=input_ids,
        pad_token_id=tokenizer.eos_token_id,
        generation_config=generation_config,  # do_sample=True is already set here
        return_dict_in_generate=True,
        output_scores=True,
    )

    # Decode only the newly generated tokens.
    s = generation_output.sequences[0]
    output = tokenizer.decode(s[input_ids.shape[1]:], skip_special_tokens=True).strip()

BenjaminBossan commented 4 months ago

Thanks for providing more details.

> When using X-LoRA with four experts, processing 2000 samples, each with an average length of about 500, takes approximately 10 hours.

This number alone does not really tell us much. How long does it take to process the individual LoRAs separately -- you mentioned it's "significantly slower"?

Some overhead is to be expected for running the classifier part, but normally that shouldn't be too much. Could you also share your X-LoRA config?

A11en0 commented 4 months ago

When I use a single LoRA, it takes only about 5 hours. Here is the X-LoRA part of my training config:

    saved_lora_adapters: lora_a,lora_b,lora_c,lora_d
    use_trainable_adapters: true
    softmax_temperature: 0.1

All other settings are left at their defaults.
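
If that maps directly onto the merged PEFT implementation, the equivalent Python setup would look roughly like this (a sketch; the field names are taken from my training config and should be checked against the actual XLoraConfig):

    from transformers import AutoModelForCausalLM
    from peft import XLoraConfig

    base_model = AutoModelForCausalLM.from_pretrained("base-model-id")

    # Sketch only: field names assumed, adapter paths are placeholders.
    config = XLoraConfig(
        task_type="CAUSAL_LM",
        hidden_size=base_model.config.hidden_size,
        adapters={
            "lora_a": "path/to/lora_a",
            "lora_b": "path/to/lora_b",
            "lora_c": "path/to/lora_c",
            "lora_d": "path/to/lora_d",
        },
        use_trainable_adapters=True,
        softmax_temperature=0.1,
    )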

BenjaminBossan commented 4 months ago

So in your case, running X-LoRA with 4 adapters is roughly twice as slow as running LoRA with a single adapter, right? That does indeed look a bit excessive, as most of the time should be spent in the base model, which is the same in both cases.

To test this further, I ran a small benchmark to check the overhead of X-LoRA. The benchmark uses a rather small model, bloomz-1b1, since the relative X-LoRA overhead is expected to be larger the smaller the base model is. The cache was deactivated. For inference, I found that X-LoRA was roughly 6 times slower than plain LoRA (the full numbers are in the attached notebook).

So there is indeed quite a significant overhead, at least for this "small" model. Not sure if @EricLBuehler has some insights into what the reasons could be.

I attached the notebook with the measurements. Download and change "txt" to "ipynb" to run it.

xlora-timing.txt
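
For anyone who wants to run a similar comparison without the notebook, a minimal timing harness along these lines should work (model names and adapter paths are placeholders; this is a sketch, not the exact code from the notebook):

    import time
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    def time_generate(model, tokenizer, prompt, n_runs=5, max_new_tokens=64):
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_runs):
            model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        torch.cuda.synchronize()
        return (time.perf_counter() - start) / n_runs

    tokenizer = AutoTokenizer.from_pretrained("base-model-id")

    # Single-adapter LoRA baseline.
    base = AutoModelForCausalLM.from_pretrained("base-model-id").to("cuda")
    lora_model = PeftModel.from_pretrained(base, "path/to/single-lora")
    t_lora = time_generate(lora_model, tokenizer, "Hello")

    # X-LoRA model; reload the base so the two runs do not share adapters, and
    # deactivate the cache as above.
    base = AutoModelForCausalLM.from_pretrained("base-model-id").to("cuda")
    base.config.use_cache = False
    xlora_model = PeftModel.from_pretrained(base, "path/to/xlora-checkpoint")
    t_xlora = time_generate(xlora_model, tokenizer, "Hello")

    print(f"LoRA: {t_lora:.3f} s/run  X-LoRA: {t_xlora:.3f} s/run")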

A11en0 commented 4 months ago

Thanks for your quick reply! I ran the Jupyter notebook you provided and reviewed the xlora source code. I noticed that xlora performs two forward passes in its pre_forward() function, which should increase inference time by at least 2 times compared to a normal LoRA model. However, this does not explain why your results show that the time taken by X-LoRA is only slightly more than LoRA. Could you help me with this once more?

BenjaminBossan commented 4 months ago

> I noticed that xlora performs two forward passes in its pre_forward() function, which should increase inference time by at least 2 times compared to a normal LoRA model.

True, this is most likely the reason for the slowdown, and I don't think it can be avoided given how X-LoRA works. As mentioned, there is also some overhead for generating the X-LoRA scalings, but that should be relatively small by comparison.
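
Schematically, the inference path looks something like this (pure pseudocode of the mechanism, not the actual PEFT implementation):

    import torch

    def xlora_forward(base_forward, classifier, x, num_adapters):
        """Conceptual sketch of the X-LoRA double forward pass.

        `base_forward(x, scalings)` is assumed to run the base model with the
        LoRA adapters weighted by `scalings`; `classifier` maps its output to
        logits over the adapters.
        """
        # Pass 1: run with neutral scalings only to obtain hidden states for
        # the scaling classifier.
        neutral = torch.zeros(*x.shape[:2], num_adapters)
        hidden = base_forward(x, neutral)
        scalings = torch.softmax(classifier(hidden), dim=-1)
        # Pass 2: the real forward pass with the predicted per-token scalings,
        # hence roughly twice the base-model compute per step.
        return base_forward(x, scalings)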

> However, this does not explain why your results show that the time taken by X-LoRA is only slightly more than LoRA.

In the results I reported above, X-LoRA is 6 times slower than just LoRA. Did you observe something else?