Closed by EricLBuehler 8 months ago
Thanks for bringing this to our attention, this could be an interesting addition. I just had a quick glance at the repo but it looks quite solid to me.
I think something we should pay attention to if we want to integrate it to PEFT is the API. Ideally, we would like to be able to initiate an X-LoRA model (or load a trained one) just based on a config using the usual PEFT/transformers methods. That way, we could ensure that users can make use of X-LoRA without running custom code. I'm not sure how feasible that is, but it would be great to have.
I think this would be feasible. The primary entry point for converting to X-LoRA is the function add_xlora_to_model. It already takes a PreTrainedModel, and we have a config class that could easily be ported. Specifically, would adding a new model type class, where I would port the mechanics from add_xlora_to_model to the mapping, make sense?
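For context, a rough sketch of the two calling conventions under discussion; the argument names of add_xlora_to_model and the X-LoRA PEFT config class shown below are assumptions, not the real signatures:

```python
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-1b1")

# Today (xlora library): convert an existing transformers model via a function call.
# Argument names are illustrative only.
# import xlora
# model = xlora.add_xlora_to_model(model=base_model, xlora_config=xlora.xLoRAConfig(...))

# Goal (PEFT integration): drive everything from a config, analogous to LoraConfig,
# so no custom code is needed. `XLoraConfig` is hypothetical at this point.
# from peft import get_peft_model
# model = get_peft_model(base_model, XLoraConfig(adapters={"a": "lora_a", "b": "lora_b"}))
```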
would adding a new model type class where I would port the mechanics from add_xlora_to_model to the mapping make sense?
Do you mean a new model class on the level of PeftModel, or something on the level of LoraModel? If the latter, yes, I think that would be a good idea here. Maybe you could just start with a quick draft PR and we see from there? Often it's easier to discuss once we can see the code.
I have an implementation that is nearly ready for a draft PR on my fork. However, after looking at how PeftModel initializes its base model, I think a modification to the instantiation call for the PeftModel base model may be necessary for xLoRAModel to be on the same level as LoraModel.
This is because in the xlora library we create a PeftModel which needs to be modified based on the xLoRAConfig (loading adapters, injecting methods and hooks, etc.). For example, we override the save_pretrained method to save adapters and the classifier in specific locations.
Perhaps one solution is to pass a reference to the PeftModel when creating the xLoRAModel? This way, the necessary methods can be overridden, and the xLoRAModel would be on the same level as LoraModel. Do you think that this method of passing a reference would be a good idea, or is there another way?
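A minimal sketch of what passing that reference could look like; the class, constructor arguments, and hook are hypothetical illustrations, not the actual xlora or PEFT code:

```python
from peft import PeftModel
from peft.tuners.lora import LoraModel


class XLoraModel(LoraModel):
    """Hypothetical tuner-level class that also receives the wrapping PeftModel,
    so it can override methods such as save_pretrained on the wrapper."""

    def __init__(self, model, config, adapter_name, peft_model: PeftModel = None):
        super().__init__(model, config, adapter_name)
        if peft_model is not None:
            self._hook_peft_model(peft_model)

    def _hook_peft_model(self, peft_model: PeftModel):
        # Illustrative only: wrap the original save_pretrained so adapters and the
        # X-LoRA classifier could be stored in specific locations.
        original_save = peft_model.save_pretrained

        def save_pretrained(save_directory, **kwargs):
            original_save(save_directory, **kwargs)
            # ... additionally save the X-LoRA classifier here ...

        peft_model.save_pretrained = save_pretrained
```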
It would be much easier to discuss possible solutions once I see what the code looks like :)
Ok, sounds good! I will submit a draft PR after I complete some small testing.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
It appears that the inference process is very slow when I use four LoRA experts. In fact, it is significantly slower than four times the single-LoRA inference time. Could you please explain why? Thanks very much!
@A11en0 Thanks for trying this out so quickly after it was merged :)
Do you have some code to replicate the issue? What times did you measure?
When using X-LoRA with four experts, processing 2000 samples (each averaging 500 in length) requires approximately 10 hours.
Here are some key pieces of code:
load model:

```python
...
model = AutoModelForCausalLM.from_pretrained(model_weights).to("cuda")
model.config.use_cache = False  # for xlora
if lora_weights:
    model = PeftModel.from_pretrained(model, lora_weights)
...
```
inference:

```python
model.eval()
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].to('cuda')
generation_config = GenerationConfig(
    temperature=args['temperature'],
    top_p=args['top_p'],
    top_k=args['top_k'],
    num_beams=args['num_beams'],
    max_new_tokens=args['max_new_tokens'],
    do_sample=True,
)
generation_output = model.generate(
    input_ids=input_ids,
    pad_token_id=tokenizer.eos_token_id,
    generation_config=generation_config,
    return_dict_in_generate=True,
    output_scores=True,
    do_sample=True,
)
s = generation_output.sequences[0]
output = tokenizer.decode(s[input_ids.shape[1]:], skip_special_tokens=True).strip()
```
Thanks for providing more details.
When using X-LoRA with four experts, processing 2000 samples (each averaging 500 in length) requires approximately 10 hours.
This number alone does not really tell us much. How long does it take to process the samples with the individual LoRAs separately? You mentioned it's "significantly slower".
A certain overhead could be expected to run the classifier part, but normally that shouldn't be too much. Could you also share your X-LoRA config?
When I use a single LoRA, it takes only about 5 hours. Here is the X-LoRA part of my training config:

```
saved_lora_adapters: lora_a,lora_b,lora_c,lora_d
use_trainable_adapters: true
softmax_temperature: 0.1
```

and the other settings are the defaults.
So in your case, running X-LoRA with 4 adapters is roughly twice as slow as running LoRA with a single adapter, right? That looks indeed a bit excessive, as most of the time should be spent on the base model, which is the same in each case.
To test this further, I ran a small benchmark to check the overhead of X-LoRA. This benchmark uses a rather small model, bloomz-1b1, as the X-LoRA overhead should be expected to be larger the smaller the base model is. Cache was deactivated. For inference, I found this:
So there is indeed quite a significant overhead, at least for this "small" model. Not sure if @EricLBuehler has some insights into what the reasons could be.
I attached the notebook with the measurements. Download and change "txt" to "ipynb" to run it.
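For reference, a minimal timing sketch in the spirit of that benchmark; the prompt, token counts, and checkpoint path are placeholders, and the attached notebook may differ:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

model_id = "bigscience/bloomz-1b1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to("cuda")
model.config.use_cache = False  # cache deactivated, as in the benchmark above

# Load either a single-LoRA or an X-LoRA checkpoint here; the timing loop stays the same.
# model = PeftModel.from_pretrained(model, "path/to/adapter-checkpoint")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")

model.generate(**inputs, max_new_tokens=8)  # warm-up
torch.cuda.synchronize()
start = time.perf_counter()
model.generate(**inputs, max_new_tokens=64)
torch.cuda.synchronize()
print(f"64 new tokens took {time.perf_counter() - start:.2f} s")
```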
Thanks for your quick reply! I have executed the Jupyter notebook you provided and reviewed the source code of xlora. I noticed that xlora performs two forward passes in the pre_forward() function, which should increase inference time by at least 2 times compared to a normal LoRA model. However, this does not clarify why your results show that the time taken by xlora is only slightly more than LoRA. Could you assist me with this matter once more?
I noticed that xlora performs two forward passes in the pre_forward() function, which should increase inference time by at least 2 times compared to a normal LoRA model.
True, this is most likely the reason for the slowdown. I don't think this can be avoided based on the workings of X-LoRA. As mentioned, there is also some overhead for the generation of the X-LoRA scalings, but that should be relatively small by comparison.
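Schematically, the two-pass structure looks roughly like the toy module below; this is an illustration of the scheme, not the actual xlora implementation, and all classes and shapes are made up:

```python
import torch
import torch.nn as nn


class ToyXLoraStep(nn.Module):
    """Toy illustration of the two-pass X-LoRA scheme (not the real implementation)."""

    def __init__(self, base: nn.Module, classifier: nn.Module):
        super().__init__()
        self.base = base              # stands in for the frozen base model
        self.classifier = classifier  # predicts per-token adapter scalings

    def forward(self, x: torch.Tensor):
        # Pass 1: run the base model once just to get hidden states for the classifier.
        with torch.no_grad():
            hidden = self.base(x)

        # Predict one scaling per adapter and token from those hidden states.
        scalings = torch.softmax(self.classifier(hidden), dim=-1)

        # Pass 2: run the model again; in X-LoRA the LoRA outputs inside this pass
        # are weighted by `scalings`, which is why every step costs ~2 forwards.
        out = self.base(x)            # placeholder for the scaled-adapter pass
        return out, scalings


# Tiny usage example with linear layers standing in for the real model parts.
step = ToyXLoraStep(nn.Linear(16, 16), nn.Linear(16, 4))
tokens = torch.randn(2, 8, 16)        # (batch, seq, hidden)
out, scalings = step(tokens)
print(out.shape, scalings.shape)      # torch.Size([2, 8, 16]) torch.Size([2, 8, 4])
```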
However, this does not clarify why your results show that the time taken by xlora is only slightly more than LoRA.
In the results I reported above, X-LoRA is 6 times slower than just LoRA. Did you observe something else?
Hello all, and thank you for your great work!
Earlier this week, we announced X-LoRA, a flexible MoE approach for LoRA adapters. We implement deep layer- and token-wise scalings for multiple LoRA adapters and provide an implementation (https://github.com/EricLBuehler/xlora) that enables straightforward application to any model to which peft LoRA adapters may be applied. This offers the possibility to orchestrate at a much finer level, that is, to achieve new combinations of adapter layers, resulting in never-before-seen per-token, deep, layer-wise combinations of parameters to solve specific tasks. Sample weights are provided at https://huggingface.co/lamm-mit/x-lora, with examples in protein science in the paper.

Would you be interested in perhaps integrating X-LoRA into peft? I would be happy to work on this if there is interest from you and the community.
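As a rough illustration of what "layer- and token-wise scalings" means for a single layer, here is a toy sketch; it is not the actual X-LoRA code, and all shapes and weights are made up:

```python
import torch

# Toy setup: one linear layer with two LoRA adapters (rank 4) and per-token scalings.
batch, seq, d_in, d_out, rank, n_adapters = 2, 8, 32, 32, 4, 2
x = torch.randn(batch, seq, d_in)
W = torch.randn(d_out, d_in)             # frozen base weight
A = torch.randn(n_adapters, rank, d_in)  # LoRA A matrices, one per adapter
B = torch.randn(n_adapters, d_out, rank) # LoRA B matrices, one per adapter

# X-LoRA's classifier would predict these per token and layer; random here.
scalings = torch.softmax(torch.randn(batch, seq, n_adapters), dim=-1)

base_out = x @ W.T
# Per-adapter LoRA deltas, shape (n_adapters, batch, seq, d_out).
deltas = torch.einsum("bsi,nri,nor->nbso", x, A, B)
# Weight each adapter's delta per token, then sum over adapters.
mixed = torch.einsum("bsn,nbso->bso", scalings, deltas)
out = base_out + mixed
print(out.shape)  # torch.Size([2, 8, 32])
```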