johnsmith0031 / alpaca_lora_4bit


Multiadapter PEFT? #60

Open · gururise opened this issue 1 year ago

gururise commented 1 year ago

Any chance of getting multi-adapter branch of PEFT integrated? https://github.com/huggingface/peft/tree/smangrul/multi-lora-support
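
(For context, the multi-adapter support from that branch appears to have since been merged into mainline PEFT as named adapters. A minimal sketch of the workflow it enables, assuming the load_adapter / set_adapter API and using placeholder paths:)

import torch
from transformers import LlamaForCausalLM
from peft import PeftModel

# Placeholder paths; substitute your own base model and LoRA checkpoints.
base_model = LlamaForCausalLM.from_pretrained("base-model-path", torch_dtype=torch.float16, device_map={'': 0})

# Load the first LoRA under a name, then attach a second adapter to the same wrapper.
model = PeftModel.from_pretrained(base_model, "lora-a-path", adapter_name="lora_a")
model.load_adapter("lora-b-path", adapter_name="lora_b")

# Switch between adapters without reloading the base weights.
model.set_adapter("lora_a")
# ... generate with lora_a ...
model.set_adapter("lora_b")
# ... generate with lora_b ...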

Ph0rk0z commented 1 year ago

Also, will there be any effect from the transformers LLaMA tokenizer changes?

turboderp commented 1 year ago

I haven't gotten around to looking into it too much, but as far as I can tell the LoraModel class is just a wrapper that installs some hooks into the LlamaModel. If you call disable_adapter_layers() you can use the original model as before, and enable_adapter_layers() switches the adapter back on.

So if I'm not mistaken, you should be able to switch back and forth like this:

import torch
from peft import PeftModel

# Wrap the same base model with each LoRA and keep the adapter layers
# disabled until they are actually needed.
model1 = PeftModel.from_pretrained(base_model, lora1_path, device_map={'': 0}, torch_dtype=torch.float32)
model1.eval()
model1.disable_adapter_layers()

model2 = PeftModel.from_pretrained(base_model, lora2_path, device_map={'': 0}, torch_dtype=torch.float32)
model2.eval()
model2.disable_adapter_layers()

# Use model 1:

model1.enable_adapter_layers()
model1.generate(...)
model1.disable_adapter_layers()

# Use model 2:

model2.enable_adapter_layers()
model2.generate(...)
model2.disable_adapter_layers()

# Use the original model:

base_model.generate(...)

I haven't had time to train more adapters to test it with, though, so there might be some catch. But in theory it should work.

tensiondriven commented 1 year ago

Wow, this is great. From the code you provided, it looks like this allows loading/unloading LoRAs against a model that remains in memory. What I'm looking for is the ability to apply multiple LoRAs concurrently and then write the resulting model out to a checkpoint or safetensors file, something I can put into production without incurring the VRAM overhead of loading LoRA(s) at runtime.

I recently learned that adding a LoRA to llama-30b will blow past the 24 GB VRAM limit with full context. I am hoping 🤞 that loading a "flat" model with the LoRAs baked in will allow for full-context inference while remaining inside the 24 GB limit.

If anyone knows better than me whether this is feasible, either on the inference side or the training side, please let me know. Any code samples would be great, too; Python is not my first language.

I imagine this taking the form of a separate Python file, which would take a base model and either a directory containing LoRAs or (probably better, since I think order matters) a list of paths to LoRAs to merge.
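
For what it's worth, here is a rough sketch of what such a script might look like with stock PEFT, assuming merge_and_unload() is available for LoRA models and that merging the adapters one at a time, in list order, is acceptable; all paths are placeholders:

import torch
from transformers import LlamaForCausalLM
from peft import PeftModel

base = LlamaForCausalLM.from_pretrained("base-model-path", torch_dtype=torch.float16)

# Fold each LoRA into the base weights before loading the next one,
# so no adapter layers remain at inference time.
for lora_path in ["lora-1-path", "lora-2-path"]:
    peft_model = PeftModel.from_pretrained(base, lora_path)
    base = peft_model.merge_and_unload()

# Write the flattened model out as safetensors for production use.
base.save_pretrained("merged-model-path", safe_serialization=True)

Because the LoRA deltas are baked into the weights here, inference memory should match the plain base model.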

As of right now, I don't know of any projects/repos that support merging multiple LoRAs into a composite/flat model. Text-generation-webui supports loading multiple LoRAs for inference, but as far as I know there's no facility for "rendering" them down, and certainly not in a programmatic way. I may open a separate ticket for this request, as I think a lot of others would benefit from it.