huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Load qlora finetuned model using TGI optimized architecure on Sagemaker #813

Closed zkdtc closed 9 months ago

zkdtc commented 1 year ago

Feature request

Enable TGI to load a QLoRA fine-tuned model with the optimized architecture on SageMaker. Right now the optimized architecture is active only for certain models on the list. It would be great if this feature were enabled for all models.

Motivation

The goal is the same as above: let TGI load a QLoRA fine-tuned model with the optimized architecture on SageMaker, since the optimized path is currently active only for certain models on the list.

I have tried several ways to do this without success: 1) using model_data with an S3 path to load the merged model, and 2) uploading the merged model to the Hugging Face Hub and pointing TGI at that MODEL_ID. TGI can load the model, but it does not apply optimizations like quantization and tensor parallelism, which we need for fast inference.
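For reference, here is a minimal sketch of the two attempts described above, assuming the Hugging Face LLM inference container and the sagemaker Python SDK; the repo id, S3 path, and container version are placeholders, not verified values.

```python
# Sketch of the two attempts described in this issue. Assumptions: the Hugging
# Face LLM inference container and the sagemaker Python SDK; repo id, S3 path,
# and container version are placeholders.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()
image_uri = get_huggingface_llm_image_uri("huggingface", version="0.9.3")

# Attempt 1: load the merged model from S3 via model_data.
model_from_s3 = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    model_data="s3://my-bucket/falcon-40b-qlora-merged/model.tar.gz",  # placeholder
    env={"SM_NUM_GPUS": "4"},  # shard across the instance's GPUs
)

# Attempt 2: upload the merged model to the Hub and point TGI at its MODEL_ID.
model_from_hub = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "my-org/falcon-40b-qlora-merged",  # placeholder repo id
        "SM_NUM_GPUS": "4",
    },
)

# Deploy one of them (model_from_s3.deploy(...) works the same way).
predictor = model_from_hub.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
)
```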

Your contribution

Making this proposal.

Narsil commented 1 year ago

Hi

We cannot possibly enable all features for all models (it's just too much work; new models pop up every day). Which model do you have in mind?

Models using ALiBi are also a no-go because Flash Attention doesn't implement the custom masking that ALiBi needs, unless we implement such a flash version ourselves.

LoRA/PEFT models are now loadable in the latest release (not yet on SageMaker).

zkdtc commented 1 year ago

Thanks for the reply @Narsil. I am fine-tuning Falcon-40B. My hunch is that since the merged fine-tuned version has the same model structure as the original, all the optimizations should naturally be supported. It is a pity that it is not supported just because the model name is not on the accepted list.

Do you have a timeline for supporting LoRA/PEFT in SageMaker endpoints? In my naive mental model, SageMaker just puts your Docker image into a container. Do you mean the SageMaker people have not included your new version yet? If that is the case, this is something I could try to push :-P

Narsil commented 1 year ago

It is not supported just because the model name is not on the accepted list.

This is not true: if the model_type in the config is supported, all fine-tunes work.
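As a quick sanity check (a sketch, not part of TGI), you can inspect the model_type field of the fine-tuned checkpoint's config.json, since that is the field the comment above says determines support; the path below is a placeholder.

```python
# Sketch: a merged fine-tune keeps the optimized code path as long as the
# model_type in its config.json names a supported architecture.
import json

with open("falcon-40b-qlora-merged/config.json") as f:  # placeholder path
    config = json.load(f)

print(config["model_type"])  # should name a supported architecture, e.g. "falcon"
```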

Do you mean the SageMaker people have not included your new version yet? If that is the case, this is something I could try to push :-P

Sure. It takes a few weeks every time we push new versions to SageMaker. They are coming, just lagging behind.

zkdtc commented 1 year ago

Good to know that model_type can do the job! Thank you so much! Looking forward to the new image!

zkdtc commented 1 year ago

@Narsil How can I set up the config file so that the model is correctly loaded in 8 bits/4 bits with optimization enabled? I now hit an out-of-memory error after loading the fine-tuned model on an ml.g5.12xlarge instance, which I never encountered with the original model. It seems the quantization is not performed correctly in my case. I use the same Falcon-40B-instruct config.json file for my fine-tuned model.

zkdtc commented 1 year ago
(Screenshot of the error attached: "Screenshot 2023-08-14 at 2 44 03 PM")
Narsil commented 1 year ago

What version were you using before?

Since 0.9.3 (or 0.9.4), we try at warmup to allocate for the MAXIMUM possible throughput you have set up (or that is present by default).

Therefore we crash early with a warmup error, which seems to be what is happening here. You should adjust --max-batch-total-tokens, --max-input-length, and so on to fix it.

This is a bit more painful when loading the model, but it should prevent any crash during runtime.

1.0.1 should ease the parameter fiddling a bit, as we take care of some of them semi-automatically (we look only at --max-input-length and --max-total-tokens and figure out the maximum batch on our own).
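On SageMaker these launcher flags are usually supplied as container environment variables; below is a hedged sketch with purely illustrative numbers, assuming the variable names map one-to-one onto the flags mentioned above.

```python
# Illustrative only: the token budgets are placeholders, not recommendations.
# Assumption: these environment variables correspond to the launcher flags
# discussed above (--max-input-length, --max-total-tokens, --max-batch-total-tokens).
env = {
    "HF_MODEL_ID": "my-org/falcon-40b-qlora-merged",  # placeholder
    "SM_NUM_GPUS": "4",                # shard across the 4 GPUs of a g5.12xlarge
    "MAX_INPUT_LENGTH": "1024",        # longest allowed prompt, in tokens
    "MAX_TOTAL_TOKENS": "2048",        # prompt + generated tokens per request
    "MAX_BATCH_TOTAL_TOKENS": "8192",  # budget the warmup tries to reserve up front
}
```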

zkdtc commented 1 year ago

@Narsil The memory problem is solved after I set quantization to 4 bits using 'HF_MODEL_QUANTIZE': 'bitsandbytes'. It seems the model was loaded in 8 bits by default instead of 4 bits. A new problem arises: inference is much slower than with the original model, see the plot below. Is there an optimization I am still missing? Maybe some magic keywords in the env settings?

(Screenshot of the inference-speed comparison plot attached: "Screenshot 2023-08-16 at 1 31 29 PM")
zkdtc commented 1 year ago

What version were you using before?

Was using 0.9.3

sandys commented 9 months ago

@Narsil How can I set up the config file so that the model is correctly loaded in 8 bits/4 bits with optimization enabled? I now hit an out-of-memory error after loading the fine-tuned model on an ml.g5.12xlarge instance, which I never encountered with the original model. It seems the quantization is not performed correctly in my case. I use the same Falcon-40B-instruct config.json file for my fine-tuned model.

Hi @zkdtc, did you figure this out? We are hitting the exact same problem. Do you have any idea how to fix it? We are fine-tuning Llama 2, though.

Narsil commented 9 months ago

Bitsandbytes is just slow. Use AWQ for the best latency (4-bit) and EETQ for 8-bit.

GPTQ has received exllamav2 support recently (available in the upcoming 1.2), which should speed it up too.
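For context, a hedged summary of how these recommendations map onto TGI's --quantize option (exposed as HF_MODEL_QUANTIZE in the SageMaker container); AWQ and GPTQ assume a checkpoint already quantized with those methods.

```python
# Sketch: quantization choices mentioned above and how they differ. AWQ and GPTQ
# expect pre-quantized checkpoints; EETQ and bitsandbytes quantize on the fly.
quantize_options = {
    "awq": "4-bit, best latency; needs an AWQ-quantized checkpoint",
    "eetq": "8-bit, fast on-the-fly quantization",
    "gptq": "4-bit; exllamav2 kernels arrive in TGI 1.2",
    "bitsandbytes": "on-the-fly, simple but slow for generation",
}

env = {"HF_MODEL_QUANTIZE": "awq"}  # pick one of the keys above
```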

Narsil commented 9 months ago

Closing, as the original QLoRA issue should be solved.

pradeepdev-1995 commented 7 months ago

@Narsil is it similar to https://github.com/huggingface/text-generation-inference/issues/1457?

Narsil commented 7 months ago

I don't think so.