hitz-zentroa / GoLLIE

Guideline following Large Language Model for Information Extraction
https://hitz-zentroa.github.io/GoLLIE/
Apache License 2.0

swap model to llama3/gemma/mistral #17

Closed edchengg closed 5 months ago

edchengg commented 5 months ago

If I want to change the base model to something else for fine-tuning, what should I be aware of and what would I need to modify? I see the codebase has custom Flash Attention modeling code for LLaMA 2. I'm curious whether it's possible to just load models from Hugging Face, or whether I still need to add the Flash Attention code manually?

Thanks a lot!

ikergarcia1996 commented 5 months ago

Hi @edchengg!

This codebase was developed before Flash Attention was integrated into Hugging Face Transformers, which is why we used a custom implementation.

If you plan to fine-tune a model, we have an improved codebase. Please use the dev branch of this fork: https://github.com/ikergarcia1996/GoLLIE/tree/dev

This codebase uses the Hugging Face integration of Flash Attention and supports MoE models, along with other small improvements. The model configuration and the way the library works are the same; we just made some changes to the load_model function and the trainer. You can replace CODE-LLAMA with any other model and it should work straightforwardly. The repo includes configuration examples that use other models.
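
Just to illustrate what the Hugging Face Flash Attention integration looks like under the hood (this is a minimal Transformers sketch, not our actual load_model code; the model id below is only a placeholder):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example only: any causal LM from the Hub can be used here.
model_name = "mistralai/Mistral-7B-v0.1"  # placeholder model id

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,               # load weights in bf16
    attn_implementation="flash_attention_2",  # built-in Flash Attention 2 backend
    device_map="auto",
)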

We are planning to release GoLLIE2 in the future using the new codebase and more data. However, both Oscar and I are currently writing our PhD theses, so it will need to wait a few more months.

If you find any issue with the updated codebase, let me know and I will try to help you :D

git clone https://github.com/ikergarcia1996/GoLLIE.git
cd GoLLIE
git checkout dev
edchengg commented 5 months ago

This is fantastic! Thanks a lot, I will try the branch today. By the way, are you aware of any easy inference speed-up tricks? It takes a lot of time to run evaluation on large datasets, so I was wondering if there is anything simple I can add on top of the Hugging Face code to speed it up. Congrats and good luck with your PhD theses :D

ikergarcia1996 commented 5 months ago

If you are using LoRA, merge the adapters before evaluation. This is done automatically if you don't use 4-bit / 8-bit quantization. If you do use quantization, you can set the parameter merge_lora_before_inference: true in your config, or merge the adapters manually. In our experiments, the merged model is up to 10 times faster during inference.
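
For reference, the manual merge with PEFT looks roughly like this (a sketch with placeholder paths, not our exact code):

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Placeholder paths: point these at your base model and trained LoRA adapter.
base = AutoModelForCausalLM.from_pretrained("path/to/base_model", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "path/to/lora_adapter")

# Fold the LoRA weights into the base model so inference runs without adapter overhead.
merged = model.merge_and_unload()
merged.save_pretrained("path/to/merged_model")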

If you are already running inference with a merged model, you can try an inference-optimized engine such as vLLM (https://docs.vllm.ai/en/latest/) or TGI (https://huggingface.co/docs/text-generation-inference/index). You can store the generations in a file and run our evaluate.py script on them afterwards.
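
As a rough sketch of the vLLM route (the model path and sampling settings below are placeholders), you would run generation like this, dump the outputs to a file, and score them with evaluate.py afterwards:

from vllm import LLM, SamplingParams

# Placeholder path: the merged model directory from the previous step.
llm = LLM(model="path/to/merged_model", dtype="bfloat16")
params = SamplingParams(temperature=0.0, max_tokens=256)  # greedy decoding; adjust as needed

prompts = ["..."]  # your rendered GoLLIE prompts
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)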

edchengg commented 5 months ago

Thanks! The new branch works perfectly, and the 'merge_lora_before_evaluation' option works as well.

edchengg commented 5 months ago

Hi @ikergarcia1996 ,

When I try to run inference with Mistral, I hit a very strange error inside Flash Attention.

File "/anaconda3/envs/***/lib/python3.9/site-packages/flash_attn/flash_attn_interface.py", line 80, in _flash_attn_varlen_forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(
RuntimeError: query and key must have the same dtype

I printed the dtypes of query and key: torch.bfloat16 and torch.float32, so the key is in float32... It looks like the KV cache module was initialized with float32 data, so after the KV cache update the key and value tensors change from bf16 to float32...

I checked other models and they all seem to hit the same error. Do you have any idea what might be causing this? Thanks a lot.

After a long debugging session, I finally narrowed down the root cause: if I remove the bf16 argument from the config file, inference runs, but it is now extremely slow.

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
  0%|          | 0/25 [00:00<?, ?it/s]
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
  8%|████▍     | 2/25 [01:27<16:43, 43.64s/it]
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 12%|██████▌   | 3/25 [02:54<22:39, 61.81s/it]
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

ikergarcia1996 commented 5 months ago

Hi @edchengg!

Yes, that is an error I have encountered multiple times. It doesn't occur when using DeepSpeed or bitsandbytes quantization. I believe it's a problem with Hugging Face, unrelated to our codebase; there are numerous similar issues reported for different models. The easiest fixes are disabling Flash Attention or enabling DeepSpeed. Our code simply loads the model with dtype=torch.bfloat16 and uses the Hugging Face Trainer to run inference; we don't handle the inference ourselves, everything happens inside the Trainer.

In the code within this repository, we resolved it by manually casting all the values to the correct data type. However, in the updated branch, since we no longer use a custom Flash Attention implementation and instead rely on the Hugging Face code, we cannot apply that manual fix.
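
For anyone still on the old codebase, the workaround there boils down to a cast along these lines (a simplified, hypothetical helper, not the exact patch):

import torch

def match_qkv_dtypes(
    query_states: torch.Tensor,
    key_states: torch.Tensor,
    value_states: torch.Tensor,
):
    # Flash Attention requires query, key, and value to share one dtype.
    # After a KV-cache update the cached key/value can come back as float32,
    # so cast them to the query dtype (e.g. torch.bfloat16) before the kernel call.
    target = query_states.dtype
    if key_states.dtype != target:
        key_states = key_states.to(target)
    if value_states.dtype != target:
        value_states = value_states.to(target)
    return query_states, key_states, value_states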

edchengg commented 5 months ago

Hi @ikergarcia1996, good to know! I agree that this looks like a Hugging Face problem.

I will revert to the CODE-LLAMA codebase then.