microsoft / LMOps

General technology for enabling AI capabilities w/ LLMs and MLLMs
https://aka.ms/GeneralAI
MIT License
3.6k stars 274 forks

AdaptLLM models with Llama Index #148

Closed mirix closed 7 months ago

mirix commented 9 months ago

Hello,

I am trying to use AdaptLLM/finance-LLM along with Retrieval-Augmented Generation (RAG) through Llama Index.

However, I am not able to make the template work. I have tried many things, the closest to a working solution being:

system_prompt = 'Please, check if the answer can be inferred from the pieces of context provided. If the answer cannot be inferred from the context, just state that the question is out of scope and do not provide any answer.'

query_wrapper_prompt = PromptTemplate(
    '<s>[INST] <<SYS>>\n' + system_prompt + '\n<</SYS>>\n{query_str} [/INST]\n'
)
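
For context, here is a rough sketch of how the template above is wired into Llama Index on my side (import paths differ between llama_index versions, and the model settings shown here are only illustrative):

# Sketch: plugging the template above into Llama Index
# (llama_index >= 0.10 import path; older versions expose HuggingFaceLLM under llama_index.llms)
from llama_index.llms.huggingface import HuggingFaceLLM

llm = HuggingFaceLLM(
    model_name='AdaptLLM/finance-LLM',
    tokenizer_name='AdaptLLM/finance-LLM',
    # the system prompt is already embedded in the wrapper template above
    query_wrapper_prompt=query_wrapper_prompt,
    context_window=4096,
    max_new_tokens=512,
    generate_kwargs={'do_sample': False},
    device_map='auto',
)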

Every space and new line seems to be crucial; any small difference changes everything.

With the proposed template sometimes the only output is [/INST] repeated a number of times.

Sometimes I obtain a consistent answer preceded by [/INST] but with multiple repetitions of the same couple of sentences.

Sometimes the output seems to describe the internal workings of Llama Index.

Other templates and formatting produce even worse results. For instance, in some cases the system prompt is ignored and the model answers on the basis of prior knowledge.

Has anyone managed to make this work?

Ed

cdxeve commented 9 months ago

Hi, thanks for your feedback🤗! The prompt template that uses a system prompt and "[/INST]" is specifically designed for the chat model.

We highly recommend switching from 'AdaptLLM/finance-LLM' to 'AdaptLLM/finance-chat' for improved response quality.

Regarding your use-case, here's an example using the recommended 'finance-chat' model:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("AdaptLLM/finance-chat")
tokenizer = AutoTokenizer.from_pretrained("AdaptLLM/finance-chat", use_fast=False)

# Put your query here
query_str = 'xxx'

your_system_prompt = 'Please, check if the answer can be inferred from the pieces of context provided. If the answer cannot be inferred from the context, just state that the question is out of scope and do not provide any answer.'

# Integrate your system prompt into the input instruction, placed after our default system prompt.
query_prompt = f"<s>[INST] <<SYS>>\nYou are a helpful, respectful, and honest assistant. Always answer as helpfully as possible, while being safe. Your responses should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n<</SYS>>\n\n{your_system_prompt}\n{query_str} [/INST]"

# NOTE: another option is to skip our default system prompt and start directly from yours:
# query_prompt = f"<s>[INST] {your_system_prompt}\n{query_str} [/INST]"

# add_special_tokens=False because the prompt string above already contains <s>
inputs = tokenizer(query_prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(model.device)
outputs = model.generate(input_ids=inputs, max_length=4096)[0]

# Decode only the newly generated tokens, skipping the prompt
answer_start = int(inputs.shape[-1])
pred = tokenizer.decode(outputs[answer_start:], skip_special_tokens=True)

print(f'### User Query:\n{query_str}\n\n### Assistant Output:\n{pred}')

Feel free to let us know if you have any more questions🤗.

mirix commented 8 months ago

Hi,

I am trying 'AdaptLLM/finance-chat' as suggested and it seems to work fine.

However, the generation configuration does not seem to be taken into account.

First, with transformers 4.36.2, I receive the following warning twice:

/home/emoman/.local/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:389: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. This was detected when initializing the generation config instance, which means the corresponding file may hold incorrect parameterization and should be fixed.

It seems that the generation kwargs from the script are completely ignored and that they are read directly from 'generation_config.json'.

So, if I alter that file to:

{
    "_from_model_config": true,
    "bos_token_id": 1,
    "eos_token_id": 2,
    "pad_token_id": 32000,
    "do_sample": true,
    "temperature": 0.0000001,
    "top_p": 0.0000001,
    "top_p": 1,
    "repetition_penalty": 0.1,
    "transformers_version": "4.31.0.dev0"
}

The warnings disappear, but the model keeps repeating itself, which would seem to indicate that 'repetition_penalty' is being ignored.

Some people suggest setting '_from_model_config' to false, but it does not change anything.
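
For reference, the generation kwargs I am setting in the script look roughly like this (values are illustrative; they are the standard transformers generate() arguments):

# Illustrative call-time generation kwargs; these are the ones that appear to be ignored
outputs = model.generate(
    input_ids=inputs,
    max_length=4096,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,
)[0]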

cdxeve commented 8 months ago

Hi, thanks for the feedback. I think we can resolve this warning by unsetting temperature and top_p.

Remove temperature and top_p from the generation_config.json file, making it look like this:

{
    "_from_model_config": true,
    "bos_token_id": 1,
    "eos_token_id": 2,
    "pad_token_id": 32000,
    "transformers_version": "4.31.0.dev0"
}

I've tested this with transformers version 4.36.2, and it works fine now.
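
If helpful, you can also check which values were actually picked up from the file by printing the model's generation config after loading it:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("AdaptLLM/finance-chat")

# Prints the values loaded from generation_config.json
print(model.generation_config)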

mirix commented 8 months ago

Yes, thank you. It works.

But, generally speaking, I believe that the generation kwargs explicitly set in the script should override the default configuration file.

It would seem that, if they don't, transformers has switched to some sort of legacy mode.

Finally, the model does respond to repetition_penalty and other generation parameters.

But it is extremely capricious and I haven't found a way to consistently avoid repetition other than post-processing.

It is a pity because the model seems very good for my purposes.

I believe that this volatility may be intrinsic to vanilla Llama-2 and not a consequence of the "reading comprehension" adaptation.

That being the case, perhaps the best solution would be to replace vanilla Llama with something better stabilised, such as Mistral. Tulu also shows very steady behaviour.

cdxeve commented 8 months ago

Hi,

Thanks for your recommendation to switch our base models to Mistral and Tulu. Mistral is indeed in our future plans.

Regarding this issue:

But, generally speaking, I believe that the generation kwargs explicitly set in the script should override the default configuration file

I completely agree that "generation kwargs explicitly set in the script should override the default configuration file".

But there might be some conflicts in your config settings.

{
    "_from_model_config": true,
    "bos_token_id": 1,
    "eos_token_id": 2,
    "pad_token_id": 32000,
    "do_sample": true,
    "temperature": 0.0000001,
    "top_p": 0.0000001,
    "top_p": 1,
    "repetition_penalty": 0.1,
    "transformers_version": "4.31.0.dev0"
}

According to the official documentation: https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/text_generation#generation

Firstly, setting the temperature to an extremely small value near 0 (0.0000001) creates a highly concentrated token distribution, behaving similarly to "do_sample"=false. This contradicts your setting of "do_sample": true.

Secondly, there are conflicting values for top_p in your configuration.

Then, the repetition_penalty value of 0.1 would make the problem even worse, since values below 1 actually encourage repetition; a value higher than 1, such as 1.2, is recommended to reduce repetition.

The simplest setting for your config is the following; you may refer to the official documentation for your specific use case:

{
    "_from_model_config": true,
    "bos_token_id": 1,
    "eos_token_id": 2,
    "pad_token_id": 32000,
    "repetition_penalty": 1.2,
    "transformers_version": "4.31.0.dev0"
}
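
Alternatively, instead of editing generation_config.json, you can pass the parameters directly to generate(); arguments set at call time should take precedence over the file (a minimal sketch with illustrative values):

# Sketch: greedy decoding with a repetition penalty set at call time
outputs = model.generate(
    input_ids=inputs,
    max_length=4096,
    do_sample=False,
    repetition_penalty=1.2,
)[0]
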
mirix commented 8 months ago

Thanks for the advice, I will try that immediately.

But that configuration has been tested with many models. I have also tried going with the defaults and many other combinations.