How to use a prompt for text analysis?

rsoika commented 2 months ago

I am still confused how to use the prompt within Mistral-7b-instruct if I want to analyze the content of a text, such as summary or categorization of the context. So in my prompt I have a text and an instruction. I want to extract information form the given text. And it is not this often discussed chat-use-case.

I am currently using the following Prompt Template for my approach:

<s>[INST] My Instruction............
[/INST]
My Text.......
................

Is this a correct way to build a prompt for text analyses with Mistral-7b-instruct? Or should I separate the context and the instruction in some other way? Note, I also do not use the EOS here. Is this a problem?

I ask this because the results vary if I put the context before or after the instruction and I did not find a guide line for such a scenario.

I am also referring to this official documentation page which is hard to understand:

Note that <s> and </s> are special tokens for beginning of string (BOS) and end of string (EOS) while [INST] and [/INST] are regular strings.

What does this sentence mean? What is the difference between a string and a regular string?

Thanks for any tips.

pandora-s-git commented 2 months ago

Hi, okay so it seems there was a misunderstanding on how the prompt works. The Chat Template you are using is supposed to work like this:

<s>[INST] My Instruction.... My Text.... Or other details... [/INST]

To better understand it, see it like this: [INST] => User: and [/INST] => Assistant:. So its equivalent of this:

<s>User: My Instruction.... My Text.... Or other details... Assistant:

I hope this answers your doupts !

rsoika commented 2 months ago

@pandora-s-git Thanks for your answer!

But still I do not fully understand the usage of [INST] and <s>.

I want to explain what I do so far. My goal is to analyze and extract content from business documents. And Mistral-7B-Instruct did a really good job so far!

My First Prompt template looks like this:

<s>
[INST] You are a clerk in a logistics company and your job is to check incoming invoices. [/INST]       

<FILECONTEXT>

</s>
[INST] Briefly summarize the data in this invoice document:

- Company
- Payment data (with invoice amount, bank details, payment term)
- Invoice items (with name, quantity, price and tax)

Don't calculate any amounts yourself, but only accept amounts that actually appear in the invoice document!

[/INST]

And <FILECONTEXT> is the place where I insert the ocr-extracted text from an invoice. This text looks terrible contains new lines, tabs and spaces everywhere. But the result generated by Mistral-7B-Instruct is clear and well structured:

1. Company Data:
- Invoice is issued by CXA CGM, a French logistics company with VAT number FRXXXXX.
- The invoice is for AlexXXXX GmbH in Germany, with VAT number DExxxxx.
- The services were provided by Simone xxxxx in Germany.

2. Payment Data:
- Total amount due: 82.50 EUR
- Payment should be made to XX & YYY BV, with IBAN DExxxxxxx and SWIFT code DEUTxxx.
- Payments should be sent to the account of Deutsche Bank xxxx, Germany.

3. Service Items:
- There are two service items listed on the invoice, each with a container number, size/type, billable days, and tax information.
- Container TCxxxx (20ST) had 1 calendar day of billable time and a rate of 41.25 EUR per day, resulting in an invoiced amount of 41.25 EUR before tax.
- Container TRyyyyy (20ST) had 1 calendar day of billable time and a rate of 41.25 EUR per day, resulting in an invoiced amount of 41.25 EUR before tax.
- The total excluding tax for both containers is 82.50 EUR, and the total including tax is also 82.50 EUR.
- There is no VAT charge listed on the invoice.

Would you say that the placement of <s> and [INST] tags makes sense in my prompt? It was all guessing as I do not understand the difference between a string and a regular sting the documentation mentioned.

I then generate much more complex prompts from the first prompt result using much more additional information and I useing few-shot learning. But this is hard to construct when you do not understand how mistral-7b treats the different kind of prompt tokens. For example: is it allowed to use <s> multipel times in one prompt?

pandora-s-git commented 2 months ago

This is close, by string it's like a full completion.

Let me try to rexplain what everything means.

As you may know, LLMs are Text Completion Machines, so they only complete text.

We fine tunned them to complete dialogs so we can chat with them, these are the Instruct versions.

Basically we need to train them with something similar to a chat like:

User: User Question.
Assistant: Assistant Response.
...

But to make the key words more unique, we rennamed them with specific strings, here [INST] defines the users instruction, and [/INST] defines the end of the instruction, and so the assistants response.

However, the model needs a token that allows us to know when its finished, cause if not it will just continue trying to complete the text, thats the EOS token (End Of String), and we also have one for the beginning (BOS).

Here the BOS is <s> and EOS </s>, basically a string in this context is a full completion/exchange with the assistant until it considered the completion finished.

I think its better for ur use case to have something like this:

<s>[INST] Here is a text:
{Your document}

And then the explantion+Instruction
 [/INST]

The model will then complete and finish with a EOS (if it doesnt maybe by default ur code removes it).

Basically if you have a dialog with an LLM it should look like this:

<s>[INST] Hi [/INST] Hi, how can I help you?</s>[INST] Tell me a joke [/INST] (and the model continues here)

There are chat templates you can use directly, Its better to directly use the jinja template that its available in the tokenizer for example.

( Check the repo 'mistral-common', should help you )

rsoika commented 2 months ago

Ok, Thanks a lot! This helps me really to understand mistral-7b-instruct much better. And now I beginn to make progress :-) You explanation should become part of the official documentation of the chat-template

pandora-s-git commented 2 months ago

I'm glad you liked it, here is a chat template that may be of use for u:

"{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}"

If you dont know how to understand this, check how to use a jinja template and you should find some clues.

rsoika commented 1 month ago

I just want to share my final prompt templates that I use now to summarize the content of complex business documents.

Categorize Document

First I use a template to just categorize the content in a company name and a language

<s>[INST] You are a clerk in a ... company and you job is to check business documents. Extract the language the document is written in and the company name of the sender.

Output the information in a JSON object. Create only the json object. Do not provide explanations or notes.

Example JSON Object:[/INST]
{
  "invoice.language": "English",
  "cdtr.name": "Kraxi GmbH",
}
</s>
[INST]

{My document content}

[/INST]

Summarize Document

Next I use the following template to summarize the data of the document.

<s>[INST] You are a clerk in a ... company and your job is to check business documents.....[/INST]

{My document content}

</s>
[INST] Briefly summarize this  document ......
.......
[/INST]

The results are very good now. Of course, every word in the instruction counts!

mistralai / mistral-inference

How to use a prompt for text analysis? #145

Categorize Document

Summarize Document