Open knightmarehs opened 1 month ago
This is an interesting topic. This prompt is copied directly from the Microsoft/graphrag repository (https://github.com/microsoft/graphrag/blob/16b4ea5dc9c3c74ec6d97b1551db4631b40949ff/graphrag/query/structured_search/local_search/system_prompt.py#L6-L69). I don't think this is a mistake, as repeating the instruction may indeed improve performance (https://arxiv.org/abs/2402.15449). My guess is that the repetition introduces a degree of 'bidirectionality' into the prompt. Although the paper I linked focuses on embeddings, the idea of repeating the input has also been discussed to some extent for general QA. (I forget the paper lol)
@rangehow Thanks for sharing. If repeating the request in a prompt really does improve the model's understanding and generation, that is quite remarkable. In the long run, though, sufficiently capable models may not need such techniques (after all, it seems a bit odd and also consumes extra tokens). I will run a comparison of the two versions when I have the time.
If you could run this comparison with a small model on some QA benchmarks and obtain quantitative results, that would be very helpful. Otherwise, we will keep the same behavior as Microsoft for now. If it turns out that there is indeed no benefit, you are more than welcome to submit a PR to the repository to change it. 🤗
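A minimal sketch of how such a comparison could be set up. Everything here is an assumption for illustration: `ask_model` is a hypothetical callable wrapping whatever small model is being tested, the section texts are abbreviated stand-ins for the real prompt, and exact-match scoring is just one simple choice of metric.

```python
from typing import Callable, Iterable, Tuple

# Abbreviated stand-ins for the real prompt sections (shortened for illustration).
GOAL = "---Goal---\nAnswer the user's question from the data tables. If you don't know, say so.\n"
FORMAT = "---Target response length and format---\n{response_type}\n"
DATA = "---Data tables---\n{context_data}\n"

# Variant A states Goal/format once; variant B repeats them after the data tables.
SINGLE = GOAL + FORMAT + DATA
REPEATED = GOAL + FORMAT + DATA + GOAL + FORMAT


def exact_match_accuracy(
    template: str,
    examples: Iterable[Tuple[str, str, str]],
    ask_model: Callable[[str, str], str],
) -> float:
    """Fraction of (context, question, answer) examples answered exactly right.

    `ask_model(system_prompt, question)` is a hypothetical wrapper around the
    model under test; it must return the model's answer as a string.
    """
    examples = list(examples)
    if not examples:
        return 0.0
    correct = 0
    for context, question, answer in examples:
        system_prompt = template.format(
            response_type="single word", context_data=context
        )
        prediction = ask_model(system_prompt, question)
        # Case-insensitive exact match; a real benchmark may need a softer metric.
        correct += prediction.strip().lower() == answer.strip().lower()
    return correct / len(examples)
```

Running `exact_match_accuracy(SINGLE, ...)` and `exact_match_accuracy(REPEATED, ...)` on the same examples with the same model gives two numbers to compare; the extra token cost of the repeated variant can be measured separately with the model's own tokenizer.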
For example, the Goal and Target response length and format sections are repeated. Is this repetition intentional, or is it some kind of error?
PROMPTS["local_rag_response"] = """---Role---
You are a helpful assistant responding to questions about data in the tables provided.
---Goal---
Generate a response of the target length and format that responds to the user's question, summarizing all information in the input data tables appropriate for the response length and format, and incorporating any relevant general knowledge. If you don't know the answer, just say so. Do not make anything up. Do not include information where the supporting evidence for it is not provided.
---Target response length and format---
{response_type}
---Data tables---
{context_data}
---Goal---
Generate a response of the target length and format that responds to the user's question, summarizing all information in the input data tables appropriate for the response length and format, and incorporating any relevant general knowledge.
If you don't know the answer, just say so. Do not make anything up.
Do not include information where the supporting evidence for it is not provided.
---Target response length and format---
{response_type}
Add sections and commentary to the response as appropriate for the length and format. Style the response in markdown. """
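To check which sections of a prompt like the one above appear more than once, a small helper along these lines could be used. This is only a sketch: `repeated_sections` is a hypothetical name, and it assumes every section header sits on its own line in the form `---Name---`.

```python
import re
from collections import Counter
from typing import Dict


def repeated_sections(prompt: str) -> Dict[str, int]:
    """Return the ---Name--- section headers that occur more than once."""
    # Match headers that occupy a whole line, e.g. "---Goal---".
    headers = re.findall(r"^---(.+?)---[ \t]*$", prompt, flags=re.MULTILINE)
    counts = Counter(name.strip() for name in headers)
    return {name: n for name, n in counts.items() if n > 1}
```

Called as `repeated_sections(PROMPTS["local_rag_response"])`, it would report the duplicated headers and how many times each occurs, leaving single-occurrence sections such as Role and Data tables out of the result.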