microsoft / LLMLingua

To speed up LLM inference and enhance the LLM's perception of key information, compress the prompt and KV-Cache, achieving up to 20x compression with minimal performance loss.
https://llmlingua.com/
MIT License
4.18k stars · 222 forks

[Question]: Difficulty Reproducing Results in CoT.ipynb #123

Open ushakov opened 3 months ago

ushakov commented 3 months ago

Describe the issue

I attempted to reproduce the results of the LLMLingua paper using the CoT.ipynb notebook from the examples folder. However, I encountered a discrepancy in the accuracy achieved. The result in CoT.ipynb reports 78% accuracy, but I only achieved 68% accuracy in my reproduction attempt.

Changes made to CoT.ipynb:

The rest of the file was kept unchanged as per the GitHub version.

Expected Behavior:

The reproduction should yield results consistent with the reported 78% accuracy, as in the output of the last cell in the notebook:

num_q 1319 correct 1032 ratio 0.7824

Actual Behavior:

I obtained only 68% accuracy:

num_q 1319 correct 900 ratio 0.6823
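
For context, the `num_q ... correct ... ratio ...` line is just an exact-match count over the 1319 GSM8K test questions; below is a minimal sketch of that scoring (the answer extraction here is a simplification, not the notebook's exact parsing):

```python
import re

def extract_answer(completion: str) -> str:
    """Take the last number in the model output as the predicted answer
    (simplified; the notebook has its own parsing logic)."""
    numbers = re.findall(r"-?\d+\.?\d*", completion.replace(",", ""))
    return numbers[-1] if numbers else ""

def score(completions: list[str], gold_answers: list[str]) -> None:
    """Print the accuracy line in the same format as the notebook's last cell."""
    correct = sum(
        extract_answer(pred) == gold for pred, gold in zip(completions, gold_answers)
    )
    num_q = len(gold_answers)
    print(f"num_q {num_q} correct {correct} ratio {correct / num_q:.4f}")
```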

Question

Is this expected? Any ideas what could be the problem here? If the culprit is the OpenAI model used, any ideas how to fix this? The gpt-3.5 model family no longer allows non-chat (completion) inference...

Thanks in advance!
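
For reference, the two inference modes in question differ only in the endpoint being called; here is a minimal sketch assuming the openai>=1.0 Python client (the prompt and generation settings are placeholders, not the notebook's exact configuration):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
prompt = "Question: ...\nLet's think step by step."  # placeholder CoT prompt

# Completion (non-chat) mode: only instruct-style models still accept this endpoint.
completion = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt,
    max_tokens=400,
    temperature=0,
)
print(completion.choices[0].text)

# Chat mode: what the current gpt-3.5-turbo chat models require.
chat = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=400,
    temperature=0,
)
print(chat.choices[0].message.content)
```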

iofu728 commented 3 months ago

Hi @ushakov, thank you for your support with LLMLingua.

The gap is primarily due to the different inference modes of the OpenAI model. Currently, there are two ways to replicate the reported results:

  1. Use Azure OpenAI, which still supports gpt-3.5-turbo-0301 in completion mode.
  2. Use "gpt-3.5-turbo-instruct", which supports completion mode and can be compared against the original-prompt results.

Regarding the performance loss in chat mode, we are currently designing methods to improve it.
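
If it helps, here is a minimal sketch of option 1, assuming an Azure OpenAI resource with a completion-capable gpt-35-turbo-0301 deployment (the endpoint, API version, and deployment name are placeholders to substitute with your own):

```python
import os

from openai import AzureOpenAI

# Placeholders: use your own resource endpoint, API version, and deployment name.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2023-05-15",
)

response = client.completions.create(
    model="<your-gpt-35-turbo-0301-deployment>",  # Azure expects the deployment name here
    prompt="Question: ...\nLet's think step by step.",  # placeholder CoT prompt
    max_tokens=400,
    temperature=0,
)
print(response.choices[0].text)
```

Note that with Azure the `model` argument takes the deployment name you created in the portal, not the underlying model id.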

ushakov commented 3 months ago

Thanks for the pointers! While I'm waiting for Microsoft to approve my access to GPT models in the Azure API, I've run a test with gpt-3.5-turbo-instruct, and it unfortunately works even worse than the chat-mode gpt-3.5-turbo-0301 I used previously: the full prompt gives 76.9%, the compressed prompt gives 59.6%.

Yeqishen commented 4 weeks ago

> Hi @ushakov, thank you for your support with LLMLingua.
>
> The gap is primarily due to the different inference modes of the OpenAI model. Currently, there are two ways to replicate the reported results:
>
>   1. Use Azure OpenAI, which still supports gpt-3.5-turbo-0301 in completion mode.
>   2. Use "gpt-3.5-turbo-instruct", which supports completion mode and can be compared against the original-prompt results.
>
> Regarding the performance loss in chat mode, we are currently designing methods to improve it.

Thank you for your answer, it is very helpful. When I tried to use Azure's gpt-3.5-turbo-0301, I got an error that seemed to mean the model no longer exists.

If possible, could you please tell me which currently available Azure model can be used to reproduce this experiment? I would be very grateful.
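
As a side note, one way to see which of your Azure deployments still accept completion requests is to probe them; below is a minimal sketch assuming the openai>=1.0 client (the endpoint, API version, and candidate deployment names are placeholders):

```python
import os

import openai
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",  # placeholder
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2023-05-15",
)

# Candidate deployment names to probe (placeholders; use the names you created
# in the Azure portal).
candidates = ["gpt-35-turbo-0301", "gpt-35-turbo-instruct", "gpt-35-turbo"]

for name in candidates:
    try:
        client.completions.create(model=name, prompt="ping", max_tokens=1)
        print(f"{name}: completion mode available")
    except openai.OpenAIError as exc:
        # Deployment missing, or the deployment does not support the completions endpoint.
        print(f"{name}: not usable ({exc})")
```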

iofu728 commented 2 weeks ago

> Thank you for your answer, it is very helpful. When I tried to use Azure's gpt-3.5-turbo-0301, I got an error that seemed to mean the model no longer exists.
>
> If possible, could you please tell me which currently available Azure model can be used to reproduce this experiment? I would be very grateful.

Hi @Yeqishen, you can try using "gpt-3.5-turbo-instruct".