Open liuhu opened 1 year ago
Hi @liuhu , Thank you for your interest in our work and for your kind words!
We haven't conducted many experiments with other open-source models. I agree that new open-source models come out every day claiming to surpass ChatGPT, but they are eventually found not to be as general and adaptive as ChatGPT.
I don't have a clear solution to that other than trying a few others (maybe Falcon?).
Regarding fine-tuning: yes, I think that if you have the resources and the data, fine-tuning on your examples can help reduce the prompt to little more than the example-specific inputs.
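As a rough sketch of what this means in practice (the example text and function names here are made up for illustration): with a base model, the prompt must carry all the few-shot examples on every call, while a model fine-tuned on those examples only needs the question itself.

```python
# Hypothetical few-shot block; in PAL this would be worked examples
# of question -> Python program.
FEW_SHOT_EXAMPLES = """Q: example question 1
# solution program 1
Q: example question 2
# solution program 2
"""

def build_prompt(question: str, fine_tuned: bool = False) -> str:
    """Build the prompt sent to the model.

    A base model needs the few-shot examples prepended; a model
    fine-tuned on those examples can be prompted with the question alone.
    """
    if fine_tuned:
        return f"Q: {question}\n"
    return FEW_SHOT_EXAMPLES + f"Q: {question}\n"
```

The per-request token cost then scales with the question alone, not with the number of demonstrations.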
Maybe @luyug @madaan @shuyanzhou have some thoughts.
Best, Uri
We are also using PAL to solve user problems through generated programs, and we are facing issues with long prompts and inference. We would like to know whether fine-tuning is effective in addressing these issues, or whether there are other solutions we should consider.
@HXCrazy thank you for reaching out!
Please describe the problems you are facing in a new issue so we can provide a better response.
Best, Uri
Background
We have a chatbot that uses the PAL method to program custom functions that answer user questions in combination with user data in our system. The user data is sleep and exercise data uploaded through smart wearable devices; the data types are very rich (50+ fields) and the volume is large (each user generates multiple records per day). The Python code generated by the LLM determines the time range of the query, the data fields to retrieve, the orchestration of functions, and other details.
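To make the setup concrete, here is a hypothetical illustration of the kind of program the LLM might generate for a question like "How was my sleep over the last week?" (the field names, the question, and the `solve` wrapper are invented for this sketch; they are not from the actual system):

```python
from datetime import date, timedelta

def solve(today: date = date(2023, 8, 1)) -> dict:
    """Sketch of LLM-generated code: pick the time range and the
    data fields to query out of the 50+ available ones."""
    # Determine the time range of the query: the last 7 days.
    end = today
    start = end - timedelta(days=7)
    # Select only the fields relevant to the question.
    fields = ["sleep_duration_minutes", "deep_sleep_minutes"]
    # In the real system this dict would be passed to a data-query function.
    return {"start": start.isoformat(), "end": end.isoformat(), "fields": fields}
```

The point is that the model's output is executable query logic, not a textual answer, so correctness of the generated code directly determines answer quality.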
Prompt template:
Thank you very much for your patience in reading this far. I wrote a lot of background information in order to describe the problem, which resulted in a very long text.
Question
PAL is an amazing method, and we already use it in production. We want to replace the OpenAI LLM with an open-source LLM, and we have encountered some problems:
Our analysis found:
a. Compared with gpt-3.5-turbo, PaLM 2, WizardCoder, and Vicuna all show a decline in date-reasoning performance. Is there any way to improve date reasoning?
b. The generalization ability of WizardCoder-15B and Vicuna-13B is insufficient: much of the output code essentially copies the few-shot examples instead of generating code for the actual question. Is this caused by insufficient model parameters?
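For context on (a), a minimal sketch of the kind of date-reasoning program PAL expects the model to emit (the question and date here are hypothetical): the model writes the date arithmetic as code, so `datetime` does the calculation that weaker models get wrong when reasoning in text.

```python
from datetime import datetime, timedelta

# Hypothetical question: "Today is 2023-03-01. What was the date 36 days ago?"
today = datetime(2023, 3, 1)
answer = (today - timedelta(days=36)).strftime("%Y-%m-%d")
```

If a model degrades here, it is usually failing to produce this code structure at all (copying a few-shot example verbatim), not failing at the arithmetic, since the arithmetic is delegated to Python.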
Thanks again, everyone. I would appreciate it if you could pick some of these questions and help answer them.