tloen / alpaca-lora

Instruct-tune LLaMA on consumer hardware
Apache License 2.0

Spanish performance and comparison of performance between models. #88

Closed josemlopez closed 1 year ago

josemlopez commented 1 year ago

Hi there,

I recently noticed that the performance of the Spanish language model is subpar. To improve it, I want to add more Spanish language examples to the model. I was wondering if anyone else has a similar idea and what tools they are using to accomplish this.

Currently, I have only trained the model with some basic cleaning techniques. However, I want to incorporate an "automatic" cleaning method using this PR: https://github.com/tloen/alpaca-lora/pull/62 and compare the performance. It would be interesting to see how the quality of the data can impact the model's improvement.

I am also wondering if there are any benchmarks that should be run to measure the performance of the model. Any suggestions or insights would be greatly appreciated!

DanielWe2 commented 1 year ago

> add more Spanish language examples to the model. I was wondering if anyone else has a similar idea and what tools they are using to accomplish this.

Someone else here in the comments did that for Korean and used the OpenAI GPT API to translate part of the dataset.
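For reference, here is a minimal sketch of that translation approach, assuming the pre-1.0 `openai` Python package and the `alpaca_data.json` format used by this repo; the model name and prompt are illustrative:

```python
# Sketch: translate Alpaca-style instruction data to Spanish via the OpenAI API.
# Assumes the pre-1.0 `openai` package (openai.ChatCompletion) and an
# alpaca_data.json file with "instruction"/"input"/"output" fields.
import json
import openai

openai.api_key = "sk-..."  # your API key

def translate(text: str) -> str:
    if not text:  # keep empty "input" fields empty
        return text
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Translate the user's text to Spanish. Reply with the translation only."},
            {"role": "user", "content": text},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

with open("alpaca_data.json") as f:
    data = json.load(f)

# Translate a small subset first to check quality and cost.
translated = [{k: translate(v) for k, v in example.items()} for example in data[:100]]

with open("alpaca_data_es.json", "w") as f:
    json.dump(translated, f, ensure_ascii=False, indent=2)
```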

> I am also wondering if there are any benchmarks that should be run to measure the performance of the model. Any suggestions or insights would be greatly appreciated!

Take a look at https://github.com/EleutherAI/lm-evaluation-harness

Through that I learned how tests like WinoGrande are actually presented to the model.

What I tried was to build a prompt with instructions around the test data, the way a human using a chatbot would. I think that would be more relevant, but the result depends heavily on the fine-tuning and on finding the best prompt.

What is normally done is: for a multiple-choice A/B test (like WinoGrande), provide both options as full sentences to the model and let it calculate the probability of each variant. The variant with the higher probability is the one the model chooses. That should show the theoretical performance of the actual model. It is more objective and totally independent of the prompt, but also not really representative of what a normal chatbot user would see.
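A minimal sketch of that scoring approach, using Hugging Face transformers; `gpt2` is an assumption here, standing in for whichever base or base+LoRA checkpoint you actually want to evaluate:

```python
# Sketch: probability-based A/B scoring as described above. The model compares
# the total log-likelihood it assigns to each candidate sentence and "chooses"
# the more likely one, with no prompt involved at all.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_log_likelihood(sentence: str) -> float:
    """Sum of the token log-probabilities the model assigns to the sentence."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Each position predicts the next token, so shift targets by one.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    target = ids[:, 1:]
    token_ll = log_probs.gather(2, target.unsqueeze(-1)).squeeze(-1)
    return token_ll.sum().item()

# WinoGrande-style pair: the same sentence with the two candidate fillers.
option_a = "The trophy didn't fit in the suitcase because the trophy was too big."
option_b = "The trophy didn't fit in the suitcase because the suitcase was too big."

chosen = max([option_a, option_b], key=sentence_log_likelihood)
print("Model prefers:", chosen)
```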

Anyway, I think it would be a good idea to test the LoRA models: the base version compared to base+LoRA. That would show whether the fine-tuning somehow degrades the general model performance.
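A rough sketch of how both variants could be loaded for the same evaluation, assuming peft's `PeftModel`; the checkpoint names follow this repo's README but are placeholders for whatever you trained:

```python
# Sketch: load the base model and the base+LoRA variant side by side so both
# can be run through the same benchmark (e.g. the scoring function above, or
# lm-evaluation-harness). Note this keeps two 7B models in memory.
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

base_name = "decapoda-research/llama-7b-hf"  # assumption: names from the README
lora_name = "tloen/alpaca-lora-7b"

tokenizer = LlamaTokenizer.from_pretrained(base_name)

base_model = LlamaForCausalLM.from_pretrained(
    base_name, torch_dtype=torch.float16, device_map="auto"
)

# The LoRA weights are applied on top of a second copy of the base model.
lora_model = PeftModel.from_pretrained(
    LlamaForCausalLM.from_pretrained(
        base_name, torch_dtype=torch.float16, device_map="auto"
    ),
    lora_name,
)

for name, model in [("base", base_model), ("base+lora", lora_model)]:
    model.eval()
    # ...run the same benchmark on `model` here and compare the scores...
    print(f"{name} ready for evaluation")
```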

I am not aware of any language-specific tests. OpenAI released a chart for GPT-4 with model performance per language; I am not sure how they measured that.

josemlopez commented 1 year ago

Thanks Daniel! This is very interesting. I'll follow your leads and share some of my insights here.

josemlopez commented 1 year ago

Closing this issue, since I realised the best place for this is "Discussions". Here is the thread I just opened there: https://github.com/tloen/alpaca-lora/discussions/108 .

Thanks!