meltyyyyy / LLM-Science-Exam


Read through the Discussions #2

Open meltyyyyy opened 10 months ago

meltyyyyy commented 10 months ago

https://www.kaggle.com/competitions/kaggle-llm-science-exam/discussion/433879

I've been monitoring this competition for the past few weeks or so; here are my recommendations.

Here is the competition description: https://www.kaggle.com/competitions/kaggle-llm-science-exam/data
Long story short, they used an LLM to write a set of multiple-choice questions, each with exactly one correct (or best) answer. The training dataset contains only 200 questions, which is almost certainly too small, so multiple members created additional datasets and shared them with the community; the most famous of these belongs to #2 below.
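To make the data layout concrete, here is a minimal sketch of working with rows in the shape described above. The column names (`prompt`, `A`..`E`, `answer`) are an assumption based on the competition's published format; verify them against the actual `train.csv`.

```python
import pandas as pd

# Assumed train.csv layout: one question ("prompt"), five candidate
# answers ("A".."E"), and the correct letter ("answer") per row.
train = pd.DataFrame([{
    "prompt": "Which particle carries the electromagnetic force?",
    "A": "Photon", "B": "Gluon", "C": "Neutrino", "D": "Electron", "E": "Proton",
    "answer": "A",
}])

def format_question(row):
    """Flatten one row into a single text block an LLM could score."""
    options = "\n".join(f"{k}: {row[k]}" for k in "ABCDE")
    return f"{row['prompt']}\n{options}"

print(format_question(train.iloc[0]))
```

In practice you would `pd.read_csv("train.csv")` instead of building the frame by hand.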

Kaggle member [Radek Osmulski](https://www.kaggle.com/radek1) created supplemental datasets and models for this competition; here's a list of some of his shared content:

https://www.kaggle.com/datasets/radek1/best-llm-starter-pack
https://www.kaggle.com/datasets/radek1/additional-train-data-for-llm-science-exam
https://www.kaggle.com/datasets/radek1/15k-high-quality-examples
https://www.kaggle.com/competitions/kaggle-llm-science-exam/discussion/432607
Here are some other Kaggle members' LLM fine-tuning tutorials:

https://www.kaggle.com/competitions/kaggle-llm-science-exam/discussion/424519
Here's a good post about the "ART OF PROMPT ENGINEERING". There may be other posts about prompt engineering, but this one was written specifically for the competition.

https://www.kaggle.com/competitions/kaggle-llm-science-exam/discussion/433311
Here's a Wikipedia dataset that I created in case you'd like to build your own training data. It also links a discussion on how to create the training dataset, to which [@radek1](https://www.kaggle.com/radek1) contributed some insights:

https://www.kaggle.com/datasets/bwandowando/wikipedia-index-and-plaintext-20230801
https://www.kaggle.com/competitions/kaggle-llm-science-exam/discussion/432123
If you generate additional training data, here's how to use ChatGPT (or another LLM) to validate whether the synthetically generated data is correct:
https://www.kaggle.com/code/jhoward/getting-started-with-llms
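One common validation pattern is to ask a judge model to solve each generated question blind, then keep only the examples where the judge's letter agrees with the generator's label. The prompt wording and parsing below are my assumptions, not the linked notebook's exact recipe; plug the prompt into whatever chat API you use.

```python
# Hypothetical helpers for screening synthetic questions with an LLM judge.

def build_judge_prompt(question: str, options: dict) -> str:
    """Render a question plus its lettered options for an independent solve."""
    lettered = "\n".join(f"{k}: {v}" for k, v in sorted(options.items()))
    return (
        "Answer the following multiple-choice question. "
        "Reply with a single letter only.\n\n"
        f"{question}\n{lettered}"
    )

def agrees_with_label(judge_reply: str, claimed_answer: str) -> bool:
    """True if the judge's first letter matches the dataset's claimed answer."""
    reply = judge_reply.strip().upper()
    return bool(reply) and reply[0] == claimed_answer.strip().upper()
```

Rows where `agrees_with_label` returns `False` are candidates for dropping or manual review.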

I know there is a ton of content and know-how for this competition, well beyond the items enumerated above, but I hope these bullets help you get up to speed in no time.
meltyyyyy commented 10 months ago

15,000 additional training examples generated with GPT-3.5:

https://www.kaggle.com/datasets/radek1/15k-high-quality-examples

TanimotoRui commented 10 months ago

Sheer brute force of scale... https://www.kaggle.com/competitions/kaggle-llm-science-exam/discussion/436383

Hi everyone. I'm sharing my 60k-sample dataset. Using this dataset, we can easily achieve LB 0.830+ with a single model.

TanimotoRui commented 10 months ago

If we could fine-tune on every publicly available dataset, could we stay near the top?? Perhaps by ensembling multiple models as well.
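One simple way to sketch the multi-model ensemble idea above: average each model's per-option probabilities, then keep the three best letters per question. The space-delimited `"A B C"` output mirrors the competition's submission style, but double-check it against the sample submission; the weighting scheme is an assumption.

```python
import numpy as np

def ensemble_top3(prob_list, weights=None):
    """Average several models' (n_questions, 5) probability arrays
    (optionally weighted) and return top-3 letters per question."""
    probs = np.average(np.stack(prob_list), axis=0, weights=weights)
    top3_idx = np.argsort(-probs, axis=1)[:, :3]  # descending order
    letters = np.array(list("ABCDE"))
    return [" ".join(letters[row]) for row in top3_idx]

# Two models, one question: the blended ranking is A > B > C.
print(ensemble_top3([[[0.7, 0.2, 0.1, 0.0, 0.0]],
                     [[0.5, 0.3, 0.2, 0.0, 0.0]]]))
```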