paperswithcode / galai

Model API for GALACTICA
Apache License 2.0
2.68k stars, 276 forks

Fine-tuning specific areas #65

Closed peng06051126 closed 1 year ago

peng06051126 commented 1 year ago

First of all, thank you for your great contribution. I would like to fine-tune Galactica to generate articles from topics. Can you provide training data samples, or do you have any suggestions?

mkardas commented 1 year ago

The Galactica models were pretrained on a large number of papers (see our paper for more details):

image

so you should be able to generate articles out-of-the-box, but it depends on your use case.
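A minimal sketch of what out-of-the-box article generation might look like with the `galai` package. The helper names `build_article_prompt` and `generate_article` are hypothetical, and the `# Title` prompt format, the `"base"` model size, and the `new_doc`/`max_length` arguments are assumptions based on the package README, so check them against the version you have installed:

```python
def build_article_prompt(topic: str) -> str:
    """Format a topic as a document title followed by a blank line.

    Assumption: a Markdown-style title is a reasonable way to start
    a new document; other prompt formats may work better for your use case.
    """
    return f"# {topic.strip()}\n\n"


def generate_article(topic: str, model_name: str = "base", max_length: int = 400) -> str:
    """Generate an article draft for `topic` with a pretrained Galactica model."""
    # Imported lazily so the prompt helper works without galai installed.
    import galai as gal

    # Assumption: "base" is one of the published model sizes; loading it
    # downloads the weights on first use.
    model = gal.load_model(model_name)

    # new_doc=True is assumed to tell the model the prompt starts a new document.
    return model.generate(build_article_prompt(topic), new_doc=True, max_length=max_length)
```

Calling `generate_article("The Ising Model")` would then return a draft to inspect before deciding whether fine-tuning is needed at all.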

peng06051126 commented 1 year ago

Thank you for your reply. May I ask how the model performs on non-English data? Have any relevant tests been run? And what proportion of the pretraining dataset is non-English data, such as Chinese?

mkardas commented 1 year ago

By design, the models are not multi-lingual, and most of the natural language documents in the pretraining corpus are written in English. See more in the Introduction to GALACTICA Models notebook (look for "multi-lingual").