loubnabnl / santacoder-finetuning

Fine-tune SantaCoder for Code/Text Generation.
Apache License 2.0
179 stars, 22 forks

Was SantaCoder pre-trained on C or C++? #18

Closed by lionday 11 months ago

lionday commented 1 year ago

Why can it continue writing C or C++ code? Was it pretrained on C or C++ code?

loubnabnl commented 1 year ago

Although we filtered for Python/Java/JS files, there might be some C/C++ code in those files that the model learned from. You can refer to this paper for some details about language spillover:

Language models are able to generate code with correct syntax and pass unit tests in programming languages they are not intentionally trained on. We hypothesize that the data “spillover” effect, where code in one language is present in other languages through code comments or co-occurrences. Such amount of “spillover” data are enough for large language models to learn different languages that are embedded within the main language.
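To make the "spillover" idea concrete, here is a minimal sketch of how one might look for C/C++ snippets embedded in the comments and string literals of a Python file. This is not how the SantaCoder training data was filtered or analyzed. The function name `find_c_like_spillover`, the regex patterns, and the sample source are all hypothetical and chosen only for illustration.

```python
import io
import re
import tokenize
from typing import List, Tuple

# Hypothetical heuristics: patterns that often (not always) indicate C/C++ text.
C_LIKE_PATTERNS = [
    re.compile(r"#include\s*[<\"]"),
    re.compile(r"\b(?:int|void|char|double|float)\s+\w+\s*\([^)]*\)\s*\{"),
    re.compile(r"\bstd::\w+"),
    re.compile(r"\bprintf\s*\("),
]


def find_c_like_spillover(python_source: str) -> List[Tuple[int, str]]:
    """Return (start_line, token_text) for each comment or string literal
    in the Python source that matches one of the C-like patterns."""
    hits = []
    tokens = tokenize.generate_tokens(io.StringIO(python_source).readline)
    for tok in tokens:
        if tok.type in (tokenize.COMMENT, tokenize.STRING):
            if any(p.search(tok.string) for p in C_LIKE_PATTERNS):
                hits.append((tok.start[0], tok.string))
    return hits


# Made-up Python file containing C code in a comment and in a string literal.
SAMPLE = '''
# Equivalent C: int add(int a, int b) { return a + b; }
def add(a, b):
    return a + b

C_SOURCE = """
#include <stdio.h>
int main(void) { printf("hi\\n"); return 0; }
"""

def greet():
    return "hello"
'''

if __name__ == "__main__":
    for line_no, text in find_c_like_spillover(SAMPLE):
        print(f"line {line_no}: {text.splitlines()[0]!r}")
```

Running a scan like this over a Python-only corpus would give a rough lower bound on how much C-like text a model could have seen. The patterns are deliberately simple and will miss or misclassify many cases.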

lionday commented 1 year ago

Thank you very much for your response. Additionally, I would like to ask if you have any other pre-trained models for C or C++ code besides StarCoder.

loubnabnl commented 1 year ago

We’ll release some smaller versions of StarCoder in a few weeks.

lionday commented 1 year ago

Okay, thank you for your work!