lionday closed this issue 1 year ago
Although we filtered for Python/Java/JS files, some C/C++ code may still be embedded in those files, and the model could have learned it from there. You can refer to this paper for some details about language spillover:
Language models are able to generate code with correct syntax and pass unit tests in programming languages they are not intentionally trained on. We hypothesize that this is due to a data “spillover” effect, where code in one language is present in other languages through code comments or co-occurrences. Such “spillover” data is enough for large language models to learn different languages that are embedded within the main language.
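To make the spillover idea concrete, here is a minimal sketch of how C/C++ text can sit inside a file that a language filter would still count as Python. Everything in it is an assumption for illustration: the `find_c_like_spans` helper, the regex heuristics, and the sample file are hypothetical and are not how the StarCoder data was actually filtered or analyzed.

```python
import io
import re
import tokenize

# Hypothetical heuristics: patterns that often indicate C/C++ source text.
C_LIKE_PATTERNS = [
    re.compile(r"#include\s*[<\"]"),
    re.compile(r"\b(?:int|void|char|double|float)\s+\w+\s*\([^)]*\)\s*\{"),
    re.compile(r"\bstd::\w+"),
]


def find_c_like_spans(python_source):
    """Return (line_number, kind) for comments/strings that look like C/C++."""
    hits = []
    tokens = tokenize.generate_tokens(io.StringIO(python_source).readline)
    for tok in tokens:
        # Only comments and string literals can carry "foreign" code
        # without breaking the Python syntax of the file.
        if tok.type not in (tokenize.COMMENT, tokenize.STRING):
            continue
        if any(p.search(tok.string) for p in C_LIKE_PATTERNS):
            kind = "comment" if tok.type == tokenize.COMMENT else "string"
            hits.append((tok.start[0], kind))
    return hits


# A made-up Python file that embeds C code in a comment and in a string.
# It is only tokenized here, never executed.
SAMPLE = '''\
import cffi

ffi = cffi.FFI()
# Equivalent C: int add(int a, int b) { return a + b; }
C_SOURCE = """
#include <stdio.h>
int add(int a, int b) { return a + b; }
"""

def add(a, b):
    return a + b
'''

print(find_c_like_spans(SAMPLE))  # → [(4, 'comment'), (5, 'string')]
```

A file like `SAMPLE` passes a Python-only filter, yet it still exposes the model to valid C syntax. The regexes are deliberately crude, so treat the result as a rough indicator rather than a measurement.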
Thank you very much for your response. Additionally, I would like to ask if you have any other pre-trained models for C or C++ code besides StarCoder.
We’ll release some smaller versions of StarCoder in a few weeks.
Okay, thank you for your work!
Why can it continue writing C or C++ code? Was it pretrained on C or C++ code?