microsoft / CodeBERT


pre-training data about UniXcoder #244

Open skye95git opened 1 year ago

skye95git commented 1 year ago
[Screenshot: Snipaste_2023-04-07_16-14-26]

In the paper, the C4 dataset is used to pre-train UniXcoder. Which subset of C4 does the paper use? Is it en?

[Screenshot: Snipaste_2023-04-07_16-16-48]

How many examples are in your training set?

guoday commented 1 year ago

from datasets import load_dataset

# Load the "en" configuration of C4 and select its train split.
dataset = load_dataset("c4", "en")["train"]
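
The reply above identifies the subset but does not state how many examples the train split contains. As a rough sketch (not something from the thread), one way to check the size yourself is to stream the split and count records without downloading everything first. The helper names `count_examples` and `stream_c4_en_train` are hypothetical, and the `path` default simply mirrors the `"c4"` name used in the reply; newer releases of the `datasets` library may expect `"allenai/c4"` instead, so treat that as an assumption to verify.

```python
from itertools import islice
from typing import Iterable, Optional


def count_examples(examples: Iterable, limit: Optional[int] = None) -> int:
    """Count items in an iterable, stopping after `limit` items if given."""
    if limit is not None:
        examples = islice(examples, limit)
    return sum(1 for _ in examples)


def stream_c4_en_train(path: str = "c4"):
    """Return a streaming view of the C4 "en" train split (hypothetical helper).

    `datasets` is imported lazily so count_examples works without it.
    The default path mirrors the reply above; it is an assumption that
    your installed `datasets` version still resolves this name.
    """
    from datasets import load_dataset

    return load_dataset(path, "en", split="train", streaming=True)
```

For a quick sanity check, `count_examples(stream_c4_en_train(), limit=1000)` reads at most 1000 records. Passing `limit=None` iterates over the whole split, which can take a very long time on a corpus of this size.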