salesforce / CodeT5

Home of CodeT5: Open Code LLMs for Code Understanding and Generation
https://arxiv.org/abs/2305.07922
BSD 3-Clause "New" or "Revised" License

Pretraining Dataset #64

Closed ShushanArakelyan closed 1 year ago

ShushanArakelyan commented 1 year ago

Hi, could you please clarify the dataset you used for pretraining? In one of the earlier answers, I saw a mention of using "all non-valid/test examples from CodeSearchNet for pretraining". Does this mean examples from the training portion of the dataset?

For example, for Python, I counted only 412,178 such examples in CodeSearchNet, but in the paper, you mention using 453,772 Python examples from CodeSearchNet for pretraining. I am not sure where this inconsistency comes from.

Also, you refer to CodeBERT when discussing the pretraining data, but your dataset statistics are also slightly different from the numbers in the CodeBERT paper. Were there any additional filtering steps that you applied for CodeT5?

Thanks in advance!

yuewang-cuhk commented 1 year ago

Hi, I remember that the training split is not the same as the non-valid/test portion; the training split is actually smaller. You can try to verify this yourself. As for data filtering, we may have filtered out code that was too short or too long.
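
For anyone who wants to verify the split sizes, here is a minimal sketch of such a count. The directory layout and shard naming follow the public CodeSearchNet dump, and the length thresholds in the filter are illustrative assumptions, not CodeT5's actual values:

```python
import gzip
from pathlib import Path

# Assumed local layout of the public CodeSearchNet Python dump
# (not confirmed to match the exact data CodeT5 was trained on):
#   python/final/jsonl/{train,valid,test}/*.jsonl.gz
DATA_ROOT = Path("python/final/jsonl")

def count_examples(split: str) -> int:
    """Count records across all .jsonl.gz shards of one split."""
    total = 0
    for shard in sorted((DATA_ROOT / split).glob("*.jsonl.gz")):
        with gzip.open(shard, "rt", encoding="utf-8") as f:
            total += sum(1 for _ in f)
    return total

for split in ("train", "valid", "test"):
    print(split, count_examples(split))

# Illustrative length filter in the spirit of "drop code that is too
# short or too long"; these thresholds are made up, not CodeT5's.
def keep(code: str, min_tokens: int = 3, max_tokens: int = 512) -> bool:
    n = len(code.split())
    return min_tokens <= n <= max_tokens
```

If the raw dump contains records beyond the three official splits, the non-valid/test count can differ from the train count alone, which may account for part of the discrepancy.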