salesforce / CodeT5

Home of CodeT5: Open Code LLMs for Code Understanding and Generation
https://arxiv.org/abs/2305.07922
BSD 3-Clause "New" or "Revised" License

Pretraining Dataset #64

Closed ShushanArakelyan closed 1 year ago

ShushanArakelyan commented 1 year ago

Hi, could you please clarify the dataset you used for pretraining? In one of the earlier answers, I saw a mention of using "all non-valid/test examples from CodeSearchNet for pretraining". Does this mean examples from the training portion of the dataset?

For example, for Python, I counted only 412,178 such examples in CodeSearchNet, but in the paper, you mention using 453,772 Python examples from CodeSearchNet for pretraining. I am not sure where this inconsistency comes from.

Also, you refer to CodeBERT when discussing the pretraining data, but your dataset statistics are also slightly different from the numbers in the CodeBERT paper. Were there any additional filtering steps that you applied for CodeT5?

Thanks in advance!

yuewang-cuhk commented 1 year ago

Hi, I remember that the training split is not the same as the non-valid/test portion; the training split is actually smaller. You can try to verify this yourself. As for data filtering, we may have filtered out code that was too short or too long.
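
For anyone who wants to verify the split sizes, here is a minimal sketch of such a count. The directory layout and shard naming follow the public CodeSearchNet dump, and the length thresholds in the filter are illustrative assumptions, not CodeT5's actual values:

```python
import gzip
from pathlib import Path

# Assumed local layout of the public CodeSearchNet Python dump
# (not confirmed to match the exact data CodeT5 was trained on):
#   python/final/jsonl/{train,valid,test}/*.jsonl.gz
DATA_ROOT = Path("python/final/jsonl")

def count_examples(split: str) -> int:
    """Count records across all .jsonl.gz shards of one split."""
    total = 0
    for shard in sorted((DATA_ROOT / split).glob("*.jsonl.gz")):
        with gzip.open(shard, "rt", encoding="utf-8") as f:
            total += sum(1 for _ in f)
    return total

for split in ("train", "valid", "test"):
    print(split, count_examples(split))

# Illustrative length filter in the spirit of "drop code that is too
# short or too long"; these thresholds are made up, not CodeT5's.
def keep(code: str, min_tokens: int = 3, max_tokens: int = 512) -> bool:
    n = len(code.split())
    return min_tokens <= n <= max_tokens
```

If the raw dump contains records beyond the three official splits, the non-valid/test count can differ from the train count alone, which may account for part of the discrepancy.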