github / CodeSearchNet

Datasets, tools, and benchmarks for representation learning of code.
https://arxiv.org/abs/1909.09436
MIT License

Fewer examples found than stated in the paper #225

Closed: sajeedmehrab closed this issue 3 years ago

sajeedmehrab commented 3 years ago

The paper says that 503502 examples are available for Python, but the Python data I downloaded contains only 457461 examples when I combine the 14 train files, the test file, and the valid file. When I instead take the whole corpus (of size 1.1M) and keep the entries with a non-empty 'docstring' field, I do end up with the reported 503502. I assume about 46k examples have been filtered out, but I cannot find why.
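For anyone who wants to reproduce the two counts, the comparison can be sketched roughly as follows. This is a minimal sketch, not the exact script used above: it assumes each file is a gzip-compressed JSON Lines file with one JSON object per line, and it treats a whitespace-only 'docstring' as empty. The function names and the example paths in the comments are hypothetical.

```python
import gzip
import json
from pathlib import Path
from typing import Iterable, Iterator


def iter_records(paths: Iterable[Path]) -> Iterator[dict]:
    """Yield one dict per non-blank line of each gzip-compressed JSON Lines file.

    Assumption: every file holds one JSON object per line.
    """
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield json.loads(line)


def count_records(paths: Iterable[Path]) -> int:
    """Count every record in the given files (e.g. train + test + valid)."""
    return sum(1 for _ in iter_records(paths))


def count_with_docstring(paths: Iterable[Path]) -> int:
    """Count records whose 'docstring' field is present and non-empty.

    Assumption: a docstring made only of whitespace counts as empty.
    """
    return sum(
        1
        for record in iter_records(paths)
        if (record.get("docstring") or "").strip()
    )


# Hypothetical usage; adjust the directories to wherever the data was extracted:
#   split_files = sorted(Path("python/final/jsonl").rglob("*.jsonl.gz"))
#   print(count_records(split_files))
```

Running `count_records` over the train, test, and valid files and `count_with_docstring` over the whole-corpus files gives the two numbers being compared; with the figures above, the gap is 503502 - 457461 = 46041.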

sajeedmehrab commented 3 years ago

In Section 2, under "dataset statistics", the paper refers readers to "Table 1" when describing the dataset that results from filtering. This misled me into thinking that the filtered dataset contains 503502 examples. On closer reading, I noticed that the table only lists the number of examples with documentation, and that further filtering was applied to arrive at the final dataset. This resolves the issue!