wasiahmad / PLBART

Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].
https://arxiv.org/abs/2103.06333
MIT License

Question about size of pretrain data #33

Closed shaoyangxu closed 2 years ago

shaoyangxu commented 2 years ago

Hi, I followed the tutorial to download the pre-training data step by step. After I downloaded all the .gz files and unzipped them, I found only 123GB and 58.6GB of data for Java and Python respectively (1020 and 102 JSON files, respectively), which does not match the statistics in the paper (352GB and 224GB respectively). Have I missed something? (I saw there is a de-duplication step in that tutorial, which reduces the data size; is that the reason?)

wasiahmad commented 2 years ago

Yes, your guess is right. The numbers reported in the paper were computed before deduplication.
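
For reference, here is a minimal sketch of how one might tally the extracted data to compare against the paper's numbers. The directory paths are hypothetical; point them at wherever the unzipped .json files actually live:

```python
from pathlib import Path

# Hypothetical locations of the extracted pre-training data; adjust as needed.
DATA_DIRS = {
    "java": Path("data/github/java"),
    "python": Path("data/github/python"),
}

for lang, root in DATA_DIRS.items():
    files = sorted(root.glob("*.json"))
    # Sum the on-disk size of all extracted JSON files for this language.
    total_bytes = sum(f.stat().st_size for f in files)
    print(f"{lang}: {len(files)} json files, {total_bytes / 1e9:.1f} GB")
```

With the deduplicated download, this should print sizes close to the 123GB (Java) and 58.6GB (Python) figures mentioned above, not the pre-deduplication totals from the paper.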