microsoft / DeBERTa

The implementation of DeBERTa

Questions about pretraining dataset preparation #57

Open mansimane opened 3 years ago

mansimane commented 3 years ago
  1. For the openwebtext dataset, there seem to be two sources. Which one was used?
    1. Downloading the dataset directly from https://skylion007.github.io/OpenWebTextCorpus/
    2. Downloading the dataset from the URLs listed in https://github.com/NVIDIA/Megatron-LM/tree/main/tools/openwebtext
  2. Was deduplication done on URLs, or by running LSH over the documents? (A rough sketch of the LSH approach follows below.)
  3. Did you clean up the dataset as per step 1 of the “Prepare the data for GPT-2 training” section in the Megatron-LM doc linked above? Could you share any extra cleanup steps you performed? (A sketch of that style of cleanup also follows below.)
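For context on question 2, here is a minimal sketch of what document-level near-duplicate removal with MinHash LSH might look like. It assumes the `datasketch` library; the shingle size and similarity threshold are illustrative guesses, not the values used for DeBERTa:

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128  # number of hash permutations per signature

def minhash_signature(text, shingle_size=5):
    """Build a MinHash signature from word shingles of one document."""
    words = text.split()
    m = MinHash(num_perm=NUM_PERM)
    for i in range(max(1, len(words) - shingle_size + 1)):
        shingle = " ".join(words[i:i + shingle_size])
        m.update(shingle.encode("utf-8"))
    return m

def deduplicate(docs, threshold=0.8):
    """Keep the first copy of each near-duplicate cluster; return kept doc ids."""
    lsh = MinHashLSH(threshold=threshold, num_perm=NUM_PERM)
    kept = []
    for doc_id, text in docs.items():
        sig = minhash_signature(text)
        if lsh.query(sig):  # some indexed doc already exceeds the threshold
            continue
        lsh.insert(doc_id, sig)
        kept.append(doc_id)
    return kept
```

URL-based deduplication, by contrast, would just be a set lookup on the source URL before downloading: much cheaper, but it misses reposts of the same content under different URLs.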
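And for question 3, a rough sketch of the kind of per-document cleanup that step describes (mojibake fixing, language filtering, minimum-length filtering). The `ftfy`/`langdetect` calls and the 128-word cutoff are assumptions for illustration, not a record of DeBERTa's actual pipeline:

```python
from typing import Optional

import ftfy
from langdetect import detect

MIN_WORDS = 128  # assumed cutoff; tune to your corpus

def clean_document(text: str) -> Optional[str]:
    """Normalize one document; return None if it should be dropped."""
    text = ftfy.fix_text(text).strip()    # repair mojibake / unicode damage
    if len(text.split()) < MIN_WORDS:     # drop very short documents
        return None
    try:
        if detect(text) != "en":          # keep English-only text
            return None
    except Exception:                     # langdetect raises on degenerate input
        return None
    return text
```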