microsoft/DeBERTa: The implementation of DeBERTa
Questions about pretraining dataset preparation #57
Open

mansimane commented 3 years ago
For the openwebtext dataset, there seem to be two possible sources. Which one was used?

1. Download the dataset directly from https://skylion007.github.io/OpenWebTextCorpus/
2. Download the dataset from the URLs listed in https://github.com/NVIDIA/Megatron-LM/tree/main/tools/openwebtext
Was the deduplication done on URLs, or by running LSH on the documents?
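To make the question concrete, here is a rough, standard-library-only sketch of what I mean by LSH-based document deduplication (MinHash signatures plus banding), as opposed to simply deduplicating on normalized URLs. The shingle size, number of permutations, and band count are just illustrative values I picked, not anything I assume DeBERTa used.

```python
import hashlib
from collections import defaultdict

def shingles(text, k=5):
    """Character k-shingles of a whitespace-normalized, lowercased document."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash_signature(doc, num_perm=64):
    """MinHash signature: minimum hash value per seeded hash function over the shingles."""
    sh = shingles(doc)
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in sh))
    return sig

def lsh_candidate_pairs(docs, num_perm=64, bands=16):
    """Band each signature; documents sharing any band bucket are near-duplicate candidates."""
    rows = num_perm // bands
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash_signature(text, num_perm)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    return [ids for ids in buckets.values() if len(ids) > 1]

# URL-level dedup, by contrast, would just keep a document only if its
# normalized source URL has not been seen before.
```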
Did you clean up the dataset as described in step 1 of the "Prepare the data for GPT-2 training" section of that doc? Could you share any extra cleanup steps you performed on the dataset?
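For reference, this is roughly the kind of per-document cleanup I understand that step to describe (unicode fixing with ftfy, English-only filtering, and a minimum-length cut). The JSON key, the 128-token threshold, and the exact filters here are my assumptions for illustration, not necessarily what was actually used.

```python
import json
import ftfy                    # pip install ftfy
from langdetect import detect  # pip install langdetect

def clean_documents(in_path, out_path, min_tokens=128):
    """Read loose-JSON documents (one per line), keep cleaned English docs above a length cut."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            doc = json.loads(line)
            text = ftfy.fix_text(doc["text"])        # repair mojibake / unicode issues
            if len(text.split()) < min_tokens:       # drop very short documents
                continue
            try:
                if detect(text) != "en":             # keep English only
                    continue
            except Exception:
                continue                             # undetectable language: drop
            fout.write(json.dumps({"text": text}) + "\n")
```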