microsoft / DeBERTa

The implementation of DeBERTa

Questions about pretraining dataset preparation #57

Open mansimane opened 3 years ago

mansimane commented 3 years ago
  1. For the openwebtext dataset, there seem to be two sources. Which one was used?
    1. Downloading the dataset directly from https://skylion007.github.io/OpenWebTextCorpus/
    2. Downloading the dataset from the URLs listed in https://github.com/NVIDIA/Megatron-LM/tree/main/tools/openwebtext
  2. Was deduplication done on URLs, or by running LSH over the documents? (A rough sketch of the LSH approach follows below.)
  3. Did you clean up the dataset as per step 1 of the “Prepare the data for GPT-2 training” section in the Megatron-LM doc linked above? Could you share any extra cleanup steps you performed? (A sketch of that style of cleanup also follows below.)
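For context on question 2, here is a minimal sketch of what document-level near-duplicate removal with MinHash LSH might look like. It assumes the `datasketch` library; the shingle size and similarity threshold are illustrative guesses, not the values used for DeBERTa:

```python
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128  # number of hash permutations per signature

def minhash_signature(text, shingle_size=5):
    """Build a MinHash signature from word shingles of one document."""
    words = text.split()
    m = MinHash(num_perm=NUM_PERM)
    for i in range(max(1, len(words) - shingle_size + 1)):
        shingle = " ".join(words[i:i + shingle_size])
        m.update(shingle.encode("utf-8"))
    return m

def deduplicate(docs, threshold=0.8):
    """Keep the first copy of each near-duplicate cluster; return kept doc ids."""
    lsh = MinHashLSH(threshold=threshold, num_perm=NUM_PERM)
    kept = []
    for doc_id, text in docs.items():
        sig = minhash_signature(text)
        if lsh.query(sig):  # some indexed doc already exceeds the threshold
            continue
        lsh.insert(doc_id, sig)
        kept.append(doc_id)
    return kept
```

URL-based deduplication, by contrast, would just be a set lookup on the source URL before downloading: much cheaper, but it misses reposts of the same content under different URLs.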
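And for question 3, a rough sketch of the kind of per-document cleanup that step describes (mojibake fixing, language filtering, minimum-length filtering). The `ftfy`/`langdetect` calls and the 128-word cutoff are assumptions for illustration, not a record of DeBERTa's actual pipeline:

```python
from typing import Optional

import ftfy
from langdetect import detect

MIN_WORDS = 128  # assumed cutoff; tune to your corpus

def clean_document(text: str) -> Optional[str]:
    """Normalize one document; return None if it should be dropped."""
    text = ftfy.fix_text(text).strip()    # repair mojibake / unicode damage
    if len(text.split()) < MIN_WORDS:     # drop very short documents
        return None
    try:
        if detect(text) != "en":          # keep English-only text
            return None
    except Exception:                     # langdetect raises on degenerate input
        return None
    return text
```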