microsoft / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Links in dataset/download_books.sh are broken. #233

Open Zeyu-ZEYU opened 1 year ago

Zeyu-ZEYU commented 1 year ago
# dataset/download_books.sh
wget https://the-eye.eu/public/AI/pile_neox/data/BookCorpusDataset_text_document.bin
wget https://the-eye.eu/public/AI/pile_neox/data/BookCorpusDataset_text_document.idx

The links above are no longer accessible. The same URLs are also used in examples_deepspeed/sequence_parallel/ds_pretrain_gpt_1.3B_seq_parallel_32k.sh.

chrysantd commented 1 year ago

Hi, are there any alternative links or other ways to download data now?

leocnj commented 1 year ago

[screenshot] The entire pile_neox directory is gone now.

zxmeng98 commented 1 year ago

I also encountered this problem. Can someone provide the dataset files?

Fridge003 commented 11 months ago

I also encountered this issue. Is there any way to fix it?

sudarshanintel commented 11 months ago

I also encountered this issue. Is there any way to fix it?

IKACE commented 9 months ago

Sorry, I also encountered this issue. Could any developer kindly tell us how to fix this?

loadams commented 8 months ago

The original bookcorpus dataset is no longer available, but there are equivalents and steps to reproduce the data:

https://towardsdatascience.com/replicating-the-toronto-bookcorpus-dataset-a-write-up-44ea7b87d091
https://huggingface.co/datasets/bookcorpus
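As a sketch of how the Hugging Face mirror could be substituted for the broken download: Megatron's tools/preprocess_data.py consumes loose-JSON input (one {"text": ...} object per line) and emits the .bin/.idx pair the script expects. The snippet below writes that JSONL from a tiny in-memory sample; for the real corpus, the sample would be replaced with rows from datasets.load_dataset("bookcorpus", split="train"). The file names here are assumptions, not part of the repo.

```python
# Sketch, assuming Megatron's preprocessing expects loose-JSON input
# with a "text" field per line (the sample texts are placeholders).
import json

def to_jsonl(texts, path):
    """Write one {"text": ...} JSON object per line."""
    with open(path, "w") as f:
        for t in texts:
            f.write(json.dumps({"text": t}) + "\n")

# For the real corpus, replace `sample` with rows from
# datasets.load_dataset("bookcorpus", split="train").
sample = ["first sentence of a book.", "second sentence of a book."]
to_jsonl(sample, "bookcorpus_sample.jsonl")
```

The resulting JSONL would then be tokenized into the .bin/.idx index files with tools/preprocess_data.py (pointing --input at the JSONL and --output-prefix at BookCorpusDataset).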