microsoft / DeepSpeedExamples

Example models using DeepSpeed

Question: update preprocessing scripts to use HuggingFace datasets for pretraining? #120

Open adammoody opened 3 years ago

adammoody commented 3 years ago

Collecting the datasets needed for pretraining is a bit of work, especially when downloading from lots of different URLs behind a firewall.

https://github.com/microsoft/DeepSpeedExamples/tree/25d73cf73fb3dc66faefa141b7319526555be9fc/Megatron-LM-v1.1.5-ZeRO3#datasets

I see that some versions of these seem to be available in the HuggingFace datasets repo, like openwebtext.

https://huggingface.co/datasets/openwebtext

For the above, it's especially nice since @stas00 has a small subset one can use for testing:

https://huggingface.co/datasets/stas/openwebtext-10k

It's pretty straightforward to extend the preprocessing script to use the HF datasets as a source rather than a JSON file. Would something like that be acceptable as a PR?
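For illustration, here is a minimal sketch (not the actual Megatron preprocessing code; the dataset id and the "text" field name are assumptions based on openwebtext) of swapping the json-lines reader for an HF dataset iterator:

import itertools
from datasets import load_dataset

# stream documents from the HF hub instead of reading a local json file
ds = load_dataset("stas/openwebtext-10k", split="train")

def documents():
    # replaces the loop over json lines in the preprocessing script
    for example in ds:
        yield example["text"]

# downstream, each document would be tokenized and appended to the
# Megatron indexed dataset exactly as the json-sourced documents are
for doc in itertools.islice(documents(), 3):
    print(doc[:80])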

adammoody commented 3 years ago

There is also a HF dataset for wikipedia:

https://huggingface.co/datasets/wikipedia

Though it may be a bit out of date, since the source URL seems to throw a 404 for me:

https://dumps.wikimedia.org/enwiki/20200501/dumpstatus.json

https://github.com/huggingface/datasets/blob/41e87c7d72789b88aee0957e3723bc416dda24a2/datasets/wikipedia/wikipedia.py#L360

Newer dates for that URL are good, like https://dumps.wikimedia.org/enwiki/20210801/dumpstatus.json
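For reference, a hedged sketch of pointing the HF wikipedia loader at a newer dump (the exact kwargs depend on the datasets version, and building a non-preprocessed dump requires an Apache Beam runner; treat this as an assumption rather than a recipe):

from datasets import load_dataset

# language/date select which dumpstatus.json the loader script fetches;
# a fresh (non-preprocessed) dump is built locally via Apache Beam
wiki = load_dataset("wikipedia", language="en", date="20210801",
                    beam_runner="DirectRunner", split="train")
print(wiki[0]["title"])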

stas00 commented 3 years ago

@adammoody,

First, the Megatron part of this repo is outdated. The new repo is at https://github.com/microsoft/Megatron-DeepSpeed/

Perhaps the DeepSpeed team could update this repo to flag that it is outdated and point to the new location. I guess it'd require updating the DeepSpeed docs as well to point to the new dedicated repo. cc: @tjruwase, @jeffra

Second, the BigScience project in turn forked https://github.com/microsoft/Megatron-DeepSpeed/ with https://github.com/bigscience-workshop/Megatron-DeepSpeed and we are massively updating/improving many of the scripts.

It surely is a good idea to work with the datasets directly and save the time needed to convert and write out huge jsonl files. E.g., the last time we processed OSCAR we ended up with a 1.2TB jsonl file, which was then converted to a Megatron index using sharding. Please see: https://github.com/bigscience-workshop/bigscience/tree/master/data/oscar
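Roughly, the jsonl-export step that direct dataset support would let us skip looks like this (a simplified sketch, not the bigscience OSCAR tooling; the dataset id, shard count, and field name are placeholders):

import json
from datasets import load_dataset

ds = load_dataset("stas/openwebtext-10k", split="train")  # placeholder dataset

# write sharded jsonl files that Megatron's preprocess_data.py can consume
num_shards = 4
for i in range(num_shards):
    shard = ds.shard(num_shards=num_shards, index=i)
    with open(f"shard_{i:02d}.jsonl", "w", encoding="utf-8") as f:
        for example in shard:
            f.write(json.dumps({"text": example["text"]}) + "\n")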

So perhaps it's best to join the collective effort around the BigScience fork of Megatron and then backport any improvements upstream. It's of course your call.

e.g. I'm gathering improvements and fixes to send upstream here: https://github.com/bigscience-workshop/Megatron-DeepSpeed/issues/10

In open PRs you will find several ongoing efforts to speed up the pre-processing: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pulls

adammoody commented 3 years ago

Great! Thanks for all of the links @stas00. I will take a look and switch over.

adammoody commented 3 years ago

PR for this work: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/48

glennmatlin commented 2 years ago

@stas00 @adammoody How hard would it be to bring the changes from bigscience-workshop/Megatron-DeepSpeed#48 into this repo? Allowing the DeepSpeed examples to use HuggingFace datasets would be great for researchers who constantly need to re-train these kinds of models.

stas00 commented 2 years ago

Ideally all new development should go into https://github.com/microsoft/Megatron-DeepSpeed/ and not DSE, as DSE is very outdated. But I'm not a maintainer of these repos, so it's up to the maintainers to speak up.

But why not use https://github.com/bigscience-workshop/Megatron-DeepSpeed? It's the most cutting-edge version at the moment, with many features that aren't yet in https://github.com/microsoft/Megatron-DeepSpeed/

The progression so far is:

msft/DSE -> msft/Megatron-DeepSpeed -> bigscience-workshop/Megatron-DeepSpeed

E.g., Curriculum Learning is at the moment only available in the latter repo, and we have bitsandbytes integrated as well!

stas00 commented 2 years ago

Specifically, to merge @adammoody's work into https://github.com/microsoft/Megatron-DeepSpeed/, do:

git clone https://github.com/microsoft/Megatron-DeepSpeed/
cd Megatron-DeepSpeed
git remote add other https://github.com/bigscience-workshop/Megatron-DeepSpeed   # add the BigScience fork as a second remote
git fetch other
git cherry-pick 5069622   # the commit with @adammoody's changes
<fix conflicts if any>
git commit
git push

You can try that on DSE as well, but I have no idea if it'd merge easily.

conglongli commented 2 years ago

(A bit off-topic but curriculum learning is actually available in all three repos :) https://github.com/microsoft/DeepSpeedExamples/tree/master/Megatron-LM-v1.1.5-ZeRO3/curriculum_learning, https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples/curriculum_learning, and https://github.com/bigscience-workshop/Megatron-DeepSpeed/tree/main/examples/curriculum_learning. The latter two are basically the same, and the difference between the first two can be found at https://www.deepspeed.ai/tutorials/curriculum-learning/)

conglongli commented 2 years ago

Regarding the relationship between the three repos: on the Microsoft DeepSpeed side we do plan to make https://github.com/microsoft/Megatron-DeepSpeed the only showcase for DeepSpeed-for-Megatron examples, because the hard copies in DeepSpeedExamples are just too hard to maintain. https://github.com/microsoft/DeepSpeedExamples will still be used for all non-Megatron examples. We intend to sync https://github.com/microsoft/Megatron-DeepSpeed and https://github.com/bigscience-workshop/Megatron-DeepSpeed regularly, but right now they are a bit out of sync. Our team's current limited bandwidth makes it hard for us to keep introducing new features to our examples (like https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/48 mentioned above), so we definitely welcome contributions on those :)

stas00 commented 2 years ago

oh, I missed the fact that you added CL to microsoft/Megatron-DeepSpeed - awesome!

Additionally you would want to sync https://github.com/microsoft/Megatron-DeepSpeed with the upstream https://github.com/NVIDIA/Megatron-LM since it's quite out of sync there as well.

conglongli commented 2 years ago

Right, let me raise this TODO with our team.

glennmatlin commented 2 years ago

Thank you for the blazing fast reply, @stas00 @conglongli. Apologies for the confusion: I am specifically curious about training the example BERT model from DeepSpeedExamples (DSE) with HuggingFace datasets. The reason I ask is that the BERT pre-training example currently lacks directions for dataset pre-processing.

From https://www.deepspeed.ai/tutorials/bert-pretraining/

Note: Downloading and pre-processing instructions are coming soon.

Being able to get BERT and the other DSE models working with Hugging Face datasets would unblock lots of people who want to replicate and test DeepSpeed; even just updating the website/code to make that process clear for others would help.

Having a simple way to train BERT with Hugging Face datasets using DeepSpeed is actually very useful for PhD research teams like my own. People often train ‘examples’ like BERT many times for various uses. In my case, our lab is researching massive ensembles of BERT models, which requires a lot of training time for all the networks.

As a result, making the data pre-processing for DSE examples very simple is a great way to encourage DeepSpeed adoption by students/researchers.

Thank you for your contributions!