huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Error with vocabularies when using Megatron-LM #2965

Closed fabiancpl closed 1 month ago

fabiancpl commented 3 months ago

Hi guys,

I followed this guide to pre-train a GPT-2 model using Accelerate with Megatron-LM as the backend. The current version of Megatron-LM is core_r0.7.0, but I decided to use the same version as the guide (core_r0.5.0) to avoid compatibility problems. As recommended in the guide, I used this script to get the full implementation.
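For context, my setup roughly follows the guide; a minimal sketch (the plugin field values below are placeholders for this sketch, not my real config):

import os

from accelerate import Accelerator
from accelerate.utils import MegatronLMPlugin

# Tell Accelerate to use the Megatron-LM integration (normally set by
# `accelerate config` / the launcher rather than in the script itself).
os.environ["ACCELERATE_USE_MEGATRON_LM"] = "true"

# Placeholder parallelism / batching values, just to show the shape of the setup.
megatron_lm_plugin = MegatronLMPlugin(
    tp_degree=1,
    pp_degree=1,
    num_micro_batches=1,
    gradient_clipping=1.0,
)

accelerator = Accelerator(megatron_lm_plugin=megatron_lm_plugin)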

For a reason I don't understand, Megatron-LM requires the vocabulary files (vocab_file.json, merge_file.txt) to be passed explicitly, and the only way I found to do this was to modify the accelerator.py module directly, hard-coding the file paths before the call to megatron_lm_initialize. Something like this:

# initialize megatron-lm
megatron_lm_default_args = megatron_lm_plugin.megatron_lm_default_args
# Override the plugin defaults with the tokenizer files Megatron-LM expects
megatron_lm_default_args = {
    **megatron_lm_default_args,
    "vocab_file": "./vocabs/gpt2-vocab.json",
    "merge_file": "./vocabs/gpt2-merges.txt",
}
megatron_lm_plugin.megatron_lm_default_args = megatron_lm_default_args

megatron_lm_initialize(self, args_defaults=megatron_lm_plugin.megatron_lm_default_args)

Does this scenario make sense to you? What would be a clean way to do this from the main script?

Thanks.

Versions of relevant libraries:

accelerate==0.33.0
datasets==2.20.0
megatron_core==0.5.0
transformer_engine==1.8.0+3ec998e
transformers==4.43.2
flash-attn==2.6.3
torch==2.4.0

Expected behavior

vocab_file and merge_file should be passable as arguments to the main script, or taken directly from the pretrained tokenizer.
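Hypothetically, something like the sketch below is what I have in mind, assuming MegatronLMPlugin exposes an other_megatron_args dict that gets merged into megatron_lm_default_args (I haven't verified that this actually reaches megatron_lm_initialize):

import argparse

from accelerate import Accelerator
from accelerate.utils import MegatronLMPlugin

parser = argparse.ArgumentParser()
parser.add_argument("--vocab_file", default="./vocabs/gpt2-vocab.json")
parser.add_argument("--merge_file", default="./vocabs/gpt2-merges.txt")
args = parser.parse_args()

# Forward the tokenizer files through the plugin instead of patching
# accelerator.py; this relies on other_megatron_args being merged into the
# default args that megatron_lm_initialize receives.
megatron_lm_plugin = MegatronLMPlugin(
    other_megatron_args={
        "vocab_file": args.vocab_file,
        "merge_file": args.merge_file,
    }
)
accelerator = Accelerator(megatron_lm_plugin=megatron_lm_plugin)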

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.