huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Error with vocabularies when using Megatron-LM #2965

Closed fabiancpl closed 1 month ago

fabiancpl commented 3 months ago

Hi guys,

I followed this guide to pre-train a GPT-2 model using Accelerate with Megatron-LM as the backend. The current version of Megatron-LM is core_r0.7.0, but I decided to use the same version as the guide (core_r0.5.0) to avoid compatibility problems. As recommended in the guide, I used this script to get the full implementation.
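For context, my setup roughly follows the guide; a minimal sketch (the plugin field values below are placeholders for this sketch, not my real config):

import os

from accelerate import Accelerator
from accelerate.utils import MegatronLMPlugin

# Tell Accelerate to use the Megatron-LM integration (normally set by
# `accelerate config` / the launcher rather than in the script itself).
os.environ["ACCELERATE_USE_MEGATRON_LM"] = "true"

# Placeholder parallelism / batching values, just to show the shape of the setup.
megatron_lm_plugin = MegatronLMPlugin(
    tp_degree=1,
    pp_degree=1,
    num_micro_batches=1,
    gradient_clipping=1.0,
)

accelerator = Accelerator(megatron_lm_plugin=megatron_lm_plugin)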

For a reason I don't understand, Megatron-LM requires the vocabulary files (vocab_file.json, merge_file.txt) to be passed explicitly, and the only way I found to do this was to modify the accelerator.py module directly, hard-coding the file paths before the call to megatron_lm_initialize. Something like this:

# initialize megatron-lm
megatron_lm_default_args = megatron_lm_plugin.megatron_lm_default_args
# Override the plugin defaults with the tokenizer files Megatron-LM expects
megatron_lm_default_args = {
    **megatron_lm_default_args,
    "vocab_file": "./vocabs/gpt2-vocab.json",
    "merge_file": "./vocabs/gpt2-merges.txt",
}
megatron_lm_plugin.megatron_lm_default_args = megatron_lm_default_args

megatron_lm_initialize(self, args_defaults=megatron_lm_plugin.megatron_lm_default_args)

Does this scenario make sense to you? What would be a clean way to do this from the main script?

Thanks.

Versions of relevant libraries:

accelerate==0.33.0
datasets==2.20.0
megatron_core==0.5.0
transformer_engine==1.8.0+3ec998e
transformers==4.43.2
flash-attn==2.6.3
torch==2.4.0

Expected behavior

vocab_file and merge_file should be passable as arguments to the main script, or taken directly from the pretrained tokenizer.
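Hypothetically, something like the sketch below is what I have in mind, assuming MegatronLMPlugin exposes an other_megatron_args dict that gets merged into megatron_lm_default_args (I haven't verified that this actually reaches megatron_lm_initialize):

import argparse

from accelerate import Accelerator
from accelerate.utils import MegatronLMPlugin

parser = argparse.ArgumentParser()
parser.add_argument("--vocab_file", default="./vocabs/gpt2-vocab.json")
parser.add_argument("--merge_file", default="./vocabs/gpt2-merges.txt")
args = parser.parse_args()

# Forward the tokenizer files through the plugin instead of patching
# accelerator.py; this relies on other_megatron_args being merged into the
# default args that megatron_lm_initialize receives.
megatron_lm_plugin = MegatronLMPlugin(
    other_megatron_args={
        "vocab_file": args.vocab_file,
        "merge_file": args.merge_file,
    }
)
accelerator = Accelerator(megatron_lm_plugin=megatron_lm_plugin)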

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.