Closed sharlec closed 1 year ago
MP affects the number of GPUs across which the model is distributed. Since wrapyfi now adjusts the distribution according to the number of spawns (torchrun instances), you always set MP to 1, regardless of the model size variant chosen. To work with the 13B variant or larger, you must first reshard the checkpoint (linked and described in the README).
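A minimal sketch of the logic above, assuming the shard counts of the original public LLaMA checkpoints (7B ships as one `consolidated.*.pth` file, 13B as two, 30B as four, 65B as eight; verify against the files you actually downloaded). The mapping and helper below are illustrative, not part of the wrapyfi codebase:

```python
# Assumed shard counts for the original LLaMA checkpoints; each shard
# corresponds to one model-parallel rank, so loading with MP=1 only
# works out of the box when the checkpoint has exactly one shard.
LLAMA_SHARDS = {"7B": 1, "13B": 2, "30B": 4, "65B": 8}

def needs_reshard(variant: str, target_mp: int = 1) -> bool:
    """Return True if the checkpoint's shard count differs from the
    desired Model Parallel (MP) size and must be resharded first."""
    return LLAMA_SHARDS[variant] != target_mp

if __name__ == "__main__":
    for variant in LLAMA_SHARDS:
        print(f"{variant}: reshard needed for MP=1 -> {needs_reshard(variant)}")
```

This is why 7B runs with MP=1 directly, while 13B and larger must be resharded down to a single shard before wrapyfi can spread the work over multiple torchrun spawns.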
Did you change the Model Parallel (MP) value for the 7B model? I think they used tensor parallelism, which may require modifying the model so that MP matches the number of GPUs.