huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Cannot train quantized model with both model and data parallelism #2832

Closed JubilantJerry closed 2 months ago

JubilantJerry commented 3 months ago

I have a quantized model that is too large to fit on one GPU but does fit on 2 GPUs. I have 4 GPUs, so the most efficient configuration is to replicate the model and run data parallelism with 2 processes that each use 2 GPUs. Naive pipeline parallelism across all 4 GPUs is inefficient: it runs just as fast as with 2 GPUs, and the other 2 GPUs' compute power is wasted. So I attempt to use data parallelism where each process applies model parallelism across its 2 GPUs. However, this does not work because of an explicit error check.

Minimal repro:

from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

model_id = "facebook/opt-350m"
accelerator = Accelerator()

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

d0 = 2 * accelerator.process_index
d1 = d0 + 1
device_map = {
    'model.decoder.embed_tokens': d0, 'lm_head': d0,
    'model.decoder.embed_positions': d0, 'model.decoder.project_out': d1,
    'model.decoder.project_in': d0,
    'model.decoder.layers.0': d0, 'model.decoder.layers.1': d0,
    'model.decoder.layers.2': d0, 'model.decoder.layers.3': d0,
    'model.decoder.layers.4': d0, 'model.decoder.layers.5': d0,
    'model.decoder.layers.6': 1, 'model.decoder.layers.7': d0,
    'model.decoder.layers.8': d0, 'model.decoder.layers.9': d0,
    'model.decoder.layers.10': d0, 'model.decoder.layers.11': d0,
    'model.decoder.layers.12': d1, 'model.decoder.layers.13': d1,
    'model.decoder.layers.14': d1, 'model.decoder.layers.15': d1,
    'model.decoder.layers.16': d1, 'model.decoder.layers.17': d1,
    'model.decoder.layers.18': d1, 'model.decoder.layers.19': d1,
    'model.decoder.layers.20': d1, 'model.decoder.layers.21': d1,
    'model.decoder.layers.22': d1, 'model.decoder.layers.23': d1,
}

print(accelerator.process_index, device_map)

model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device_map, load_in_8bit=True)
model = prepare_model_for_int8_training(model)

model = get_peft_model(model, config)

model = accelerator.prepare(model)
$ torchrun --nproc-per-node 2 repro.py

0 {'model.decoder.embed_tokens': 0, 'lm_head': 0, 'model.decoder.embed_positions': 0, 'model.decoder.project_out': 1, 'model.decoder.project_in': 0, 'model.decoder.layers.0': 0, 'model.decoder.layers.1': 0, 'model.decoder.layers.2': 0, 'model.decoder.layers.3': 0, 'model.decoder.layers.4': 0, 'model.decoder.layers.5': 0, 'model.decoder.layers.6': 1, 'model.decoder.layers.7': 0, 'model.decoder.layers.8': 0, 'model.decoder.layers.9': 0, 'model.decoder.layers.10': 0, 'model.decoder.layers.11': 0, 'model.decoder.layers.12': 1, 'model.decoder.layers.13': 1, 'model.decoder.layers.14': 1, 'model.decoder.layers.15': 1, 'model.decoder.layers.16': 1, 'model.decoder.layers.17': 1, 'model.decoder.layers.18': 1, 'model.decoder.layers.19': 1, 'model.decoder.layers.20': 1, 'model.decoder.layers.21': 1, 'model.decoder.layers.22': 1, 'model.decoder.layers.23': 1}
1 {'model.decoder.embed_tokens': 2, 'lm_head': 2, 'model.decoder.embed_positions': 2, 'model.decoder.project_out': 3, 'model.decoder.project_in': 2, 'model.decoder.layers.0': 2, 'model.decoder.layers.1': 2, 'model.decoder.layers.2': 2, 'model.decoder.layers.3': 2, 'model.decoder.layers.4': 2, 'model.decoder.layers.5': 2, 'model.decoder.layers.6': 1, 'model.decoder.layers.7': 2, 'model.decoder.layers.8': 2, 'model.decoder.layers.9': 2, 'model.decoder.layers.10': 2, 'model.decoder.layers.11': 2, 'model.decoder.layers.12': 3, 'model.decoder.layers.13': 3, 'model.decoder.layers.14': 3, 'model.decoder.layers.15': 3, 'model.decoder.layers.16': 3, 'model.decoder.layers.17': 3, 'model.decoder.layers.18': 3, 'model.decoder.layers.19': 3, 'model.decoder.layers.20': 3, 'model.decoder.layers.21': 3, 'model.decoder.layers.22': 3, 'model.decoder.layers.23': 3}

[rank0]: ValueError: You can't train a model that has been loaded with `device_map='auto'` in any distributed mode. Please rerun your script specifying `--num_processes=1` or by launching with `python {{myscript.py}}`.
[rank1]: ValueError: You can't train a model that has been loaded with `device_map='auto'` in any distributed mode. Please rerun your script specifying `--num_processes=1` or by launching with `python {{myscript.py}}`.

This shows that I cannot use data parallelism if each process loads the model with a multi-GPU device_map.

Thing is, I did not use device_map='auto': I manually crafted the device map to make sure the different ranks use distinct GPUs. My understanding is that device_map='auto' is disallowed because it makes all processes try to use the same set of GPUs, which causes some GPUs to be shared by multiple processes.

What is the reason device_map is not allowed for distributed runs like this? I tried reading #1840, especially the response by @younesbelkada, PR #1523, and the bug report https://github.com/huggingface/peft/issues/269#issuecomment-1498695729. But I don't understand the root cause of the problem: conceptually, it seems plausible that multiple processes can each use multiple GPUs as they see fit and then aggregate the gradients at each backward pass, as DDP requires. Is there a problem specifically with quantization?
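
For reference, plain PyTorch DDP already supports a single process owning a multi-device module (device_ids must be left as None in that case). A minimal sketch of that pattern, with a toy model and hypothetical GPU indices:

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
rank = dist.get_rank()
d0, d1 = 2 * rank, 2 * rank + 1  # two GPUs per process, disjoint across ranks

class TwoDeviceModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(16, 16).to(f"cuda:{d0}")
        self.fc2 = torch.nn.Linear(16, 16).to(f"cuda:{d1}")

    def forward(self, x):
        x = self.fc1(x.to(f"cuda:{d0}"))
        return self.fc2(x.to(f"cuda:{d1}"))

# device_ids/output_device must stay None for a multi-device module
model = DDP(TwoDeviceModel())
out = model(torch.randn(8, 16))
out.sum().backward()  # gradients are all-reduced across ranks as usual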

The error in that comment (peft issue #269), which is given as the motivation for disallowing the combination of device_map and torchrun, is taken out of context. It's hard to see the connection between device maps / distributed runs and the specific error RuntimeError: mat1 and mat2 shapes cannot be multiplied (12000x1 and 2x1280). There was no discussion in that thread of the true root cause of the problem, which makes it hard to understand why this combination is restricted.

If I knew the details of how this combination produces that error, I could determine whether my scenario is actually at risk of a similar bug; and if my scenario does work once the error check is lifted, I could make a PR that detects that scenario and bypasses the check.
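
To make that concrete, here is a rough sketch of the kind of check such a PR could perform (a hypothetical helper, not existing accelerate code): gather every rank's set of GPU indices and confirm they don't overlap before skipping the error.

import torch.distributed as dist

def ranks_use_disjoint_gpus(device_map):
    # Hypothetical helper; assumes the device_map values are integer CUDA indices
    local_gpus = {d for d in device_map.values() if isinstance(d, int)}
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local_gpus)
    # Disjoint iff no GPU index appears in more than one rank's set
    return sum(len(s) for s in gathered) == len(set().union(*gathered))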

SunMarc commented 3 months ago

Hi @JubilantJerry, thanks for the detailed issue.

Thing is, I did not use device_map='auto': I manually crafted the device map to make sure the different ranks use distinct GPUs. My understanding is that device_map='auto' is disallowed because it makes all processes try to use the same set of GPUs, which causes some GPUs to be shared by multiple processes.

That's indeed the case.

What is the reason device_map is not allowed for distributed runs like this? I tried reading https://github.com/huggingface/accelerate/issues/1840, especially the response by @younesbelkada, the PR https://github.com/huggingface/accelerate/pull/1523, and the bug report https://github.com/huggingface/peft/issues/269#issuecomment-1498695729. But I don't understand the root cause of the problem: conceptually, it seems plausible that multiple processes can each use multiple GPUs as they see fit and then aggregate the gradients at each backward pass, as DDP requires. Is there a problem specifically with quantization?

This is not a problem specifically with quantization. It's just that lots of users were trying to combine device_map="auto" + DDP with quantized models, and we had numerous issues filed from that. It would be awesome if you managed to make it work with a custom device_map. Would you like to try? Otherwise, I will have a look later!
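
For what it's worth, one way to build such a custom map per rank without listing every module by hand might be to restrict max_memory to that rank's GPUs and let infer_auto_device_map place the modules. A rough sketch (the memory limits and no-split class are illustrative, and model_id / accelerator are the ones from the repro above):

from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

d0 = 2 * accelerator.process_index
d1 = d0 + 1

# Build an empty (meta-device) copy of the model just to measure module sizes
model_config = AutoConfig.from_pretrained(model_id)
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(model_config)

# Only this rank's two GPUs appear in max_memory, so the resulting map cannot
# spill onto GPUs owned by other ranks
device_map = infer_auto_device_map(
    empty_model,
    max_memory={d0: "10GiB", d1: "10GiB"},
    no_split_module_classes=["OPTDecoderLayer"],
)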

JubilantJerry commented 3 months ago

I'm still trying to understand the problem. In my project I simply commented out the error check in the library, and I seem to get comparable training results with and without DDP, even when each process uses multiple devices. I can upload some loss curves comparing the results later.

But I'm wondering: even with device_map="auto", why would there be a shape-mismatch bug like the one in comment 269? I can see how combining device_map="auto" with DDP probably produces a configuration the user doesn't expect and very slow runtime performance, but even then it shouldn't throw an exception like the one in that comment. I didn't get a similar exception when using my custom device map (I can't easily test device_map="auto" in my project).

I can think of several possibilities:

- If the issue is simply that users don't understand what combining device_map="auto" with DDP does, and the library is only reminding them that their code is probably incorrect, then a patch to support a manual device_map seems easy to make. If it's just a matter of verifying that each process uses a distinct set of GPUs, that patch seems doable too.

- What worries me is the possibility that there is an actual bug in how DDP interacts with multi-GPU models, and that this restriction was added because no one has truly understood and fixed the bug yet.

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.