huggingface / transformers-bloom-inference

Fast Inference Solutions for BLOOM

Distributed Training using the same loading method #61

Closed ananda1996ai closed 1 year ago

ananda1996ai commented 1 year ago

I tried to use the same model loading method as in the bloom-accelerate-inference.py script, and then, instead of calling the generate function, added a Trainer with data loaders to train a few layers of the model (the others were frozen). I set the local_rank argument in TrainingArguments and also set trainer.is_model_parallel to True.
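Roughly, the setup described above looks like the following sketch (the checkpoint name, dummy dataset, and choice of frozen layers are illustrative assumptions; the loading call mirrors the device_map approach of bloom-accelerate-inference.py):

```python
import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model_name = "bigscience/bloom-7b1"  # assumption: any BLOOM checkpoint

# bloom-accelerate-inference.py shards the checkpoint across the available
# GPUs via accelerate's device_map; the same call is reused here for training.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Freeze everything except a few layers (here, the last transformer block).
for param in model.parameters():
    param.requires_grad = False
for param in model.transformer.h[-1].parameters():
    param.requires_grad = True

# Tiny dummy dataset so the sketch is self-contained.
train_dataset = Dataset.from_dict(
    {"input_ids": [[1, 2, 3, 4]], "labels": [[1, 2, 3, 4]]}
)

training_args = TrainingArguments(
    output_dir="bloom-finetune",
    per_device_train_batch_size=1,
    local_rank=-1,  # the issue sets local_rank explicitly; -1 is a placeholder here
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.is_model_parallel = True  # manually forced, as described above
trainer.train()  # fails with the device-mismatch RuntimeError shown below
```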

I got the following error:

File "/----/Anandamoy/anaconda3/envs/my_env/lib/python3.8/site-packages/torch/nn/functional.py", line 2503, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument weight in method wrapper__native_layer_norm)

Could you please suggest what I might be doing wrong and what would be the correct way to use the loaded distributed model for training/finetuning?

mayank31398 commented 1 year ago

Hi @ananda1996ai, can you tell me a bit more? What model do you want to train, what is your system config, how many GPUs, etc.? Also, the way you are using accelerate for training is wrong; it won't work that way.

I would also suggest using DeepSpeed directly for training (it's a bit complicated, but not too much), or using the accelerate wrapper with the DeepSpeed backend. Tutorial here: https://huggingface.co/docs/accelerate/usage_guides/deepspeed
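For reference, a minimal sketch of the accelerate + DeepSpeed route from the linked tutorial might look like this (the checkpoint name, dummy data, and hyperparameters are placeholder assumptions, not code from this repo):

```python
import torch
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForCausalLM

# ZeRO-3 shards parameters, gradients, and optimizer states across GPUs,
# instead of placing whole layers on different devices as device_map does.
ds_plugin = DeepSpeedPlugin(zero_stage=3, gradient_accumulation_steps=1)
accelerator = Accelerator(deepspeed_plugin=ds_plugin, mixed_precision="bf16")

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")  # assumed checkpoint
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Dummy tokenized batch; replace with a real dataset and collator.
input_ids = torch.randint(0, 1000, (8, 16))
loader = DataLoader(TensorDataset(input_ids), batch_size=1)

model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

model.train()
for (batch,) in loader:
    loss = model(input_ids=batch, labels=batch).loss
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
```

In practice this would be launched with something like `accelerate launch --num_processes <n_gpus> train.py` (or after configuring DeepSpeed via `accelerate config`), so that each GPU holds a ZeRO shard of the parameters rather than a slice of the layers.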

mayank31398 commented 1 year ago

closing this :)