inseq-team / inseq

Interpretability for sequence generation models 🐛 🔍
https://inseq.org
Apache License 2.0

Can I load the inseq model onto multiple GPUs? #263

Closed: frankdarkluo closed this issue 7 months ago

frankdarkluo commented 7 months ago

Question

When I load inseq on top of an LLM that is sharded across two GPUs, as shown below:

import torch
import inseq
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_name,  # model name/path defined earlier in the script
    device_map='auto',
    # load_in_8bit=True,
    cache_dir='/mnt/nvme/xxx/',
    torch_dtype=torch.float16,
)
qa_model = inseq.load_model(model, "attention", tokenizer=model_name, tokenizer_kwargs={"legacy": False})
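
For reference, the placement chosen by device_map='auto' can be inspected through the hf_device_map attribute that transformers sets on the model; it should show the layers split between the two GPUs (the mapping below is illustrative):

print(model.hf_device_map)
# e.g. {'model.embed_tokens': 0, 'model.layers.0': 0, ..., 'model.layers.39': 1, 'lm_head': 1}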

Then I get the following OOM error:

Traceback (most recent call last):
  File "/mnt/nvme/xxx/inseq/examples/main.py", line 90, in <module>
    qa_model = inseq.load_model(model, "attention", tokenizer=model_name, tokenizer_kwargs={"legacy": False})
  File "/opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/inseq/models/__init__.py", line 47, in load_model
    return FRAMEWORKS_MAP[framework].load(model, attribution_method, **kwargs)
  File "/opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/inseq/models/huggingface_model.py", line 154, in load
    return HuggingfaceDecoderOnlyModel(
  File "/opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/inseq/models/huggingface_model.py", line 482, in __init__
    super().__init__(model, attribution_method, tokenizer, device, model_kwargs, tokenizer_kwargs, **kwargs)
  File "/opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/inseq/models/huggingface_model.py", line 132, in __init__
    self.setup(device, attribution_method, **kwargs)
  File "/opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/inseq/models/attribution_model.py", line 241, in setup
    self.device = device if device is not None else get_default_device()
  File "/opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1747, in __setattr__
    super().__setattr__(name, value)
  File "/opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/inseq/models/huggingface_model.py", line 169, in device
    self.model.to(self._device)
  File "/opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/accelerate/big_modeling.py", line 454, in wrapper
    return fn(*args, **kwargs)
  File "/opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2556, in to
    return super().to(*args, **kwargs)
  File "/opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1152, in to
    return self._apply(convert)
  File "/opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/torch/nn/modules/module.py", line 825, in _apply
    param_applied = fn(param)
  File "/opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1150, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 86.00 MiB. GPU 0 has a total capacity of 47.54 GiB of which 40.81 MiB is free. Including non-PyTorch memory, this process has 47.49 GiB memory in use. Of the allocated memory 46.44 GiB is allocated by PyTorch, and 280.06 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

How can I solve this problem?

Is there a tutorial or a link to a possible solution?

gsarti commented 7 months ago

Hi! Thanks for reporting this. Indeed, it seems that multi-GPU usage breaks when quantization is not specified, due to a bad device cast. Could you try checking out the branch from PR #264 to see if it works for you?
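
If it helps, the branch should be installable directly with pip install "git+https://github.com/inseq-team/inseq.git@fix-device-map-multigpu" (assuming fix-device-map-multigpu is the branch behind PR #264).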

frankdarkluo commented 7 months ago

> Hi! Thanks for reporting this. Indeed, it seems that multi-GPU usage breaks when quantization is not specified, due to a bad device cast. Could you try checking out the branch from PR #264 to see if it works for you?

Thanks! I tried the branch from PR #264 with the same code as above, and now I get this error:

WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.
⠋ Loading model with attention method...
WARNING:accelerate.big_modeling:You shouldn't move a model that is dispatched using accelerate hooks.
Traceback (most recent call last):
  File "/home/gluo/inseq/examples/main.py", line 90, in <module>
    qa_model = inseq.load_model(model, "attention", tokenizer=model_name, tokenizer_kwargs={"legacy": False})
  File "/opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/inseq/models/__init__.py", line 47, in load_model
    return FRAMEWORKS_MAP[framework].load(model, attribution_method, **kwargs)
  File "/opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/inseq/models/huggingface_model.py", line 154, in load
    return HuggingfaceDecoderOnlyModel(
  File "/opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/inseq/models/huggingface_model.py", line 482, in __init__
    super().__init__(model, attribution_method, tokenizer, device, model_kwargs, tokenizer_kwargs, **kwargs)
  File "/opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/inseq/models/huggingface_model.py", line 132, in __init__
    self.setup(device, attribution_method, **kwargs)
  File "/opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/inseq/models/attribution_model.py", line 241, in setup
    self.device = device if device is not None else get_default_device()
  File "/opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1747, in __setattr__
    super().__setattr__(name, value)
  File "/opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/inseq/models/huggingface_model.py", line 169, in device
    self.model.to(self._device)
  File "/opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/accelerate/big_modeling.py", line 453, in wrapper
    raise RuntimeError("You can't move a model that has some modules offloaded to cpu or disk.")
RuntimeError: You can't move a model that has some modules offloaded to cpu or disk.

I think the problem is caused by the call to `self.model.to(self._device)`?
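
As a side note, the first warning says some parameters were offloaded to the CPU, which is presumably what makes accelerate raise here. A sketch of a workaround I could try on my side (the max_memory values are illustrative for 48 GiB cards, and I'm assuming that restricting max_memory to the GPUs keeps accelerate from offloading to CPU):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_name,  # same checkpoint as above
    device_map='auto',
    # budget only the two GPUs, so accelerate places every module on
    # GPU 0 or GPU 1 instead of offloading part of the model to CPU
    max_memory={0: '46GiB', 1: '46GiB'},
    torch_dtype=torch.float16,
)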

frankdarkluo commented 7 months ago

Is it possible that a check should be added here? https://github.com/inseq-team/inseq/blob/fix-device-map-multigpu/inseq/models/huggingface_model.py#L170-L172

gsarti commented 7 months ago

Hey @frankdarkluo, you're right, the setter in question is the one in the HuggingfaceModel class. I applied a fix that should skip the move-to-GPU operation when a device map is specified. Could you try it again?
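
Roughly, the idea is the following (a minimal sketch with hypothetical names, not the exact committed code; the hf_device_map check stands in for however the device map is detected):

class HuggingfaceModelSketch:
    """Illustrates the guarded device setter."""

    def __init__(self, model):
        self.model = model
        self._device = None

    @property
    def device(self):
        return self._device

    @device.setter
    def device(self, new_device):
        self._device = new_device
        # A model loaded with device_map is dispatched by accelerate hooks
        # (transformers sets model.hf_device_map), so moving it wholesale
        # with .to() would fail; only single-device models are moved.
        if self.model is not None and getattr(self.model, "hf_device_map", None) is None:
            self.model.to(self._device)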

frankdarkluo commented 7 months ago

Thanks! I think it is working now!