tunib-ai / parallelformers

Parallelformers: An Efficient Model Parallelization Toolkit for Deployment
https://tunib-ai.github.io/parallelformers
Apache License 2.0

AssertionError: Model should be on CPU before parallelization. It is more memory-efficient. #16

Closed: juliensalinas closed this issue 2 years ago

juliensalinas commented 2 years ago

Hello, first of all, congratulations on this amazing project. It's simple, efficient, and versatile. Very useful.

In some cases one has several GPUs but not enough RAM to parallelize the model. When I load the model on GPU and then parallelize it, I get the following error: AssertionError: Model should be on CPU before parallelization. It is more memory-efficient.

It doesn't stop the script, but it seems that the parallelization fails.

My question is: is it possible to load the initial model on GPU instead of CPU (even if it's not memory-efficient), or is that not possible at all?

Thanks!

hyunwoongko commented 2 years ago

Yes, it's possible, but I'm curious about your script. Could you share your code for that case?

juliensalinas commented 2 years ago

Thanks for your quick reply.

Here is what I'm doing:

from parallelformers import parallelize

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B").cuda()
parallelize(model, num_gpus=2, fp16=True, verbose='detail')

It returns AssertionError: Model should be on CPU before parallelization. It is more memory-efficient. and then it hangs forever.

hyunwoongko commented 2 years ago

Your code works like this:

step 1 -> AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")
# load full model to cpu

step 2 -> .cuda()
# move full model to gpu:0

step 3 -> parallelize(model, num_gpus=2, fp16=True, verbose='detail')
# split model in gpu:0 into 2 pieces and move to each gpu (0, 1).

I think step 2 is not a good choice, because if you can move the full model to one GPU, parallelism is not needed, right? So I recommend something like the following.

from parallelformers import parallelize

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")
parallelize(model, num_gpus=2, fp16=True, verbose='detail')

Then the code works like this:

step 1 -> AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")
# load full model to cpu

step 2 -> parallelize(model, num_gpus=2, fp16=True, verbose='detail')
# split model in cpu into 2 pieces and move to each gpu (0, 1).

How about this way? :)

hyunwoongko commented 2 years ago

That's why I added that assertion there: I wanted to prevent mistakes by people unfamiliar with torch.

hyunwoongko commented 2 years ago

How about using Transformers' low_cpu_mem_usage if you run out of CPU memory? The trade-off is that loading is slower. I recommend the following code; I think it's the best approach when CPU memory is limited.

from parallelformers import parallelize

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B", low_cpu_mem_usage=True)
parallelize(model, num_gpus=2, fp16=True, verbose='detail')

hyunwoongko commented 2 years ago

My question is: is it possible to load the initial model on GPU instead of CPU (even if it's not memory-efficient) or not at all?

We have discussed this problem several times. Here are those discussions: https://github.com/pytorch/pytorch/issues/64327 and https://github.com/huggingface/transformers/issues/13548

This issue needs to be solved on the PyTorch side, not the Transformers side. :(

On the other hand, on the DeepSpeed side, there is code designed so that the partitioned model can be loaded directly onto the GPUs (deepspeed.zero.Init). I don't know much about the internal implementation, but it would be good to refer to it.

https://deepspeed.readthedocs.io/en/latest/zero3.html#constructing-massive-models
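
For reference, a minimal sketch of that pattern from the linked DeepSpeed docs (this is DeepSpeed's API, not parallelformers; the layer sizes are illustrative, and the snippet is meant to be run under a deepspeed/torch.distributed launch):

import deepspeed
import torch.nn as nn

# Parameters created inside zero.Init() are partitioned across the
# data-parallel ranks as they are allocated (ZeRO stage 3), so the full
# model never has to materialize on a single CPU or GPU.
with deepspeed.zero.Init():
    model = nn.Sequential(*[nn.Linear(8192, 8192) for _ in range(48)])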

juliensalinas commented 2 years ago

Thanks @hyunwoongko. I totally understand the above. The code I showed you was a simplification because I didn't want to waste your time, but here is my actual use case:

I have a GPT-J model. Since RAM usage stays high when loading the model, even with the low_cpu_mem_usage=True option, the trick I'm using is the following. First I save the model to disk:

import torch
from transformers import GPTNeoForCausalLM

generator = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
generator.half().cuda()
torch.save(generator, "gpt_j.pt")

Then I'm loading it in my web server:

generator = torch.load('gpt_j.pt')
generator.cuda()

This is a hack, but thanks to this trick I can load the model onto the GPU directly without using (almost) any RAM.

Now I want to do this:

generator = torch.load('gpt_j.pt')
generator.cuda()
parallelize(generator, num_gpus=2, fp16=True, verbose='detail')

The reason is that this model fits on a single GPU under normal use, but it quickly needs 2 GPUs under heavy usage (for example, with a large text input).

Please let me know if unclear!

Thanks ;)

juliensalinas commented 2 years ago

And thanks for the insights about Deepspeed and the ongoing discussions with Pytorch and Transformers. Very useful.

hyunwoongko commented 2 years ago

How about using revision?

>>> from transformers import GPTJForCausalLM
>>> import torch

>>> model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True)

hyunwoongko commented 2 years ago

Don't move a model that hasn't been parallelized to the GPU. Why do you move the full model onto the GPU?

hyunwoongko commented 2 years ago

generator = torch.load('gpt_j.pt')
generator.cuda()
parallelize(model, num_gpus=2, fp16=True, verbose='detail')

Where is model defined?

juliensalinas commented 2 years ago

Sorry, that was a typo; I just edited my example.

hyunwoongko commented 2 years ago

How do you save the model using torch.save? Is that possible? In general we save the state dict with torch.save, no?

juliensalinas commented 2 years ago

Using revision works pretty well:

model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", revision="float16", torch_dtype=torch.float16, low_cpu_mem_usage=True)

But it still takes 12 GB of RAM before loading onto the GPU.

juliensalinas commented 2 years ago

Yes, torch.save() works well in my case. My understanding is that it's basically pickling the whole model.
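
For clarity, the two torch.save patterns being contrasted here, as a generic sketch (the from_pretrained call just stands in for any way of building the model, e.g. the GPT-J instance above):

import torch
from transformers import GPTJForCausalLM

model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.float16)

# Whole-module save: pickles the entire nn.Module (class reference + weights).
# This is what makes the "load straight to GPU" trick above possible, but it
# breaks if the module holds unpicklable objects such as lambdas.
torch.save(model, "gpt_j.pt")
model = torch.load("gpt_j.pt")

# The more common pattern: save only the weights (state dict) and rebuild the
# architecture separately before loading them back.
torch.save(model.state_dict(), "gpt_j_state.pt")
model.load_state_dict(torch.load("gpt_j_state.pt"))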

hyunwoongko commented 2 years ago

Oh, it works. (I didn't know about that.)

hyunwoongko commented 2 years ago

Then it will be possible to load the model directly onto the GPU. I'll remove the assertion for this case. Thanks.

hyunwoongko commented 2 years ago

I updated it! Please upgrade the library using pip install parallelformers --upgrade. https://github.com/tunib-ai/parallelformers/releases/tag/v1.2

juliensalinas commented 2 years ago

Testing it right now!

juliensalinas commented 2 years ago

It works great! Thanks for the quick addition! 🥇

Thanks again for the great work, that's very useful.

juliensalinas commented 2 years ago

@hyunwoongko I now realize that the above works on 1 GPU (num_gpus=1) but not on multiple GPUs. I'm getting the following error when setting num_gpus=2:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument index in method wrapper__index_select)

Do you have an idea why?

Thanks!

hyunwoongko commented 2 years ago

Hello @juliensalinas. What do you want to do?

juliensalinas commented 2 years ago

Hello @hyunwoongko thanks for your response.

I'm trying to load GPT-J in fp16 on a Tesla T4 GPU, and then parallelize so the model is split across 2 Tesla T4 GPUs.

The solution you implemented in v1.2 worked for 1 GPU, but when using it on 2 GPUs I'm getting the error found at least two devices, cuda:0 and cuda:1! Maybe it's harder than expected, and parallelformers can only be used if the model is initially loaded on the CPU, not a GPU?

Thanks @hyunwoongko !

hyunwoongko commented 2 years ago

How about this?

juliensalinas commented 2 years ago

Let me try that

juliensalinas commented 2 years ago

Hmmm, unfortunately, torch.save(model) returns AttributeError: Can't pickle local object 'parallelize.register_hijack_methods.<locals>.<lambda>' ...

hyunwoongko commented 2 years ago

Hmm... we didn't originally design parallelformers to be used like that.

hyunwoongko commented 2 years ago

Maybe the lambda function would work if you pickled it with dill, but torch.save probably isn't designed that way, so I think that's hard. How about using the model.parallelize() function?

Transformers already supports model parallelism without parallelformers. (However, very few models are supported: T5, GPT2, GPTJ.)
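
For reference, a rough sketch of that built-in Transformers approach, assuming a transformers version where GPTJForCausalLM.parallelize() is available (the even 14/14 block split below is just an illustrative choice, not something from this thread):

import torch
from transformers import AutoTokenizer, GPTJForCausalLM

model = GPTJForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    revision="float16",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)

# Transformers' naive model parallelism: map each GPU to a list of transformer
# block indices. GPT-J has 28 blocks, split evenly across two GPUs here.
device_map = {0: list(range(0, 14)), 1: list(range(14, 28))}
model.parallelize(device_map)

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda:0")  # inputs go on the first device
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))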

juliensalinas commented 2 years ago

Thanks a lot @hyunwoongko. I will try the above and close this issue. I think my request goes beyond the scope of parallelformers. Thanks again!