huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

[performance] from_pretrained is still much slower than torch.load and seems to be initializing weights #21913

Closed moyix closed 1 year ago

moyix commented 1 year ago

Who can help?

@stas00, @patrickvonplaten

Reproduction

Loading a model with from_pretrained takes much longer than the underlying torch.load. For example, for the Salesforce/codegen-6B-mono model, CodeGenForCausalLM.from_pretrained('Salesforce/codegen-6B-mono') takes ~38 seconds, whereas torch.load() on its pytorch_model.bin takes just ~5.4 seconds. This is very similar to #9205, but is happening with the latest transformers from pip (4.26.1), so possibly a regression?

Short repro:

import time
import torch
from transformers import CodeGenForCausalLM
t1 = time.time()
CodeGenForCausalLM.from_pretrained('Salesforce/codegen-6B-mono')
t2 = time.time()
print("Load took", t2-t1, "seconds")

Prints Load took 37.78910255432129 seconds

import time
import torch
from transformers.utils import cached_file

t1 = time.time()
torch.load(cached_file('Salesforce/codegen-6B-mono', 'pytorch_model.bin'))
t2 = time.time()
print("Load took", t2 - t1, "seconds")

Prints Load took 5.443041801452637 seconds

Based on profiling the HF from_pretrained script, it seems like ~75% of the time is being spent doing random initialization of weights that are about to be overwritten. This is the same problem that was fixed in PR #11471 so I'm not sure what's going on here.

Here's the cProfile output and output from gprof2dot: loadmodel_profile.txt hf_loadmodel_new.pdf
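
For reference, a profile along these lines can be produced with cProfile and then rendered with gprof2dot; this is only a sketch, not necessarily the exact commands used for the attachments, and the output file name is illustrative:

import cProfile
import pstats

from transformers import CodeGenForCausalLM

# Profile the load and dump the raw stats; a PDF like the attached one can be
# rendered from such stats with gprof2dot.
cProfile.run(
    "CodeGenForCausalLM.from_pretrained('Salesforce/codegen-6B-mono')",
    "loadmodel_profile.out",
)
pstats.Stats("loadmodel_profile.out").sort_stats("cumulative").print_stats(25)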

Expected behavior

from_pretrained should skip weight initialization when loading a pretrained model.

stas00 commented 1 year ago

Thank you for analysing this, @moyix, and for wanting to make things faster.

I dug into it and here is what I have to share with you.

What's happening for real

It's pretty clear from your profiler report that the difference comes from the weight init, which, as you said, gets overwritten by the pretrained weights shortly afterwards.

Indeed, that is what's happening here, except you are mixing up two things.

As you discovered, lazy model init was implemented in https://github.com/huggingface/transformers/pull/11471 and was later improved upon in multiple PRs. However, this was done only for the _init_weights functions defined in the modeling code of transformers.

Now you're forgetting about calls like

https://github.com/huggingface/transformers/blob/37e0974afcbccdc85da59d51b44e1437b6b3caea/src/transformers/models/codegen/modeling_codegen.py#L117-L119

which of course by default call their init functions:

  File "/mnt/nvme0/code/huggingface/transformers-master/src/transformers/models/codegen/modeling_codegen.py", line 117, in __init__
    self.qkv_proj = nn.Linear(self.embed_dim, self.embed_dim * 3, bias=False)
  File "/home/stas/anaconda3/envs/py38-pt113/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 101, in __init__
    self.reset_parameters()
  File "/home/stas/anaconda3/envs/py38-pt113/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 107, in reset_parameters
    init.kaiming_uniform_(self.weight, a=math.sqrt(5))
  File "/home/stas/anaconda3/envs/py38-pt113/lib/python3.8/site-packages/torch/nn/init.py", line 396, in kaiming_uniform_

So all of that overhead comes from PyTorch's nn.Module submodules and not from the _init_weights functions defined in the modeling code of transformers.

You're loading a huge 14GB model, and initializing it inevitably adds some 30 seconds.

The problem is that you're comparing loading the weights alone against instantiating the model plus loading the weights, so of course they aren't the same thing. But we agree that it's a pointless waste of compute and time to init weights that are going to be overwritten moments later.

To test I changed pytorch's kaiming_uniform_ to be:

def kaiming_uniform_(
    tensor: Tensor, a: float = 0, mode: str = 'fan_in', nonlinearity: str = 'leaky_relu'
):
    return tensor

and did the same for uniform_, and from_pretrained was as fast as you wanted it to be.

Hint: perhaps you can use this as a hack until a better solution is provided: simply monkey-patch the init functions with a no-op (I hope I covered the ones that are used here).

from transformers import CodeGenForCausalLM
import torch.nn.init

# Monkey-patch the init functions used by this model's submodules with no-ops so
# that from_pretrained doesn't waste time on values that get overwritten anyway.
torch.nn.init.kaiming_uniform_ = lambda x, *args, **kwargs: x
torch.nn.init.uniform_ = lambda x, *args, **kwargs: x

CodeGenForCausalLM.from_pretrained('Salesforce/codegen-6B-mono')

Of course, I assume you are either doing inference or you have all the weights in the distributed checkpoint, so no important init is skipped.

This should give you a speed much closer to torch.load.

What can be done

But, you may ask, why can't you just skip those inits?

We actually can, since pytorch-1.10, where special functionality was added for exactly this.

Looking at the requirements, it actually appears to be feasible despite needing to support pytorch<1.10 as well.

The modules will have to be adapted to meet two requirements (https://pytorch.org/tutorials/prototype/skip_param_init.html#updating-modules-to-support-skipping-initialization), which I will repaste here:

  1. The module must accept a device kwarg in its constructor that is passed to any parameters or buffers created during construction.
  2. The module must not perform any computation on parameters or buffers in its constructor except initialization (i.e. functions from torch.nn.init).

The first one is certainly possible since doing:

-  def __init__(self, foo, bar):
+  def __init__(self, foo, bar, device=None):

should be backward compatible.
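
As an illustration of requirement 1 (ToyAttention below is a made-up module, not an existing transformers class), the device kwarg just needs to be threaded through to every submodule and parameter created in the constructor:

import torch.nn as nn

class ToyAttention(nn.Module):
    # Hypothetical module, for illustration only.
    def __init__(self, embed_dim: int, device=None):
        super().__init__()
        # Requirement 1: forward the device kwarg to every parameter/buffer/submodule
        # created here, so construction can happen directly on 'meta', 'cuda', etc.
        self.qkv_proj = nn.Linear(embed_dim, embed_dim * 3, bias=False, device=device)
        self.out_proj = nn.Linear(embed_dim, embed_dim, bias=False, device=device)
        # Requirement 2: no computation on parameters/buffers in the constructor
        # beyond calls to functions from torch.nn.init.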

I think the 2nd requirement should be somewhat possible, but I can't speak for the multitude of models we have.

Once this is done, the rest of from_pretrained will need to be adapted to use the device argument, as in the tutorial's example:

m = nn.Linear(10, 5, device='meta')

but of course in our case it will be m = ModelName(..., device='meta').
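
For reference, the same tutorial also exposes torch.nn.utils.skip_init, which wraps exactly this pattern; a small sketch, using nn.Linear purely as a stand-in:

import torch
from torch.nn.utils import skip_init

# skip_init (PyTorch >= 1.10) constructs the module with device='meta' and then
# materializes uninitialized storage on the target device, so no init kernels run.
# It only works for modules that satisfy the two requirements above.
m = skip_init(torch.nn.Linear, 10, 5)
print(m.weight.shape)  # parameters exist but hold arbitrary, uninitialized values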

I think this needs to happen sooner rather than later, as it would greatly simplify the various juggling we do during the loading process (e.g. the low_cpu_mem_usage functionality), once all the models are updated. But needing to support torch<1.10 might make this somewhat messy; I'm not sure.

So let me now bring in @sgugger and @patrickvonplaten to take over, as I'm currently working on a different project. They can decide whether the project is ready for this major change or not quite yet; in the meantime you can use my hack ;)

p.s. While studying your report I invalidated your suggestion that there was a general from_pretrained regression; to do that I had to use a different class, since CodeGenForCausalLM was added only recently. I went all the way back to transformers==4.14 and t5-large loads at the same speed as with the latest version.

edit: additional solutions are added in the comments below.

stas00 commented 1 year ago

I'm curious, are you doing inference or finetuning? Because for the latter the init overhead is usually irrelevant.

Fast loading is also important for debugging.

I think I'm going to propose this new feature to pytorch:

with torch.inference:
    m = MyModel(...)

and it would just work and be really fast, without the overhead of init'ing weights that will be overwritten by the pretrained weights anyway.

moyix commented 1 year ago

Thanks for the very comprehensive answer! That makes perfect sense :) I am indeed doing inference and trying to get the batch size right, so having to wait a long time for the model to load on each attempt (only to get a CUDA out-of-memory error) was a bit painful.

That hack helps a lot for now, thanks!

sgugger commented 1 year ago

Using low_cpu_mem_usage=True will initialize the model on the meta device (this requires Accelerate as an extra dependency) and should speed up initialization as a result. This will become the default mid-term, but we first need some more preparation work to make the tests for from_pretrained more robust, so we are sure we absolutely don't break anything.
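
For reference, a minimal usage sketch with the checkpoint from this issue (this assumes the accelerate package is installed):

from transformers import CodeGenForCausalLM

# low_cpu_mem_usage=True avoids materializing and initializing the full model up
# front; the pretrained tensors are filled in as the checkpoint is loaded.
model = CodeGenForCausalLM.from_pretrained(
    'Salesforce/codegen-6B-mono',
    low_cpu_mem_usage=True,
)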

stas00 commented 1 year ago

Some additional solutions, coming from the pytorch slack where I asked this question:

  1. Install pytorch-nightly following the instructions at https://pytorch.org/get-started/locally/ (or, if you read this later once pytorch==2.0 has been released, any version >= 2.0 will do).

now you can do:

    with torch.device("cuda"):
        model = CodeGenForCausalLM.from_pretrained('Salesforce/codegen-6B-mono')

so it instantiates the model directly on your GPU and all the inits run much faster. This solution is just a bit slower than cancelling out the init functions. Plus, your model will already be on the GPU, so there's no copying overhead from CPU.

Instead of using the context manager you can just set the default device like so:

torch.set_default_device('cuda')

and you no longer need to indent your existing code.
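
A short sketch of that variant (assuming PyTorch >= 2.0, where torch.set_default_device is available):

import torch
from transformers import CodeGenForCausalLM

# Equivalent to the context-manager version above, without any indentation changes.
torch.set_default_device('cuda')
model = CodeGenForCausalLM.from_pretrained('Salesforce/codegen-6B-mono')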

1b. Using materialization on the meta device will be really fast, as it cancels out the init functions and doesn't even spend time allocating memory for the weights:

    with torch.device("meta"):
        model = CodeGenForCausalLM.from_pretrained('Salesforce/codegen-6B-mono')

but the resulting model isn't usable right away and requires additional manipulations to materialize it on the target device with the preloaded weights. This will most likely have to be done by transformers, unless pytorch comes up with a magical method that users could call themselves.

credits: @alband and @stephenroller
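
To make the idea concrete, here is a rough sketch of that pattern in plain PyTorch, using nn.Linear as a stand-in; this is not what transformers does internally:

import torch
import torch.nn as nn

# Construct on 'meta' (no allocation, no init), then materialize uninitialized
# storage on the target device and copy saved weights over it.
with torch.device("meta"):
    layer = nn.Linear(4096, 4096 * 3, bias=False)

layer = layer.to_empty(device="cuda")                 # allocate real, uninitialized storage
checkpoint = {"weight": torch.randn(4096 * 3, 4096)}  # stand-in for real pretrained weights
layer.load_state_dict(checkpoint)                     # overwrite with the saved values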

  2. Another solution comes from https://pytorch.org/torchdistx/latest/deferred_init.html, but it requires tweaking from_pretrained to support from torchdistx.deferred_init import deferred_init, materialize_module, and this experimental package isn't easy to install since it requires building CUDA extensions (though not for this functionality), so we can't make transformers depend on it. It will have to be upstreamed into pytorch first.

credits: @cbalioglu

t-vi commented 1 year ago

Extending @stas00's solution number one, one can enhance the context-manager approach with a diversion of the init functions. I wrote up a bit more detail on my blog.
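
A minimal sketch of that combination (assuming PyTorch >= 2.0 for the device context manager; the list of patched init functions is not exhaustive and may differ from the blog post):

import torch
import torch.nn.init
from transformers import CodeGenForCausalLM

# Divert the tensor init functions to no-ops (their results are overwritten by the
# pretrained weights anyway) and construct the model directly on the GPU.
torch.nn.init.kaiming_uniform_ = lambda x, *args, **kwargs: x
torch.nn.init.uniform_ = lambda x, *args, **kwargs: x
torch.nn.init.normal_ = lambda x, *args, **kwargs: x

with torch.device("cuda"):
    model = CodeGenForCausalLM.from_pretrained('Salesforce/codegen-6B-mono')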

alexcoca commented 1 year ago

@stas00 your solution is great; I tested it a bit. Is there any timeline for this feature, and could one help with the integration? I'd be interested to know the team's thoughts on integrating this feature into the Trainer and also pipelines. Happy to help if I can!

stas00 commented 1 year ago

For the timeline questions we need to ask @sgugger

sgugger commented 1 year ago

The low_cpu_mem_usage=True option is already in Transformers and usable today. Changing the default will take more time in order to ensure backward compatibility.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

tomwagstaff-opml commented 1 year ago

I know this issue is closed, but here's some relevant feedback: I'm also seeing extremely slow performance with the from_pretrained method, this time in a conda environment. I tried the low_cpu_mem_usage=True solution, but it requires a more recent version of transformers than is available in the conda repos, so I can't use it. I've already reported this on Stack Overflow.

TL;DR: for a chunk of users (anyone who has to use a conda environment) the low_cpu_mem_usage=True parameter is not available or usable.

LysandreJik commented 1 year ago

Hey @tomwagstaff-opml, thanks for reporting.

I believe you're using the transformers version from anaconda's main channel, which we don't maintain (nor do any of the open-source project maintainers); it is maintained by the anaconda team.

In our README we indicate that you should use the huggingface channel in order to install the package.

Please install it as such:

conda install -c huggingface transformers

or, alternatively, use the conda-forge channel, which also carries the latest version:

conda install -c conda-forge transformers

tomwagstaff-opml commented 1 year ago

Thanks for your help @LysandreJik - installing transformers from the Hugging Face channel has worked and allowed me to try out the low_cpu_mem_usage parameter

CorentinJ commented 6 months ago

@cbalioglu the torch.device context manager does not seem to consistently put the weights on said device with from_pretrained.

This does put the model on cuda:

import torch
from transformers import AutoModel

with torch.device("cuda"):
    model = AutoModel.from_pretrained('sshleifer/tiny-gpt2')
    print(model.device)

This keeps it on CPU:

import torch
from transformers import AutoModelForCTC

with torch.device("cuda"):
    model = AutoModelForCTC.from_pretrained("patrickvonplaten/wav2vec2_tiny_random")
    print(model.device)

sidecus commented 4 months ago

@cbalioglu the torch.device context manager does not seem to consistently put the weights on said device with from_pretrained

Observing similar behavior:

import torch
from transformers import AutoModel

with torch.device('cuda'):
    model = AutoModel.from_pretrained('microsoft/wavlm-base-plus')
    print(model.device)

OUTPUT:

cpu