huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

T5-11b model parallelism #7047

Closed exelents closed 3 years ago

exelents commented 4 years ago

🚀 Feature request

I would like to finetune the t5-11b model on my dataset, but found that it doesn't fit in TPU or GPU memory: the Colab notebook simply crashes when I run it. I tried to find a ready-made model parallelism solution. First I found this PR: https://github.com/huggingface/transformers/pull/3578 but it seems it hasn't been released. I tried merging it into the master branch locally and using it, but it crashed. I also found the Eisen library, which promises "model parallelism with one line of code", but it works only for models with a single input (T5 has two inputs: tokens and an attention mask).

I need to distribute the model across several GPUs, and I see somebody has already tried to do this. If this development (pull request 3578) is still in progress, can you tell me whether there are any plans to release it?

patrickvonplaten commented 4 years ago

Hey @exelents,

yes we are still looking into a good way of doing model parallelism. Could you post the error message you received when using #3578?

exelents commented 4 years ago

Here it is:

<ipython-input-22-5591bd8e45c0> in main()
    143         cache_dir=model_args.cache_dir,
    144     )
--> 145     model = model.spread_on_devices(['cpu', 'cpu'])
    146
    147     # Get datasets

/usr/local/lib/python3.6/dist-packages/transformers/modeling_t5.py in spread_on_devices(self, devices)
    936             return
    937
--> 938         modules_to_move = set(self.modules)
    939
    940         # Evenly spread the blocks on devices

TypeError: 'method' object is not iterable

As I don't have several GPUs at the moment, I tried to run it on CPU (see line 145 in the error stack).
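For what it's worth, the TypeError above comes from iterating over self.modules rather than calling it: nn.Module.modules is a method that returns a generator, so it needs parentheses. A minimal illustration of the difference (whether this is the only problem in the PR branch is an assumption on my part):

import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.ReLU())
# set(model.modules) raises TypeError: 'method' object is not iterable
modules_to_move = set(model.modules())  # calling the method yields an iterator over submodules
print(modules_to_move)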

drpatrickkaggle commented 4 years ago

patrickvonplaten,

The following should be interesting.

https://www.microsoft.com/en-us/research/publication/training-large-neural-networks-with-constant-memory-using-a-new-execution-algorithm/

I have been in contact with them; they were planning to release the code as open source several months ago but ran into some issues with Microsoft internal processes. I heard the author is planning to open-source it themselves.

Can anyone work with them?

Cheers, Dr. Patrick

patrickvonplaten commented 4 years ago

That does look interesting. Thanks for sharing! I'm not sure whether we are planning to work with the author, but feel free to reach out to him; maybe this can help resolve T5 model parallelism.

exelents commented 4 years ago

Hello, guys. As I still need to train t5-11b, and Google doesn't want to give me access to its TPUs even though I can pay for them, I have made some changes to the T5 model to make it live on several GPUs simultaneously. My fork: https://github.com/huggingface/transformers/compare/master...exelents:model_parallelism_t5

The point is this: the transformer blocks (T5Block) are the largest parts of the network. The first step is to spread them evenly across all GPUs. In the second step we spread across the GPUs all the other blocks of the transformer, which are incomparably smaller than the main blocks. There are also some modifications to the original model code that move tensors to the necessary GPU whenever an incoming tensor and a layer sit on different devices (see the sketch after the nvidia-smi output below). Unfortunately, when testing this code on an 8-GPU server I found that the first GPU runs out of memory faster than the others:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:17.0 Off |                    0 |
| N/A   53C    P0    65W / 300W |  16108MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:18.0 Off |                    0 |
| N/A   53C    P0    64W / 300W |  10224MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:19.0 Off |                    0 |
| N/A   57C    P0    63W / 300W |  10224MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:1A.0 Off |                    0 |
| N/A   51C    P0    64W / 300W |  10224MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   51C    P0    63W / 300W |  13296MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   56C    P0    65W / 300W |  13296MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   52C    P0    62W / 300W |  13296MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   51C    P0    64W / 300W |  13548MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
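A minimal sketch of this kind of block assignment, assuming the stock T5ForConditionalGeneration attribute names (encoder/decoder stacks with a .block list, plus the shared embedding and lm_head). The helper name and the even chunking are my own illustration, not the exact code in the fork, and moving the weights alone is not enough: the forward pass still needs the device-moving changes described above.

import torch
from transformers import T5ForConditionalGeneration


def spread_t5_blocks(model, devices):
    # Assign each stack's T5Blocks to devices in contiguous, even chunks.
    # Everything else (shared embedding, lm_head) stays on the first device.
    assignment = {}
    for stack_name, stack in (("encoder", model.encoder), ("decoder", model.decoder)):
        per_device = -(-len(stack.block) // len(devices))  # ceiling division
        for i, block in enumerate(stack.block):
            device = devices[min(i // per_device, len(devices) - 1)]
            block.to(device)
            assignment[f"{stack_name}.block.{i}"] = device
    model.shared.to(devices[0])
    model.lm_head.to(devices[0])
    return assignment


model = T5ForConditionalGeneration.from_pretrained("t5-small")  # stand-in for t5-11b
devices = [f"cuda:{i}" for i in range(torch.cuda.device_count())] or ["cpu"]
print(spread_t5_blocks(model, devices))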

It seems that at the beginning of the graph there is a large block whose size is comparable to a T5Block. The smarter way would be to split the layers according to their actual memory usage, but I don't know a simple way to find out how much memory each module uses. A simple workaround might be to identify which layer uses that much memory and account for it in the first step, together with the T5Blocks.
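One rough way to estimate per-module memory from parameter and buffer sizes (my own sketch, not part of the fork; it ignores activations, gradients, and optimizer state):

from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")  # stand-in for t5-11b


def module_param_bytes(module):
    # Parameters plus registered buffers, counted recursively.
    params = sum(p.numel() * p.element_size() for p in module.parameters())
    buffers = sum(b.numel() * b.element_size() for b in module.buffers())
    return params + buffers


for name, child in model.named_children():
    print(f"{name}: {module_param_bytes(child) / 2**20:.1f} MiB")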

What do you think about this?

exelents commented 4 years ago

I tested this script on a machine with 8x32GB GPUs and saw the same symptoms: the first GPU's memory gets fully used while the other GPUs consume around 5 gigabytes each: https://pastebin.com/cV3CAQMk Looking at the output of the device assignment array, I see that all layers are spread evenly, so I can't imagine why it consumes memory on only one GPU. If somebody could help with this code, please tell me; I can prepare a running script for you. Also, you can use my code with only one line:

rc = model.split_across_gpus(devices=['cuda:0', 'cuda:1', 'cuda:2', 'cuda:3', 'cuda:4', 'cuda:5', 'cuda:6', 'cuda:7'])
print(rc)
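To narrow down where the memory is actually going, a quick per-device report using standard torch.cuda counters can help (my own suggestion, not part of the fork):

import torch

for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 2**30
    reserved = torch.cuda.memory_reserved(i) / 2**30
    print(f"cuda:{i}: allocated {allocated:.2f} GiB, reserved {reserved:.2f} GiB")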

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

LostBenjamin commented 3 years ago

Hi @exelents,

I also need model parallelism for T5 and your code should be very helpful. However, the link to your code seems invalid. Could you please share the code with me?

Best, Jingxuan

exelents commented 3 years ago

Hello, @LostBenjamin. Unfortunately, this code of mine didn't work when I tested the 11B model on 8 V100 GPUs, so I didn't fix it. @alexorona did some work on model parallelism; in https://github.com/huggingface/transformers/pull/9384 you can find a discussion about the model parallelism that already exists in the transformers library. It's about Bart, but the same functions exist in the T5 model class too. This is the code to spread a model over several GPUs:

model.parallelize()  # autogenerated
inputs = inputs.to("cuda:0")
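For completeness, a minimal sketch of the parallelize() call with an explicit device map; the 24-blocks-per-stack / 4-GPU layout here is my own assumption to adapt, and calling parallelize() with no arguments lets the library split the blocks automatically:

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-11b")
model = T5ForConditionalGeneration.from_pretrained("t5-11b")

# Map each GPU index to a contiguous range of block indices (here 24 blocks over 4 GPUs).
device_map = {
    0: list(range(0, 6)),
    1: list(range(6, 12)),
    2: list(range(12, 18)),
    3: list(range(18, 24)),
}
model.parallelize(device_map)

# Inputs go to the first device of the encoder.
inputs = tokenizer("translate English to German: Hello", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))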

Also, you can try DeepSpeed: https://github.com/exelents/try_t5_qa I haven't used this code for model parallelism, but people in the DeepSpeed community say MP exists in that library, so maybe this repo will be helpful.

LostBenjamin commented 3 years ago

Hi @exelents,

Thanks for your help! I will try the MP in transformers library.