huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

[ALBERT]: ALBERT base model itself consuming 32 GB of GPU memory #2284

Closed jonanem closed 4 years ago

jonanem commented 4 years ago

🐛 Bug

Model I am using: TFAlbert

Language I am using the model on (English, Chinese....): English

The problem arises when using:

from transformers import TFAlbertForSequenceClassification
model = TFAlbertForSequenceClassification.from_pretrained('albert-base-v2')

After this, GPU memory consumption is almost 32 GB. The albert-base-v2 checkpoint is only about 50 MB on disk, yet it ends up occupying 32 GB on the GPU.



dsindex commented 4 years ago

I have a similar situation.

https://github.com/dsindex/iclassifier#emb_classalbert

In the paper (https://arxiv.org/pdf/1909.11942.pdf), ALBERT xlarge has just 60M parameters, far fewer than BERT large's 334M. Yet we are unable to load albert-xlarge-v2 within 32G of GPU memory. (No problem with bert-large-uncased or bert-large-cased.)
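
A quick way to compare the raw parameter counts (a minimal sketch using the PyTorch model classes; it downloads the checkpoints on first run):

from transformers import AlbertModel, BertModel

# Sum the element counts of all parameter tensors.
def count_params(model):
    return sum(p.numel() for p in model.parameters())

albert = AlbertModel.from_pretrained('albert-xlarge-v2')
bert = BertModel.from_pretrained('bert-large-uncased')
print(f"albert-xlarge-v2: {count_params(albert) / 1e6:.0f}M parameters")
print(f"bert-large-uncased: {count_params(bert) / 1e6:.0f}M parameters")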

matteodelv commented 4 years ago

A similar situation happened to me too. While fine-tuning ALBERT base on SQuAD 2.0, I had to lower the train batch size to manage to fit the model on 2x NVIDIA 1080 Ti, for a total of about 19 GB used. I find it quite interesting and weird at the same time, as I managed to fine-tune BERT base on the same dataset and the same GPUs using less memory...

birdx0810 commented 4 years ago

Same for the PyTorch version of ALBERT, even though my 8/11 GB GPU can run BERT_base and RoBERTa without issues.

hankcs commented 4 years ago

Interesting. I'm starting to hesitate about using this ALBERT implementation, but I hope it will be fixed soon.

LysandreJik commented 4 years ago

Indeed, I can reproduce for the TensorFlow version. I'm looking into it, thanks for raising this issue.

LysandreJik commented 4 years ago

@jonanem, if you do this at the beginning of your script, does it change the amount of memory used?

import tensorflow as tf

# Cap TensorFlow's allocation on the first GPU at 1024 MB instead of
# letting it reserve nearly all available device memory at startup.
gpus = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_virtual_device_configuration(
    gpus[0],
    [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)]
)

This caps the memory TensorFlow allocates on that device at 1024 MB. By default, TensorFlow reserves nearly all available GPU memory at startup regardless of the model's actual size, which is what you are seeing. Initializing the model after this only uses 1.3 GB of VRAM on my side. Can you reproduce?

See the TensorFlow guide for more information: Limiting GPU memory growth (https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth)
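
The same guide also describes an allocate-as-needed mode, which avoids choosing a hard cap up front; a minimal sketch, assuming TensorFlow 2.x:

import tensorflow as tf

# Ask TensorFlow to grow its GPU allocation as tensors are created,
# rather than reserving the whole device at startup.
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)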

matteodelv commented 4 years ago

@LysandreJik I just did some investigation and found a similar problem with the PyTorch implementation. Model: ALBERT base v2, fine-tuning on the SQuAD v2 task.

I used the official code from Google's TensorFlow repository and managed to fine-tune it on a single GTX 1080 Ti, with batch size 16 and memory consumption of about 10 GB. Then I used the transformers PyTorch implementation and ran the same task on 4x V100 on AWS, with total batch size 48 and memory consumption of 52 GB (about 13 GB per GPU).

Putting that in perspective, I'd guess the memory consumption of the PyTorch implementation is 10 to 15 GB above what I was expecting. Is this normal? In particular, where in the code is the embedding factorization technique proposed in the official paper?

LysandreJik commented 4 years ago

Hi @matteodelv, I ran a fine-tuning task on ALBERT (base-v2) with the parameters you mentioned: batch size of 16. I ended up with VRAM usage of 11.4 GB, slightly more than the official Google TensorFlow implementation you mention. That is still lower than BERT, which uses a total of 14 GB.

However, when loading the model on its own without any other tensors, and taking into account the PyTorch memory overhead, it only takes about 66 MB of VRAM.
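
A minimal way to reproduce that kind of measurement (a sketch, assuming a CUDA-capable machine; the exact figure will vary with the PyTorch version):

import torch
from transformers import AlbertModel

# Move only the model to the GPU, then report how much memory its
# parameters and buffers occupy.
model = AlbertModel.from_pretrained('albert-base-v2').to('cuda')
print(f"{torch.cuda.memory_allocated() / 1024**2:.0f} MB allocated")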

Concerning your second question, here is the definition of the Embedding Factorization technique proposed in the official paper: [...] The first one is a factorized embedding parameterization. By decomposing the large vocabulary embedding matrix into two small matrices, we separate the size of the hidden layers from the size of vocabulary embedding.

In this PyTorch implementation, there are indeed two smaller matrices so that the two sizes are decoupled. The first embedding layer is visible in the AlbertEmbeddings class, with size (vocab_size, embedding_size), whereas the second is visible in the AlbertTransformer class, with size (embedding_size, hidden_size).
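
To see why this saves parameters, here is a back-of-the-envelope sketch using albert-base-v2's published sizes (vocab_size 30000, embedding_size 128, hidden_size 768):

# A full BERT-style embedding matrix vs. ALBERT's factorized pair.
vocab_size, embedding_size, hidden_size = 30000, 128, 768

full = vocab_size * hidden_size  # one (vocab_size, hidden_size) matrix
factorized = vocab_size * embedding_size + embedding_size * hidden_size

print(f"full: {full / 1e6:.2f}M params")       # ~23.04M
print(f"factorized: {factorized / 1e6:.2f}M")  # ~3.94M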

matteodelv commented 4 years ago

Thanks for your comment @LysandreJik... I hadn't looked in the AlbertTransformer class for the embedding factorization.

However, regarding the VRAM consumption, I'm still a bit confused. I don't get why the same model with batch size 16 consumes about 10 to 11 GB on a single GPU, while the same training on 4 GPUs (total batch size 48, i.e. 12 per GPU) requires more memory per GPU.

Could you please check this? Could it be related to PyTorch's DataParallel?
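
(For what it's worth, torch.nn.DataParallel replicates the model on every GPU and gathers the outputs on the default device, so memory use is typically uneven across devices and higher than total/N would suggest. A minimal sketch for inspecting this, assuming a multi-GPU machine; the batch and sequence sizes are illustrative:)

import torch
from transformers import AlbertModel

# Wrap the model in DataParallel, run a dummy batch, and report what
# each GPU holds; device 0 usually holds more, because the replicas'
# outputs are gathered there.
model = torch.nn.DataParallel(AlbertModel.from_pretrained('albert-base-v2').to('cuda'))
input_ids = torch.randint(0, 30000, (48, 384), device='cuda')
outputs = model(input_ids)
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.memory_allocated(i) / 1024**2:.0f} MB")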

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Shashi456 commented 4 years ago

Did @matteodelv @LysandreJik find any issue or solution for this? The memory consumption given the parameter count is insane.

matteodelv commented 4 years ago

Unfortunately not. I had to tune hyperparameters or use other hardware with more memory. But I was using an older version... I haven't checked if the situation has changed since then.

hfawaz commented 3 years ago

Hey, I tried running bert-base-uncased on a GTX 1080 (10GB) with success on the IMDB dataset, with batch size 16 and sequence length 128. Running albert-base-v2 with the same sequence length and batch size gives me out-of-memory errors.

I am using PyTorch, so I guess I have the same problem as you guys here.

SaeedNajafi commented 3 years ago

Same issue. ALBERT raises OOM requiring 32G.

XikunZhang commented 2 years ago

ALBERT repeats the same parameters for each layer but increases each layer's size, so even though it has fewer parameters than BERT, its memory needs are greater due to the much larger activations in each layer.
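
A back-of-the-envelope sketch of that effect, comparing albert-xlarge (hidden size 2048, 24 layers, ~60M parameters) with bert-large (hidden size 1024, 24 layers, ~334M); the batch and sequence sizes are illustrative:

# Parameter sharing shrinks the weights, but every layer still
# produces its own activations, which scale with the hidden size.
batch, seq = 16, 512
for name, hidden, layers in [("bert-large", 1024, 24), ("albert-xlarge", 2048, 24)]:
    # Rough count: one float32 hidden-state tensor per layer, ignoring
    # attention maps and the feed-forward expansion (8192-wide in
    # albert-xlarge), which widen the gap further.
    hidden_states = batch * seq * hidden * layers * 4  # bytes
    print(f"{name}: ~{hidden_states / 1024**3:.1f} GB of hidden states")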

SaeedNajafi commented 2 years ago

> ALBERT repeats the same parameters for each layer but increases each layer's size, so even though it has fewer parameters than BERT, its memory needs are greater due to the much larger activations in each layer.

That is true, and the extra computation is expected, but BERT fits into 16G of memory. I reimplemented ALBERT differently and could fit its weights on a 24G GPU.

MrShininnnnn commented 1 year ago

> ALBERT repeats the same parameters for each layer but increases each layer's size, so even though it has fewer parameters than BERT, its memory needs are greater due to the much larger activations in each layer.

Thanks for this explanation; it's a lifesaver.