khakhulin / compressed-transformer

Compression of NMT transformer model with tensor methods
MIT License

High GPU memory consumption #7

Open saareliad opened 5 years ago

saareliad commented 5 years ago

Hi, I tried to integrate the TTLinear layer into Transformer-XL; however, I found that it consumes much more memory than usual. I couldn't even train it.

The model before compression was 151M params; after compression it was 124M params. It even consumed much more memory during inference: 3021 MB for the compressed model versus 2132 MB for the normal model.

I also tried to write the "forward" method more efficiently (e.g., with bmm), but it didn't help either; a sketch of what I tried is below.
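
For reference, what I tried looks roughly like this (a minimal sketch of a TT matrix-times-batch product, not the repo's exact code):

```python
import torch

def tt_linear_forward(x, cores):
    """Sequential TT contraction: y[b] = W @ x[b], where W is given by TT
    cores G_k of shape (r_{k-1}, m_k, n_k, r_k) with r_0 = r_d = 1.
    x: (batch, prod(n_k)) -> returns (batch, prod(m_k))."""
    batch = x.size(0)
    # t: (batch, rank, remaining input modes * produced output modes)
    t = x.reshape(batch, 1, -1)
    for core in cores:
        r_prev, m_k, n_k, r_next = core.shape
        rest = t.size(2) // n_k
        # fold out the current input mode: (batch * rest, r_prev * n_k)
        t = t.reshape(batch, r_prev, n_k, rest).permute(0, 3, 1, 2)
        t = t.reshape(batch * rest, r_prev * n_k)
        # view the core as a (r_prev * n_k, m_k * r_next) matrix
        g = core.permute(0, 2, 1, 3).reshape(r_prev * n_k, m_k * r_next)
        t = t @ g  # the matmul where the memory goes
        # append the produced output mode, move the new rank up front
        t = t.reshape(batch, rest, m_k, r_next).permute(0, 3, 1, 2)
        t = t.reshape(batch, r_next, rest * m_k)
    return t.reshape(batch, -1)
```

Every step materializes a (batch, r_k, rest) activation (plus the copies saved for backward), so with large TT ranks these intermediates easily outweigh the parameter savings.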

Did you experience such problems? Do you know any way around this? Thanks.

khakhulin commented 5 years ago

Hi!

Could you please share information about your GPU and batch size? Did you try compressing only part of the layers and turning off compression for the attention matrices (see the arguments)? How many GPUs did you use?

Also, I carried out some experiments with Transformer-XL in January. At the end of August, I'll try to find my code.

Hint: if you want to increase the compression ratio, you can also try to compress the embedding (or projection matrices) in the same way.
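
For example, a rough sketch of compressing only the feed-forward linears (the name matching and the make_tt_linear factory here are illustrative, not this repo's exact arguments):

```python
import torch.nn as nn

def compress_ffn_only(module, make_tt_linear):
    """Recursively replace feed-forward nn.Linear sublayers with TT layers,
    leaving the attention projections dense. A sketch: adjust the name
    matching to your model's actual attribute names."""
    for name, child in list(module.named_children()):
        if isinstance(child, nn.Linear) and any(k in name for k in ("ff", "mlp")):
            setattr(module, name, make_tt_linear(child.in_features, child.out_features))
        else:
            compress_ffn_only(child, make_tt_linear)
    return module
```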

saareliad commented 5 years ago

BTW, I tried to profile the memory; most of it is allocated during the matmuls.
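
Roughly how I measured it (a sketch; model and batch stand for the compressed model and one input batch):

```python
import torch

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    out = model(batch)  # model / batch: the compressed model and one input batch
torch.cuda.synchronize()
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**20:.0f} MB")
```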

khakhulin commented 5 years ago

That's strange; I'll look at the implementation of the TT layer one more time.

Unfortunately, I didn't find my code for LM, but maybe you will be interested in a NeurIPS 2019 paper with the same idea for LM: "A Tensorized Transformer for Language Modeling". You could ask the authors to share their code.

saareliad commented 5 years ago

I partially solved it: reconstruct the full matrix from the TT cores, then do the operation (a solution suggested in the t3nsor repo). I implemented it in my custom code, and it worked there too.
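
A sketch of that idea, assuming the cores are stored as a plain list of (r_{k-1}, m_k, n_k, r_k) tensors (this is not t3nsor's actual API):

```python
import torch

def tt_to_dense(cores):
    """Reconstruct the full (M, N) matrix from TT cores of shape
    (r_{k-1}, m_k, n_k, r_k)."""
    res = cores[0]  # (1, m_1, n_1, r_1)
    for core in cores[1:]:
        # contract the shared rank, then merge the row and column modes
        res = torch.einsum("amnr,rijb->aminjb", res, core)
        a, m, i, n, j, b = res.shape
        res = res.reshape(a, m * i, n * j, b)
    return res.squeeze(0).squeeze(-1)  # (M, N)

def tt_forward_reconstruct(x, cores):
    # one big matmul against the reconstructed weight
    return x @ tt_to_dense(cores).t()
```

Reconstructing costs the full O(M * N) weight per forward, but that cost is independent of batch size, so it avoids the large (batch, rank, rest) intermediates of the sequential contraction.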

BTW, I read the paper you mentioned (and also emailed the author, who didn't answer; I think he had a good reason). I found a lot of problems with it. The code they published is very bad, to say the least, and one of the proofs is completely incorrect.

Check the Reddit thread on it, https://www.reddit.com/r/MachineLearning/comments/c4zxc6/r_a_tensorized_transformer_for_language_modeling/, where I publicly shared some of my criticism (some of which also comes from my team).

Without working code to prove it, I don't believe anything they say.

I even implemented it myself based on the paper and found more of the same nonsense...