saareliad opened this issue 5 years ago
Hi!
Could you please share information about your GPU and batch size? Did you try to compress only part of the layers and turn off compression for the attention matrices (see the arguments)? How many GPUs did you use?
Also, I carried out some experiments with Transformer-XL in January. At the end of August, I'll try to find my code.
Hint: if you want to increase the compression ratio, you can also try to compress the embedding (or projection matrices) in the same way.
I compressed all FF layers (only those).
The compression mode was Tensor Train (is that what you meant?).
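For readers unfamiliar with the format, here is a minimal sketch of what Tensor Train compression does to a linear layer's weight. The modes and ranks below are illustrative choices of mine, not the settings used in this thread:

```python
import numpy as np

# View a 512x512 weight matrix as a 6-way tensor with row modes
# (8, 8, 8) and column modes (8, 8, 8), since 8 * 8 * 8 = 512.
in_modes, out_modes = (8, 8, 8), (8, 8, 8)
ranks = (1, 4, 4, 1)  # TT-ranks; the boundary ranks are always 1

# One 4-way core per mode pair: core k has shape (r_k, m_k, n_k, r_{k+1}).
cores = [
    np.random.randn(ranks[k], in_modes[k], out_modes[k], ranks[k + 1])
    for k in range(3)
]

tt_params = sum(c.size for c in cores)
full_params = 512 * 512
print(tt_params, full_params)  # 1536 vs 262144 parameters
```

The parameter savings come entirely from storing the small cores instead of the full matrix; as discussed below, this says nothing about activation memory at runtime.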
Tested on 4 GPUs (Titan Xp).
For training: batch size = 64 (4 on GPU 0, 20 on each of the other GPUs), seq_len = 150. I also tried to reduce the batch size, although that's not desirable.
BTW, I tried to profile the memory; most of it is allocated when doing matmuls.
That's strange; I'll look at the implementation of the TT layer one more time.
Unfortunately, I didn't find the code for the LM, but maybe you will be interested in a NeurIPS 2019 paper that has the same idea for LM: "A Tensorized Transformer for Language Modeling". You could ask the authors to share their code.
I partially solved it: reconstruct the full matrix, then do the operation - the solution suggested in the t3nsor repo. I implemented it in my custom code and it worked too.
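A minimal sketch of that workaround (the function name and shapes are mine, not taken from the t3nsor repo): contract the TT cores back into a dense weight once, then run a plain matmul, trading the TT forward's large intermediates for one full-size weight:

```python
import numpy as np

def tt_to_full(cores, in_modes, out_modes):
    """Contract TT cores of shape (r_k, m_k, n_k, r_{k+1}) into a full matrix."""
    res = cores[0]                     # (1, m0, n0, r1)
    for core in cores[1:]:
        # merge along the shared rank index; axes stay as m0,n0,m1,n1,...
        res = np.einsum('...a,aijb->...ijb', res, core)
    res = res.squeeze(0).squeeze(-1)   # drop boundary ranks r0 = rK = 1
    d = len(in_modes)
    # regroup interleaved axes m0,n0,m1,n1,... into m0,...,n0,...
    perm = list(range(0, 2 * d, 2)) + list(range(1, 2 * d, 2))
    res = res.transpose(perm)
    return res.reshape(int(np.prod(in_modes)), int(np.prod(out_modes)))

# Toy usage: a 16x16 weight stored as two TT cores with rank 3.
in_modes, out_modes = (4, 4), (4, 4)
cores = [np.random.randn(1, 4, 4, 3), np.random.randn(3, 4, 4, 1)]
W = tt_to_full(cores, in_modes, out_modes)   # dense (16, 16) weight
x = np.random.randn(2, 16)
y = x @ W                                    # plain dense matmul, (2, 16)
```

Note this gives up the FLOP savings of the TT forward and briefly materializes the full weight, but it keeps the parameter savings in storage and avoids the rank-carrying intermediates.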
Btw, I read the paper you mentioned (and also emailed the author, who didn't answer - I think he had a good reason). I found a lot of problems with it. The code they published is very bad, to say the least. One of the proofs is completely incorrect.
Check the Reddit thread on it, https://www.reddit.com/r/MachineLearning/comments/c4zxc6/r_a_tensorized_transformer_for_language_modeling/, where I publicly shared some of my criticism (some of which also comes from my team).
Without working code to prove it, I don't believe anything they say.
I even implemented it myself based on the paper and found more nonsense they did...
Hi, I tried to integrate the TTLinear layer into Transformer-XL, but I found that it consumes much more memory than usual - I couldn't even train it.
The model before compression was 151M params; after compression, 124M params. It even consumed much more memory at inference: 3021 MB for the compressed model versus 2132 MB for the normal model.
I also tried to write the forward method more efficiently (e.g. with bmm), but it didn't help either.
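One plausible explanation for the blow-up (my sketch, not the actual repo code): a sequential TT forward contracts the input with one core at a time, and every intermediate carries a TT-rank axis. With B = batch * seq_len, the intermediate after core k holds roughly B * prod(finished n-modes) * prod(remaining m-modes) * r_{k+1} floats, so for rank > 1 the activations can dwarf the dense layer's B * N - the parameter count drops while runtime memory grows:

```python
import numpy as np

def tt_forward(x, cores, in_modes):
    """x: (B, prod(in_modes)); cores[k]: (r_k, m_k, n_k, r_{k+1})."""
    B = x.shape[0]
    res = x.reshape(B, *in_modes)
    res = np.moveaxis(res, 0, -1)[..., None]  # (m0, ..., m_{d-1}, B, 1)
    for core in cores:
        # contract the next input mode (axis 0) and the rank (last axis);
        # the result keeps a rank axis of size r_{k+1} on every step
        res = np.tensordot(res, core, axes=([0, -1], [1, 0]))
    return res.squeeze(-1).reshape(B, -1)     # (B, prod(out_modes))

# Toy check against an explicit contraction: modes (4, 4) -> (4, 4), rank 3.
c0, c1 = np.random.randn(1, 4, 4, 3), np.random.randn(3, 4, 4, 1)
x = np.random.randn(2, 16)
y = tt_forward(x, [c0, c1], (4, 4))
y_ref = np.einsum('buv,ups,svq->bpq',
                  x.reshape(2, 4, 4), c0[0], c1[..., 0]).reshape(2, 16)
print(np.allclose(y, y_ref))  # True
```

If this is indeed the cause, a bmm-based rewrite changes how the contractions are batched but not the size of these rank-carrying intermediates, which would explain why it didn't help.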
Did you experience such problems? Do you know any way around this? Thanks!