lucidrains / routing-transformer

Fully featured implementation of Routing Transformer
MIT License

LM slower than the encoder-decoder with the same depth, max_seq_len, and window size #20

Open · AliOskooeiTR opened 3 years ago

AliOskooeiTR commented 3 years ago

This is more of a sanity-check question than an issue. I have trained the Routing Transformer encoder-decoder in the past and was really impressed by the speed: I got about 4 iter/sec training on sequences 12,000 tokens long. Now I am training a language model with a depth equal to the encoder/decoder depth of my old model, keeping all other parameters the same, and the training rate for the LM has fallen below 1 iter/sec. I was wondering whether this is to be expected or whether there may be something wrong that I need to look into. Thank you for your help.
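For concreteness, a minimal sketch of the two setups being compared might look like the following. It loosely follows the README examples in this repo; the dimensions, vocabulary size, and depth here are made-up placeholders, and the exact keyword arguments may differ between library versions.

```python
from routing_transformer import RoutingTransformerLM, RoutingTransformerEncDec

SEQ_LEN = 12288      # long-sequence setting similar to the one described above (placeholder)
DEPTH = 4            # hypothetical depth, kept the same in both setups
WINDOW_SIZE = 128    # same local-attention / cluster window size in both setups

# Encoder-decoder: DEPTH layers on the encoder side and DEPTH on the decoder side.
enc_dec = RoutingTransformerEncDec(
    dim = 512,
    enc_num_tokens = 256, enc_depth = DEPTH, enc_heads = 8,
    enc_max_seq_len = SEQ_LEN, enc_window_size = WINDOW_SIZE,
    dec_num_tokens = 256, dec_depth = DEPTH, dec_heads = 8,
    dec_max_seq_len = SEQ_LEN, dec_window_size = WINDOW_SIZE
)

# Decoder-only LM with the same depth, max_seq_len, and window size as one side
# of the encoder-decoder above.
lm = RoutingTransformerLM(
    num_tokens = 256,
    dim = 512,
    depth = DEPTH,
    heads = 8,
    max_seq_len = SEQ_LEN,
    window_size = WINDOW_SIZE,
    causal = True
)
```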

lucidrains commented 3 years ago

@AliOskooeiTR Do you remember which version of the library you were using when you trained your first encoder / decoder?

AliOskooeiTR commented 3 years ago

@lucidrains Hi Phil, I trained it over the summer with version 0.8.8.
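One way to isolate whether the slowdown is a regression between library versions would be a rough timing run, executed once in an environment pinned to routing-transformer==0.8.8 and once against the current release, with identical hyperparameters. The snippet below is only a hedged sketch: the sizes are placeholders, and it assumes the LM forward returns logits plus the auxiliary k-means loss, as in the README examples.

```python
import time
import torch
import torch.nn.functional as F
from routing_transformer import RoutingTransformerLM

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Placeholder hyperparameters; in a real comparison these would match the actual model.
model = RoutingTransformerLM(
    num_tokens = 256, dim = 512, depth = 4, heads = 8,
    max_seq_len = 12288, window_size = 128, causal = True
).to(device)

x = torch.randint(0, 256, (1, 12288), device = device)

steps = 10
start = time.time()
for _ in range(steps):
    model.zero_grad()
    logits, aux_loss = model(x)
    # crude unshifted LM loss, good enough for timing the forward/backward pass
    loss = F.cross_entropy(logits.transpose(1, 2), x) + aux_loss
    loss.backward()
print(f"{steps / (time.time() - start):.2f} iter/sec")
```

Comparing the reported iter/sec across the two environments should show whether the drop comes from the library version or from the change to a decoder-only configuration.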

AliOskooeiTR commented 3 years ago

@lucidrains Any pointers on the training rate issue? Thank you!