tatp22 / linformer-pytorch

My take on a practical implementation of Linformer for PyTorch.
https://arxiv.org/pdf/2006.04768.pdf
MIT License

Different number of tokens and Character Level Modeling #24

Closed wajihullahbaig closed 3 years ago

wajihullahbaig commented 3 years ago

Hi, thank you for the open-source code. I have been using Transformers for a while now, and I generally use them for character-level modeling, that is, translation between two different languages. I was wondering if you could answer the following questions:

1. Can I use a different number of tokens for the encoder and the decoder? This is because two different languages will have different tokens.
2. I can probably use your code for character-level modeling; at what point should I split the input stream of string tokens into characters? Is there a particular module you can point me to?

I hope I am not asking for much :)

Thank you!

tatp22 commented 3 years ago

Hi @wajihullahbaig!

  1. Yes, you can use a different number of tokens for the encoder and the decoder. Basically, what happens is that an integer token id (say, 10237) is transformed into a vector of length emb_dim, and these embeddings are already set up for you in the LinformerLM class. This embedding makes the input easier for the Linformer to process. So as long as the embedding dimension is the same between the encoder and the decoder, it should all be good! You can also swap in your own embedding function if you want.
  2. You can split the input into characters before you feed the tensor into the module, but make sure you have a mapping from characters to ints (for example, A -> 1, B -> 2). Also make sure you set num_tokens to the number of characters in your alphabet. I unfortunately can't point you to a specific place to start, since I didn't use the Linformer for this purpose; see the short sketch after this list for both points.
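A minimal sketch of both points (not taken from the repo's examples; only num_tokens and emb_dim come from the discussion above, while the input_size and channels argument names and the toy alphabets are assumptions and may differ in your version):

```python
import torch
from linformer_pytorch import LinformerLM

# Toy character alphabets for a source and a target language. Index 0 is
# reserved for padding, so real characters start at 1 (e.g. A -> 1, B -> 2).
src_alphabet = sorted(set("hello world"))
tgt_alphabet = sorted(set("hallo welt"))
src_to_id = {ch: i + 1 for i, ch in enumerate(src_alphabet)}

def encode(text, mapping, seq_len):
    # Split the string into characters, map each to its int id, pad with 0.
    ids = [mapping[ch] for ch in text][:seq_len]
    return torch.tensor([ids + [0] * (seq_len - len(ids))])  # shape: (1, seq_len)

seq_len = 32

# One model per side: num_tokens differs (different alphabet sizes, +1 for the
# padding id), while emb_dim is kept the same on both sides.
src_model = LinformerLM(num_tokens=len(src_alphabet) + 1, input_size=seq_len,
                        channels=64, emb_dim=64)
tgt_model = LinformerLM(num_tokens=len(tgt_alphabet) + 1, input_size=seq_len,
                        channels=64, emb_dim=64)

x = encode("hello world", src_to_id, seq_len)
logits = src_model(x)
print(logits.shape)  # expected: (1, seq_len, len(src_alphabet) + 1)
```

The same num_tokens idea carries over to whichever encoder/decoder setup you end up wiring together, as long as emb_dim matches on both sides.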

Hope this helped!

wajihullahbaig commented 3 years ago

Thank you very much for the detailed reply. This will really help. Cheers!