Compound words

I was wondering if there is a good way to train the routing transformer (or x-transformers) on a 3D tensor input, as is done in the Compound Word Transformer. Instead of single tokens, groups of tokens are fed into the model, and each group is encoded into a single embedding.
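
To make the input format concrete, here is a minimal sketch of the kind of embedding step I mean, assuming PyTorch. The class name `CompoundWordEmbedding`, the vocabulary sizes, and the dimensions are all hypothetical and chosen for illustration; none of them come from routing-transformer, x-transformers, or the Compound Word Transformer code.

```python
import torch
import torch.nn as nn


class CompoundWordEmbedding(nn.Module):
    """Hypothetical module: (batch, seq_len, num_fields) ids -> (batch, seq_len, dim)."""

    def __init__(self, vocab_sizes, field_dims, dim):
        super().__init__()
        assert len(vocab_sizes) == len(field_dims)
        # One embedding table per field of the token group.
        self.embeddings = nn.ModuleList(
            [nn.Embedding(v, d) for v, d in zip(vocab_sizes, field_dims)]
        )
        # Project the concatenated field embeddings to the model dimension.
        self.project = nn.Linear(sum(field_dims), dim)

    def forward(self, tokens):
        # tokens: integer ids of shape (batch, seq_len, num_fields)
        parts = [emb(tokens[..., i]) for i, emb in enumerate(self.embeddings)]
        return self.project(torch.cat(parts, dim=-1))


# Illustrative sizes only (assumptions, not values from any paper or repo).
vocab_sizes = (16, 32, 64)
embed = CompoundWordEmbedding(vocab_sizes, field_dims=(8, 16, 32), dim=128)

# Build a 3D batch: 2 sequences, 10 positions, 3 fields per position.
tokens = torch.stack(
    [torch.randint(0, v, (2, 10)) for v in vocab_sizes], dim=-1
)
x = embed(tokens)  # shape (2, 10, 128)
```

The result has the usual `(batch, seq_len, dim)` shape, so in principle it could be passed to any attention stack that accepts already-embedded input rather than token ids. The output side would need a matching change, for example one prediction head per field, which this sketch does not cover.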

I combined elements from x-transformers and the Compound Word Transformer into a custom implementation. It works, but it seems a bit messy.

Now I want to move this approach to the routing transformer. What do you think would be a clean way to implement it?

lucidrains / routing-transformer

Compound words #31