Hello,
First, I would like to thank the authors of this paper for releasing their source code.
Is there a plan to apply the same approach with a Universal Transformer as the base architecture? Would the adaptive computation time (ACT) mechanism transfer to other tasks?
More importantly, if the Universal Transformer can be used, do you think the gain would be noticeable?