Huggingface-compatible implementation of RetNet (Retentive Network, https://arxiv.org/pdf/2307.08621.pdf), including parallel, recurrent, and chunkwise forward passes.
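As a quick reference for what the parallel and recurrent forwards compute, here is a minimal single-head retention sketch (NumPy, no gating or normalization; function names are illustrative, not the ones used in this repo). Both forms produce identical outputs:

```python
import numpy as np

def parallel_retention(q, k, v, gamma):
    # Full-sequence form: out = (Q K^T ∘ D) V, where the decay mask
    # D[n, m] = gamma**(n - m) for n >= m and 0 otherwise (causal).
    T = q.shape[0]
    n, m = np.indices((T, T))
    D = np.where(n >= m, gamma ** (n - m), 0.0)
    return (q @ k.T * D) @ v

def recurrent_retention(q, k, v, gamma):
    # Step-by-step form: the state S accumulates gamma-decayed
    # outer products k_t^T v_t; each output is q_t @ S_t.
    d, dv = q.shape[1], v.shape[1]
    S = np.zeros((d, dv))
    out = np.zeros_like(v)
    for t in range(q.shape[0]):
        S = gamma * S + np.outer(k[t], v[t])
        out[t] = q[t] @ S
    return out
```

The parallel form is used for training (one matmul over the whole sequence); the recurrent form gives O(1)-per-token inference with the same result.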
The implementation follows the official code found in torchscale: mostly copy and paste, but with more readable variable names, additional notes, retention_mask handling, etc.
It also includes a torchscale/ directory, which contains a copy of the original implementation with the xmoe-related code removed and args renamed to config, to reduce the diff. The class names have also been changed to match my implementation.
tests/ checks that my implementation and the torchscale version behave exactly the same.
Model weights trained with torchscale can be converted with the convert_weights.py script.
Currently, the parallel and recurrent forwards produce identical results. Chunkwise differs slightly: see my note in #10. (The same discrepancy is present in the official implementation.) Either way, the implementation is comparable with the torchscale code.
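For context on where chunkwise can drift, here is a sketch of the chunkwise recurrence (same illustrative single-head NumPy setup as above, not this repo's actual code): each chunk combines a within-chunk parallel retention with a cross-chunk state term, and the state is decayed and updated once per chunk. Small numerical differences versus the fully recurrent form come from how these decay factors are applied.

```python
import numpy as np

def chunkwise_retention(q, k, v, gamma, chunk_size):
    T, d = q.shape
    dv = v.shape[1]
    S = np.zeros((d, dv))   # cross-chunk state, carried between chunks
    out = np.zeros_like(v)
    for s in range(0, T, chunk_size):
        qc, kc, vc = q[s:s + chunk_size], k[s:s + chunk_size], v[s:s + chunk_size]
        B = qc.shape[0]
        # Within-chunk: ordinary parallel retention with a causal decay mask.
        n, m = np.indices((B, B))
        D = np.where(n >= m, gamma ** (n - m), 0.0)
        inner = (qc @ kc.T * D) @ vc
        # Cross-chunk: position i reads the previous state decayed by gamma**(i+1).
        decay_i = gamma ** np.arange(1, B + 1)[:, None]
        out[s:s + B] = inner + decay_i * (qc @ S)
        # State update: decay old state by gamma**B, add decayed k^T v of this chunk.
        decay_k = gamma ** (B - 1 - np.arange(B))[:, None]
        S = (gamma ** B) * S + (kc * decay_k).T @ vc
    return out
```

With exact arithmetic this matches the recurrent form token for token; in floating point the chunked decay products are where the implementations can differ.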