Huggingface-compatible implementation of RetNet (Retentive Network, https://arxiv.org/pdf/2307.08621.pdf), including parallel, recurrent, and chunkwise forward passes.
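As a quick reference for what the parallel and recurrent forwards compute, here is a minimal single-head retention sketch (NumPy, no gating or normalization; function names are illustrative, not the ones used in this repo). Both forms produce identical outputs:

```python
import numpy as np

def parallel_retention(q, k, v, gamma):
    # Full-sequence form: out = (Q K^T ∘ D) V, where the decay mask
    # D[n, m] = gamma**(n - m) for n >= m and 0 otherwise (causal).
    T = q.shape[0]
    n, m = np.indices((T, T))
    D = np.where(n >= m, gamma ** (n - m), 0.0)
    return (q @ k.T * D) @ v

def recurrent_retention(q, k, v, gamma):
    # Step-by-step form: the state S accumulates gamma-decayed
    # outer products k_t^T v_t; each output is q_t @ S_t.
    d, dv = q.shape[1], v.shape[1]
    S = np.zeros((d, dv))
    out = np.zeros_like(v)
    for t in range(q.shape[0]):
        S = gamma * S + np.outer(k[t], v[t])
        out[t] = q[t] @ S
    return out
```

The parallel form is used for training (one matmul over the whole sequence); the recurrent form gives O(1)-per-token inference with the same result.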
The implementation follows the official code found in torchscale: mostly copy and paste, but with more readable variable names, additional notes, retention_mask handling, etc.
It also includes a torchscale/ directory, which contains a copy of the original implementation with the xmoe-related code removed and args renamed to config, to reduce the diff. The class names have also been changed to match my implementation.
tests/ checks that my implementation and the torchscale version behave exactly the same.
Model weights trained with torchscale can be converted with the convert_weights.py script.
Currently, the parallel and recurrent forwards produce identical results. Chunkwise differs slightly: see my note in #10. (The same discrepancy is present in the official implementation.) Either way, the implementation is comparable with the torchscale code.
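For context on where chunkwise can drift, here is a sketch of the chunkwise recurrence (same illustrative single-head NumPy setup as above, not this repo's actual code): each chunk combines a within-chunk parallel retention with a cross-chunk state term, and the state is decayed and updated once per chunk. Small numerical differences versus the fully recurrent form come from how these decay factors are applied.

```python
import numpy as np

def chunkwise_retention(q, k, v, gamma, chunk_size):
    T, d = q.shape
    dv = v.shape[1]
    S = np.zeros((d, dv))   # cross-chunk state, carried between chunks
    out = np.zeros_like(v)
    for s in range(0, T, chunk_size):
        qc, kc, vc = q[s:s + chunk_size], k[s:s + chunk_size], v[s:s + chunk_size]
        B = qc.shape[0]
        # Within-chunk: ordinary parallel retention with a causal decay mask.
        n, m = np.indices((B, B))
        D = np.where(n >= m, gamma ** (n - m), 0.0)
        inner = (qc @ kc.T * D) @ vc
        # Cross-chunk: position i reads the previous state decayed by gamma**(i+1).
        decay_i = gamma ** np.arange(1, B + 1)[:, None]
        out[s:s + B] = inner + decay_i * (qc @ S)
        # State update: decay old state by gamma**B, add decayed k^T v of this chunk.
        decay_k = gamma ** (B - 1 - np.arange(B))[:, None]
        S = (gamma ** B) * S + (kc * decay_k).T @ vc
    return out
```

With exact arithmetic this matches the recurrent form token for token; in floating point the chunked decay products are where the implementations can differ.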