This is my own implementation of the Rawformer model from the paper
"Leveraging Positional-Related Local-Global Dependency for Synthetic Speech Detection" (Xiaohui Liu, Meng Liu, Longbiao Wang, Kong Aik Lee, Hanyi Zhang, Jianwu Dang).
WARNING

In the paper, the authors propose three variants of Rawformer: Rawformer-S, Rawformer-L, and SE-Rawformer.
I implemented all of these models, but only with 1-dimensional positional encoding.
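Since only the 1-dimensional positional encoding is used here, a minimal sketch of what that can look like may help. This assumes the standard Transformer-style sinusoidal formulation; the function name and parameters below are illustrative, not taken from the paper:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Standard 1-D sinusoidal positional encoding (Transformer-style).

    Returns an array of shape (seq_len, d_model) that is added to the
    feature sequence before the Transformer encoders.
    """
    positions = np.arange(seq_len)[:, None]  # (seq_len, 1)
    # Geometric progression of wavelengths over the even feature dims
    div_terms = np.exp(np.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions * div_terms)  # even dims: sine
    pe[:, 1::2] = np.cos(positions * div_terms)  # odd dims: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=100, d_model=64)
```

The encoding is deterministic (no learned parameters), so it can be recomputed for any sequence length at inference time.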
`N` is the number of Conv2D-based Blocks and `M` is the number of Transformer Encoders.

| Model | N | M | Conv2D-based Block |
| --- | --- | --- | --- |
| Rawformer-S | 4 | 2 | same as the ResNet block used in AASIST |
| Rawformer-L | 6 | 3 | same as the ResNet block used in AASIST |
| SE-Rawformer | 4 | 2 | same as Rawformer-S, but the last three blocks replaced with Res-SERes2Net blocks |
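The variant settings above can be captured in a small configuration map. This is a sketch of how the repo might expose them; the class and field names are my own, not from the paper:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RawformerConfig:
    num_conv_blocks: int  # N: number of Conv2D-based Blocks
    num_encoders: int     # M: number of Transformer Encoders
    conv_block: str       # which Conv2D-based Block the variant uses

# One entry per variant, mirroring the table above
CONFIGS = {
    "Rawformer-S":  RawformerConfig(4, 2, "ResNet (AASIST-style)"),
    "Rawformer-L":  RawformerConfig(6, 3, "ResNet (AASIST-style)"),
    "SE-Rawformer": RawformerConfig(4, 2, "last three blocks Res-SERes2Net"),
}

cfg = CONFIGS["Rawformer-L"]
```

Keeping the variants in one map makes it easy to select a model by name from a CLI flag or config file.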