jishengpeng / WavTokenizer

SOTA discrete acoustic codec models with 40 tokens per second for audio language modeling
MIT License
801 stars 44 forks

some questions about model #49

Closed VJJJJJJ1 closed 1 week ago

VJJJJJJ1 commented 2 weeks ago

Hi, thank you for your great work. I have a few questions:

  1. As you mentioned in the paper, "adding the attention module before the ConvNeXt module appears to be the optimal solution." However, in decoder/models.py the AttnBlock is contained in pos_net, which comes after the ConvNeXt blocks. The order seems to be the opposite of what the paper describes.
  2. I want to do streaming inference with WavTokenizer, so I replaced all the convolution layers in SEANetEncoder, SEANetDecoder, ConvNeXtBlock and pos_net (ResnetBlock + AttnBlock) with causal convolution layers (class SConv1d with causal=True). Unfortunately, the generator loss keeps increasing (see the attached screenshot). Is there anything wrong with the modified model? A sketch of the causal convolution I have in mind follows this list.
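For reference, this is the kind of left-padded causal convolution I have in mind (a minimal sketch; the class and argument names here are illustrative, not the repository's actual SConv1d):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Minimal causal 1D convolution: pad only on the left so each output
    frame depends on the current and past input frames only."""

    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        self.left_pad = dilation * (kernel_size - 1)
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                      # x: (B, C, T)
        x = F.pad(x, (self.left_pad, 0))       # pad the past side only
        return self.conv(x)                    # output length stays T

x = torch.randn(1, 64, 100)
y = CausalConv1d(64, 64, kernel_size=7)(x)     # y.shape == (1, 64, 100)
```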

Thank you for your reply! (Screenshot of the increasing generator loss attached.)

jishengpeng commented 2 weeks ago

  1. The code is consistent with the paper, with the attention module placed before the ConvNeXt blocks. Link
  2. We have also experimented with WavTokenizer-Streaming and found the performance to be satisfactory, so the issue you are encountering is most likely a bug or some other problem in your modifications. When making this change, only the encoder's parameters need to be adjusted; the decoder, however, requires detailed changes to its attention and convolution modules (see the rough sketch below).
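As a rough illustration only (this is not the code in this repository, and the names and layer choices are hypothetical), a causal variant of the decoder's attention block could mask out future frames like this; a real streaming implementation may additionally need chunked or cached attention:

```python
import torch
import torch.nn as nn

class CausalAttnBlock(nn.Module):
    """Hypothetical causal self-attention over (B, C, T) features: a
    lower-triangular mask keeps each frame from attending to the future."""

    def __init__(self, channels):
        super().__init__()
        self.norm = nn.GroupNorm(1, channels)
        self.q = nn.Conv1d(channels, channels, 1)
        self.k = nn.Conv1d(channels, channels, 1)
        self.v = nn.Conv1d(channels, channels, 1)
        self.proj = nn.Conv1d(channels, channels, 1)

    def forward(self, x):                                   # x: (B, C, T)
        h = self.norm(x)
        q, k, v = self.q(h), self.k(h), self.v(h)
        scores = torch.einsum("bct,bcs->bts", q, k) / (q.shape[1] ** 0.5)
        future = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))  # block future frames
        attn = scores.softmax(dim=-1)
        out = torch.einsum("bts,bcs->bct", attn, v)
        return x + self.proj(out)
```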