Closed YoungloLee closed 1 year ago
Hey, can you share your training graphs\results with your correction and without?
and also, why did you use mode='reflect' ?
Without this modification, the SI-SDR metric does not drop below 30dB because of the time-alignment mismatch (whenever downsampling (stride > 1) occurs). Since the scale of the time-domain reconstruction loss (l1 + l2) is difficult to interpret, just take a look at my training SI-SDR curve. (batch size = 32, training segment length = 1 sec, very early training stage, ~12k steps)
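For readers unfamiliar with the metric: SI-SDR is very sensitive to a sample-level shift between the estimate and the target, which is why the time-alignment mismatch shows up so clearly in it. A minimal sketch of the metric (not the exact code used in this thread; function name and epsilon are assumptions):

```python
import torch

def si_sdr(estimate, target, eps=1e-8):
    # scale-invariant SDR: project the estimate onto the target,
    # then compare projection energy against residual-noise energy
    target = target - target.mean(dim=-1, keepdim=True)
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    alpha = (estimate * target).sum(-1, keepdim=True) / (target.pow(2).sum(-1, keepdim=True) + eps)
    projection = alpha * target
    noise = estimate - projection
    ratio = projection.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps)
    return 10 * torch.log10(ratio + eps)
```

A rescaled copy of the target scores very high (scale invariance), while the same signal shifted by even one sample scores poorly, illustrating why a stride-induced misalignment caps the metric.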
Also, i just set the padding mode to be 'reflect' following FAIR's EnCodec implementation (https://github.com/facebookresearch/encodec/blob/6e8d7eda6fff5b0d589d64f063610c7f6044963e/encodec/modules/seanet.py#L95).
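For anyone unsure what the padding mode changes: 'reflect' fills the padded region by mirroring interior samples instead of inserting zeros. A quick illustration with `torch.nn.functional.pad` (a generic example, not code from either repo):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[[1., 2., 3., 4.]]])  # (batch, channels, time)

# 'reflect' mirrors the samples next to the edge (edge sample itself excluded)
print(F.pad(x, (2, 0), mode='reflect'))   # -> [3., 2., 1., 2., 3., 4.]

# 'constant' (the default) pads with zeros
print(F.pad(x, (2, 0), mode='constant'))  # -> [0., 0., 1., 2., 3., 4.]
```

Reflect padding avoids the hard discontinuity that zero padding introduces at the signal boundary.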
@YoungloLee thank you dearly for this! i believe you are correct and this is a huge misstep on my part :pray:
Maybe I'm totally wrong, but doesn't reflect padding mode make it non-streaming?
I think it does not matter for streaming.
Hey, doesn't it affect the CausalTransposedConv1d as well, if so?
https://github.com/lucidrains/audiolm-pytorch/blob/main/audiolm_pytorch/soundstream.py#L303-L314 In the CausalConv1d class, the amount of padding should be dilation * (kernel_size - 1) + 1 - stride, not just dilation * (kernel_size - 1). Without this correction, the soundstream model does not converge. Hope this helps you.
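A minimal sketch of the corrected layer (module structure and argument names are assumptions, not the repo's exact code). With left padding of dilation * (kernel_size - 1) + 1 - stride, the output length works out to input_length // stride, so downsampled frames stay time-aligned with the input:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, chan_in, chan_out, kernel_size, stride=1, dilation=1, pad_mode='reflect'):
        super().__init__()
        # corrected causal (left-only) padding: accounts for stride so that
        # output length == input length // stride
        self.causal_padding = dilation * (kernel_size - 1) + 1 - stride
        self.pad_mode = pad_mode
        self.conv = nn.Conv1d(chan_in, chan_out, kernel_size, stride=stride, dilation=dilation)

    def forward(self, x):
        x = F.pad(x, (self.causal_padding, 0), mode=self.pad_mode)
        return self.conv(x)
```

For example, with kernel_size = 4 and stride = 2 on a 16-sample input, the corrected padding of 2 yields exactly 8 output frames; the uncorrected padding of 3 would produce 9, introducing the alignment mismatch described above.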