alanshaoTT opened 2 weeks ago
What do you mean by not converging well? Is it having poor WERs?
My model's loss is quite high, fluctuating around 2.0, and it hasn’t decreased much. The WER is also high.
This is my model's loss:
This is strange. I have trained the SSL features from HuBERT / WavLM using Zipformer recipe, it converges well.
The frame rate of wav2vec2.0 is 50 Hz, while fbank is 100 Hz. Maybe you can simply interpolate it to 100 Hz, or remove the downsampling in subsampling.py.
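For reference, a minimal sketch of that interpolation in PyTorch, assuming the features are shaped `(batch, time, channels)`; the function name and the 768-dim example are just illustrative:

```python
import torch
import torch.nn.functional as F

# Upsample 50 Hz wav2vec2.0 features to 100 Hz so they match the frame
# rate that Zipformer's subsampling front-end expects from 100 Hz fbank.
def upsample_to_100hz(features: torch.Tensor) -> torch.Tensor:
    # F.interpolate expects (batch, channels, time), so transpose first
    x = features.transpose(1, 2)  # (N, C, T)
    x = F.interpolate(x, scale_factor=2.0, mode="linear", align_corners=False)
    return x.transpose(1, 2)      # (N, 2T, C)

# Example: 150 frames at 50 Hz (a 3-second utterance) become 300 frames
feats = torch.randn(4, 150, 768)
print(upsample_to_100hz(feats).shape)  # torch.Size([4, 300, 768])
```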
Thanks! I will try to interpolate it to 100 Hz.
If it's the latest zipformer recipe, you can also try adding the option warmup_start=0.1 to the initializer of Eden (the scheduler); this sometimes helps in case of divergence. Or reduce the learning rate (e.g. 0.045 to 0.035).
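For reference, a minimal sketch of that change, assuming the latest recipe's train.py where Eden comes from the recipe's optim module and lr_batches=7500 / lr_epochs=3.5 are the defaults; `optimizer` is the recipe's ScaledAdam instance:

```python
from optim import Eden  # icefall zipformer recipe module

scheduler = Eden(
    optimizer,
    lr_batches=7500,
    lr_epochs=3.5,
    warmup_start=0.1,  # warm up from 10% of the base LR; can help against divergence
)
```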
I am not using the latest version of Zipformer. I use the final-layer representation of a pretrained wav2vec2.0 model, interpolate it linearly to 100 Hz, project the number of channels to 80 with a linear layer, and finally apply layer normalization before passing it to Zipformer. I had to reduce the base learning rate to 0.0035 to address the gradient vanishing problem, which is quite strange.
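For context, here is a minimal sketch of that front-end as I understand it; the module name is made up, and the 768-dim input assumes a wav2vec2.0 base model (the large variant would be 1024):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Front-end described above: final-layer wav2vec2.0 features at 50 Hz are
# linearly interpolated to 100 Hz, projected to 80 channels to match the
# fbank dimension, and layer-normalized before entering Zipformer.
class W2V2FrontEnd(nn.Module):
    def __init__(self, in_dim: int = 768, out_dim: int = 80):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, T, in_dim) at 50 Hz
        x = x.transpose(1, 2)                                 # (N, C, T)
        x = F.interpolate(x, scale_factor=2.0, mode="linear")
        x = x.transpose(1, 2)                                 # (N, 2T, C)
        return self.norm(self.proj(x))                        # (N, 2T, 80)
```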
I was wondering if zipformer has any specific modules or functions designed for fbank features? I’m using pretrained wav2vec2.0 representations as input for zipformer training, but I’m having trouble with the model’s loss not converging well. I’m following the librispeech/zipformer recipe, but when I used the same representations with librispeech/pruned_transducer_stateless7, it converged just fine. I noticed the main difference between these recipes is the zipformer encoder. Is there something specifically designed for fbank features in librispeech/zipformer that could be causing this?