k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

zipformer and SSL feature adaptation #1799

Open alanshaoTT opened 2 weeks ago

alanshaoTT commented 2 weeks ago

I was wondering if zipformer has any specific modules or functions designed for fbank features? I’m using pretrained wav2vec2.0 representations as input for zipformer training, but I’m having trouble with the model’s loss not converging well. I’m following the librispeech/zipformer recipe, but when I used the same representations with librispeech/pruned_transducer_stateless7, it converged just fine. I noticed the main difference between these recipes is the zipformer encoder. Is there something specifically designed for fbank features in librispeech/zipformer that could be causing this?

marcoyang1998 commented 2 weeks ago

What do you mean by not converging well? Is it having poor WERs?

alanshaoTT commented 2 weeks ago

My model's loss is quite high, fluctuating around 2.0, and it hasn’t decreased much. The WER is also high.

alanshaoTT commented 2 weeks ago

(attached image: training loss curve) This is my model's loss.

yfyeung commented 1 week ago

This is strange. I have trained on SSL features from HuBERT / WavLM using the Zipformer recipe, and it converges well.

yfyeung commented 1 week ago

Isn't the frame rate of wav2vec2.0 50 Hz? The Fbank is 100 Hz. Maybe you can simply interpolate it to 100 Hz, or remove the downsampling in subsampling.py.
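The interpolation suggested above can be sketched with `torch.nn.functional.interpolate`. This is a minimal illustration, not code from the recipe; the shapes (768-dim wav2vec2.0 BASE features at 50 Hz) are assumptions:

```python
import torch
import torch.nn.functional as F

def upsample_features(feats: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """Linearly interpolate (batch, time, channels) features along time.

    wav2vec2.0 emits roughly 50 frames/s; doubling the time axis matches
    the 100 Hz Fbank frame rate that the recipe's subsampling expects.
    """
    # F.interpolate with mode="linear" expects (batch, channels, time),
    # so move the time axis last before interpolating.
    x = feats.transpose(1, 2)
    x = F.interpolate(x, scale_factor=factor, mode="linear", align_corners=False)
    return x.transpose(1, 2)

feats = torch.randn(4, 50, 768)   # 1 second of 50 Hz features (assumed dims)
up = upsample_features(feats)     # time axis doubled to 100 frames
```

The alternative mentioned (removing the downsampling in subsampling.py) avoids the interpolation entirely but requires editing the recipe.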

alanshaoTT commented 1 week ago

> Isn't the frame rate of wav2vec2.0 50 Hz? The Fbank is 100 Hz. Maybe you can simply interpolate it to 100 Hz, or remove the downsampling in subsampling.py.

Thanks! I will try interpolating it to 100 Hz.

danpovey commented 1 week ago

If it's the latest zipformer recipe, you can also try adding the option warmup_start=0.1 to the initializer of Eden (the scheduler); this sometimes helps in case of divergence. Or reduce the learning rate (e.g., 0.045 to 0.035).

alanshaoTT commented 6 days ago

I am not using the latest version of Zipformer. I take the final-layer representation of a pretrained wav2vec2.0 model, interpolate it linearly to 100 Hz, use a linear layer to align the number of channels to 80, and finally apply layer normalization before passing it to Zipformer. I had to reduce the base learning rate to 0.0035 to address a vanishing-gradient problem, which is quite strange.
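The adaptation pipeline described above (interpolate to 100 Hz, project to 80 channels, layer-normalize) could be sketched as a small module. This is a hypothetical reconstruction from the description, not the poster's actual code; `SSLFrontend` and the 768-dim input are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSLFrontend(nn.Module):
    """Hypothetical adapter: 50 Hz wav2vec2.0 features -> 100 Hz,
    80-channel, layer-normalized input resembling Fbank for Zipformer."""

    def __init__(self, ssl_dim: int = 768, out_dim: int = 80):
        super().__init__()
        self.proj = nn.Linear(ssl_dim, out_dim)  # align channels to 80
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # (B, T, C) -> (B, C, T): interpolate along the time axis.
        x = feats.transpose(1, 2)
        x = F.interpolate(x, scale_factor=2, mode="linear", align_corners=False)
        x = x.transpose(1, 2)   # back to (B, 2T, C), now at 100 Hz
        x = self.proj(x)        # match the 80 Fbank bins the recipe expects
        return self.norm(x)     # layer norm before feeding Zipformer

frontend = SSLFrontend()
out = frontend(torch.randn(2, 50, 768))   # time doubled, channels reduced to 80
```

Note that LayerNorm rescales each frame to roughly zero mean and unit variance, which differs from the statistics of log-Mel Fbank features; that mismatch, together with the learning-rate sensitivity reported above, may be worth probing separately.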