mt-upc / SHAS

SHAS: Approaching optimal Segmentation for End-to-End Speech Translation
MIT License

Hybrid W2V pad token is hardcoded #2

Closed by jairsan 1 year ago

jairsan commented 1 year ago

The pad token for the wav2vec hybrid segmentation method is hardcoded to the token "\<pad>".

https://github.com/mt-upc/SHAS/blob/a64a70f8571f7b154dadf205203a04d151448d5b/src/segmentation_methods/utils.py#L333

This causes problems when loading a model that uses a different pad token. For example, PereLluis13/Wav2Vec2-Large-XLSR-53-catalan uses "\<PAD>" instead. In that case the hardcoded comparison never matches, and the prediction is always "0".

This can be fixed in one line by comparing against processor.tokenizer.pad_token instead of the hardcoded string.
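A minimal sketch of the idea, not the actual code in utils.py: the function name pad_flags, the StubTokenizer class, and the choice of 1 for a match and 0 otherwise are assumptions made for illustration. Only the pad_token attribute mirrors what processor.tokenizer exposes.

```python
# Illustrative only: stands in for processor.tokenizer of a model
# whose pad token is "<PAD>" rather than "<pad>".
class StubTokenizer:
    pad_token = "<PAD>"


HARDCODED_PAD = "<pad>"  # the literal the comparison was hardcoded to


def pad_flags(tokens, pad_token):
    """Return 1 for frames whose token equals pad_token, else 0 (assumed convention)."""
    return [1 if tok == pad_token else 0 for tok in tokens]


tokenizer = StubTokenizer()
# Hypothetical per-frame tokens as this model would emit them.
tokens = ["<PAD>", "h", "o", "<PAD>", "<PAD>", "l", "a"]

# Hardcoded literal: no frame ever matches, so every flag is 0.
before = pad_flags(tokens, HARDCODED_PAD)

# Tokenizer-provided pad token: the pad frames are detected.
after = pad_flags(tokens, tokenizer.pad_token)

print(before)
print(after)
```

Reading the pad token from the tokenizer keeps the comparison correct for any checkpoint, whatever spelling its vocabulary uses.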

Before the fix:

[{duration: 10.06, offset: 0.0, rW: 0, speaker_id: NA, uW: 0, wav: Debate24_726.13_2273.72.wav},
  {duration: 10.06, offset: 9.94, rW: 0, speaker_id: NA, uW: 0, wav: Debate24_726.13_2273.72.wav},
  {duration: 10.06, offset: 19.94, rW: 0, speaker_id: NA, uW: 0, wav: Debate24_726.13_2273.72.wav},
...

After the fix:

[{duration: 2.82, offset: 0.0, rW: 0, speaker_id: NA, uW: 0, wav: Debate24_726.13_2273.72.wav},
  {duration: 9.74, offset: 3.5, rW: 0, speaker_id: NA, uW: 0, wav: Debate24_726.13_2273.72.wav},
  {duration: 2.44, offset: 13.76, rW: 0, speaker_id: NA, uW: 0, wav: Debate24_726.13_2273.72.wav},
...

Fixed in #3

johntsi commented 1 year ago

Hi Javier, true, thanks for pointing this out :)