microsoft / UniSpeech

UniSpeech - Large Scale Self-Supervised Learning for Speech

Huggingface sat model missing tokenizer #15

Closed bagustris closed 2 years ago

bagustris commented 2 years ago

I tried to use the pretrained model from Hugging Face, but it seems no tokenizer was uploaded there.

>>> processor = Wav2Vec2Processor.from_pretrained('microsoft/unispeech-sat-base-plus')
OSError: Can't load tokenizer for 'microsoft/unispeech-sat-base-plus'. Make sure that:

- 'microsoft/unispeech-sat-base-plus' is a correct model identifier listed on 'https://huggingface.co/models'
  (make sure 'microsoft/unispeech-sat-base-plus' is not a path to a local directory with something else, in that case)

- or 'microsoft/unispeech-sat-base-plus' is the correct path to a directory containing relevant tokenizer files

(1) Is there any workaround? (2) Also, since I don't need a tokenizer (I'm using the model for audio classification), is there an option to skip loading it?

cc @patrickvonplaten

patrickvonplaten commented 2 years ago

Hey @bagustris,

Could you instead use the following:

from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/unispeech-sat-base-plus")

and use the feature_extractor as the class to process the audio?

As you said, the model doesn't have a tokenizer, so we can simply use the feature extractor here.
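A minimal sketch of how the feature extractor might be applied to raw audio is shown below. It assumes 16 kHz mono input; the helper names `preprocess`, `load_and_preprocess`, and `make_dummy_audio`, as well as the synthetic sine wave, are illustrative and not part of the thread or the library.

```python
import numpy as np
from transformers import AutoFeatureExtractor

MODEL_ID = "microsoft/unispeech-sat-base-plus"


def preprocess(feature_extractor, waveform, sampling_rate=16_000):
    """Hypothetical helper: turn a 1-D float waveform into model inputs."""
    # The feature extractor only prepares raw audio (normalization, padding);
    # no tokenizer is involved at any point.
    return feature_extractor(
        waveform,
        sampling_rate=sampling_rate,
        return_tensors="np",
    )


def load_and_preprocess(waveform, sampling_rate=16_000):
    """Hypothetical helper: fetch the preprocessor config from the Hub, then preprocess."""
    feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_ID)
    return preprocess(feature_extractor, waveform, sampling_rate)


def make_dummy_audio(seconds=1.0, sampling_rate=16_000):
    """Assumption for illustration: a quiet 440 Hz sine wave stands in for real speech."""
    t = np.arange(int(seconds * sampling_rate)) / sampling_rate
    return (0.1 * np.sin(2 * np.pi * 440.0 * t)).astype(np.float32)
```

Calling `load_and_preprocess(make_dummy_audio())` should return a dictionary-like object whose `input_values` entry holds the prepared audio; real recordings would need to be resampled to the model's sampling rate first.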

bagustris commented 2 years ago

Hi @patrickvonplaten,

Thank you for the solution! It works! I think this should be clearly explained in the Transformers documentation, e.g., https://huggingface.co/transformers/model_doc/unispeech_sat.html

I tried several audio embedding models, including the UniSpeechSat model above, and ran into the same error. Your answer is what I was looking for.

patrickvonplaten commented 2 years ago

Great! In case you want to use the model for audio classification, the example in this section of the docs should be helpful: https://huggingface.co/transformers/model_doc/unispeech_sat.html#unispeechsatforsequenceclassification :-) Could you take a look and let me know if it's useful? :-)

bagustris commented 2 years ago

Yes, that is useful and throws no error.

But my goal is to use transformers as a feature extractor (i.e., to extract audio embeddings with HuBERT, UniSpeech, wav2vec, etc.), not as a predictor (i.e., the model). In my own experience, larger audio embedding models (hubert-large, wav2vec2-large, unispeechSat-large) tend to perform better. Some of those models raise this error when I load them through a processor, because they lack a tokenizer. I think I should use Wav2Vec2FeatureExtractor instead of Wav2Vec2Processor. But your solution works!

Again, thanks for the great job!
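The embedding-extraction use case described above could be sketched roughly as follows. This is not code from the thread: the helper names `mean_pool` and `extract_embedding` are hypothetical, and averaging over time is just one common way to get a fixed-size vector per utterance, assumed here for illustration. It also assumes a single, unpadded 16 kHz mono waveform.

```python
import torch
from transformers import AutoFeatureExtractor, AutoModel

MODEL_ID = "microsoft/unispeech-sat-base-plus"


def mean_pool(hidden_states):
    """Average frame-level features over time: (batch, frames, dim) -> (batch, dim)."""
    return hidden_states.mean(dim=1)


def extract_embedding(waveform, sampling_rate=16_000, model_id=MODEL_ID):
    """Hypothetical helper: return one fixed-size embedding for one utterance.

    Downloads the checkpoint from the Hub on first use.
    """
    # Only a feature extractor is loaded; no tokenizer is required.
    feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
    # AutoModel loads the bare encoder, without a classification or CTC head.
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    inputs = feature_extractor(
        waveform, sampling_rate=sampling_rate, return_tensors="pt"
    )
    with torch.no_grad():
        outputs = model(**inputs)

    # Assumption: mean pooling over frames; other pooling schemes are possible.
    return mean_pool(outputs.last_hidden_state)
```

The resulting tensor can then be passed to any downstream classifier. For batches of padded utterances, the plain mean would also average over padding frames, so a masked mean would be the safer choice there.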