huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

cannot import name 'SpeechEncoderDecoder' from 'transformers' - wav2vec2-xls-r-2b-22-to-16 #14540

Closed - programmeddeath1 closed this issue 2 years ago

programmeddeath1 commented 2 years ago

Hi, I am currently trying to run this model - facebook/wav2vec2-xls-r-2b-22-to-16 https://huggingface.co/facebook/wav2vec2-xls-r-2b-22-to-16

The example code using the pipeline produces significantly different results from the API hosted on Hugging Face. I recorded the same audio and sent it to the API through the website, and also ran the sample code on Colab; the outputs are quite different.

I also tried running the second, step-by-step method; it fails with "cannot import name 'SpeechEncoderDecoder' from 'transformers'".

I tried the latest transformers library as well as 4.11.3. Could you check what could be wrong? I can share my Colab if needed.

Thanks for your help in advance.

programmeddeath1 commented 2 years ago

Hi,

1) The pretrained model 'xls_r_2b_22_16.pt' is 19.6 GB on the fairseq GitHub repo, while the xls_r_2b_22_16 model on the Hugging Face Hub is 9.8 GB. Could this be the issue? Are these different models that were uploaded?
2) The second sample code shows SpeechEncoderDecoder; it should be SpeechEncoderDecoderModel.

Thanks.

patrickvonplaten commented 2 years ago

Hey @programmeddeath1,

Thanks for noticing the import bug. I just corrected all the model cards to import SpeechEncoderDecoderModel instead. Regarding the different results - could you give me an example so that I can debug the API against your code snippet? :-)

Thanks!
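For reference, a minimal sketch of the corrected snippet (only the class name changes; the checkpoint name is the one from the model card):

from transformers import SpeechEncoderDecoderModel  # the cards previously said SpeechEncoderDecoder

# Loading the checkpoint itself is unchanged.
model = SpeechEncoderDecoderModel.from_pretrained("facebook/wav2vec2-xls-r-2b-22-to-16")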

patrickvonplaten commented 2 years ago

Regarding 1): The fairseq checkpoint is so large because all the training states are included, which are unnecessary for inference. I made sure that the HF checkpoint behaves exactly like the fairseq one by running multiple integration tests.
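If you want to verify this yourself, a rough sketch (the "model" key name is an assumption and may differ between fairseq versions):

import torch

# Loading the full 19.6 GB checkpoint needs a machine with enough RAM.
ckpt = torch.load("xls_r_2b_22_16.pt", map_location="cpu")

# Besides the weights, fairseq checkpoints typically carry optimizer/training state,
# which is what inflates the file size; the HF checkpoint drops all of that.
print(ckpt.keys())

if "model" in ckpt:  # assumed key for the weight dict
    n_params = sum(p.numel() for p in ckpt["model"].values())
    print(f"~{n_params / 1e9:.1f}B parameters in the weights alone")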

programmeddeath1 commented 2 years ago

Hi, 1) I ran the sample code using the ASR pipeline on Colab:

import torch
import torchaudio
import matplotlib.pyplot as plt

import IPython

print(torch.__version__)
print(torchaudio.__version__)

torch.random.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu" 
MAPPING = {
    "en": 250004,
    "de": 250003,
    "tr": 250023,
    "fa": 250029,
    "sv": 250042,
    "mn": 250037,
    "zh": 250025,
    "cy": 250007,
    "ca": 250005,
    "sl": 250052,
    "et": 250006,
    "id": 250032,
    "ar": 250001,
    "ta": 250044,
    "lv": 250017,
    "ja": 250012,
}
from datasets import load_dataset
from transformers import pipeline

# select correct `forced_bos_token_id`
forced_bos_token_id = MAPPING["en"]

# replace following lines to load an audio file of your choice
librispeech_en = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
audio_file = librispeech_en[0]["file"]

asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-xls-r-2b-22-to-16", feature_extractor="facebook/wav2vec2-xls-r-2b-22-to-16")

translation = asr(audio_file, forced_bos_token_id=forced_bos_token_id)
# This translation is different from the audio file 
print(translation)
IPython.display.Audio(audio_file)

2) I also used the same hugging face api code from - https://huggingface.co/spaces/facebook/XLS-R-2B-22-16/tree/main

Here is the Colab link if you want to see the code: https://colab.research.google.com/drive/1keVShJfrB68IeXn44UYB4OPoC2qDIGk1?usp=sharing . I ran it on Colab Pro with a GPU.

I recorded the audio and ran it on the Hugging Face Spaces demo, and it translated perfectly. I then downloaded the audio and uploaded it to my instance of the same API on Colab and on AWS, but it produces random repetitive text like this: "The amendment number one hundred and twenty-eight from the amendment number two hundred and twenty-eight from the".

I am not sure where I am going wrong.

Thanks for your help!

patrickvonplaten commented 2 years ago

Hey @programmeddeath1,

You might not have resampled the audio correctly. Essentially, what I would try is to just copy-paste the code of the Spaces demo here: https://huggingface.co/spaces/facebook/XLS-R-2B-22-16/blob/main/app.py
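For reference, a minimal sketch of what correct resampling looks like (assuming librosa is installed and my_recording.wav is a hypothetical local file; the Spaces app does the equivalent internally):

import librosa
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-xls-r-2b-22-to-16",
    feature_extractor="facebook/wav2vec2-xls-r-2b-22-to-16",
)

# The model expects 16 kHz mono audio; librosa resamples on load.
speech, _ = librosa.load("my_recording.wav", sr=16_000, mono=True)

# 250004 is the MBart-50 BOS id for English (see the MAPPING dict above).
# Depending on your transformers version, you may need to pass the file path
# instead of the raw array.
print(asr(speech, forced_bos_token_id=250004))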

Nithin-Holla commented 2 years ago

@patrickvonplaten Only slightly related to this issue, but I am not able to initialise the processor like in the example on the model card:

import torch
from transformers import Speech2Text2Processor, SpeechEncoderDecoderModel

model = SpeechEncoderDecoderModel.from_pretrained("facebook/wav2vec2-xls-r-300m-en-to-15")
processor = Speech2Text2Processor.from_pretrained("facebook/wav2vec2-xls-r-300m-en-to-15")

I get the following error on transformers v4.12.5:

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'MBart50Tokenizer'. 
The class this function is called from is 'Speech2Text2Tokenizer'.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/nithinholla/opt/anaconda3/lib/python3.8/site-packages/transformers/models/speech_to_text_2/processing_speech_to_text_2.py", line 106, in from_pretrained
    tokenizer = Speech2Text2Tokenizer.from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "/Users/nithinholla/opt/anaconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1744, in from_pretrained
    return cls._from_pretrained(
  File "/Users/nithinholla/opt/anaconda3/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1872, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/Users/nithinholla/opt/anaconda3/lib/python3.8/site-packages/transformers/models/speech_to_text_2/tokenization_speech_to_text_2.py", line 85, in __init__
    with open(vocab_file, encoding="utf-8") as vocab_handle:
TypeError: expected str, bytes or os.PathLike object, not NoneType

programmeddeath1 commented 2 years ago

Hi @patrickvonplaten, thank you for your reply. That is what I have tried on Colab and on an AWS GPU instance: I copy-pasted and ran the same Spaces code that you shared. Sampling shouldn't be a problem since the audio is resampled using librosa in your code. I have not changed anything and have hosted the same code on Colab right now; https://13310.gradio.app is the link. It is decoding to something quite random compared to the audio input.

patrickvonplaten commented 2 years ago

@Nithin-Holla,

Good catch! Yeah, at the moment it's actually not possible to create a processor for "facebook/wav2vec2-xls-r-300m-en-to-15". I have an open PR that will enable this - hope to get it merged by next week. But it should be the Wav2Vec2Processor then :-)
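In the meantime, a possible workaround is to load the feature extractor and tokenizer separately instead of through a single processor class (a sketch, not the final API; the silent waveform below is a placeholder for your real 16 kHz audio):

import numpy as np
from transformers import AutoFeatureExtractor, AutoTokenizer, SpeechEncoderDecoderModel

model_id = "facebook/wav2vec2-xls-r-300m-en-to-15"

model = SpeechEncoderDecoderModel.from_pretrained(model_id)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)  # Wav2Vec2-style feature extractor
tokenizer = AutoTokenizer.from_pretrained(model_id)                 # resolves to the MBart-50 tokenizer

waveform = np.zeros(16_000, dtype=np.float32)  # placeholder: one second of silence at 16 kHz

inputs = feature_extractor(waveform, sampling_rate=16_000, return_tensors="pt")
generated_ids = model.generate(inputs["input_values"])
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))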

patrickvonplaten commented 2 years ago

@programmeddeath1

The model behaves correctly for me locally for an input waveform. Could you maybe send me a link to an audio file which gives different results for you?

programmeddeath1 commented 2 years ago

Hi @patrickvonplaten,

Does the model hub have different instances from which it serves the models to different regions of the world? Running this audio file (https://github.com/programmeddeath1/webhost/blob/master/ghwsa-1f9vw.wav) on my local instance (https://16618.gradio.app) gives the output shown in the attached screenshot (Screenshot from 2021-12-02 07-56-37).

But it gives this output if I use the API (translation: "these courses I am able to write much more complex code in python").

The code is exactly the same; the only difference could be in the resources being downloaded from the Hub. Can you check on an instance where the resources are downloaded afresh from the Hub? The Gradio app I have attached is currently functioning.

patrickvonplaten commented 2 years ago

There is only one model that is being used. I can't open your file with soundfile - see: https://colab.research.google.com/drive/1fVd18B1lwKeoTqw9ucMzjuS5wE-sIMA0?usp=sharing

patrickvonplaten commented 2 years ago

Also, let's try to solve this together - @programmeddeath1, could you create a Google Colab in which you re-create the Spaces demo, so that we can see together how the output could be different? I just checked again and the model works as expected for me locally.

programmeddeath1 commented 2 years ago

Hi @patrickvonplaten, I have added the Spaces code and shared the Colab with you.

https://colab.research.google.com/drive/1Bk9XGoDnxg3wadKVecREXjUth5dMkTky?authuser=2#scrollTo=oKv64CiwHni5

I have uploaded the same audio file and the output can be seen on the colab display (It’s a nice beach, a nice beach and a nice beach.)

Please run the same code and upload the same audio file, or a similar one, on the Colab.

We could get on a short call; I can share my screen and show the execution while I download and run it.

Thanks!

patrickvonplaten commented 2 years ago

Hey @programmeddeath1,

I sadly can't open the colab. It says:

Notebook loading error
There was an error loading this notebook. Ensure that the file is accessible and try again.
Invalid Credentials

=> Can you make sure the google colab is accessible by everyone?

programmeddeath1 commented 2 years ago

Hi @patrickvonplaten, here is the open link to view: https://colab.research.google.com/drive/1Bk9XGoDnxg3wadKVecREXjUth5dMkTky?usp=sharing

I had shared editor access with your account - patrick@huggingface.co. I have now shared it with your Gmail account too.

Tell me if I should give edit access on the open link; since it could then be edited by anyone, I have only given comment access for now.

patrickvonplaten commented 2 years ago

Hey @programmeddeath1,

I can now access the Google Colab, but this doesn't really help me find the problem. Sorry, I've probably not been very clear in the previous message.

What I need to efficiently find a possible difference is:

a) An audio file that I can run the Spaces demo with. You already provided one here: wget https://github.com/programmeddeath1/webhost/blob/master/ghwsa-1f9vw.wav . However, this audio file is not readable; it's broken, and I cannot work with it. BTW, I now added the possibility to directly upload an audio file to the demo: https://huggingface.co/spaces/facebook/XLS-R-2B-22-16 . So I just need an audio file that I can then upload to the demo now.

b) A simple Python script (just transformers code - no Gradio app that should start in a Colab) that runs the same code as the demo (https://huggingface.co/spaces/facebook/XLS-R-2B-22-16) but gives a different result.

I sadly cannot help otherwise. Could you please correct a) & b)?

programmeddeath1 commented 2 years ago

Hi @patrickvonplaten, I was wondering if we could get on a Google Meet/Zoom call together and debug - I can make things quite clear. Should I send an invite?

patrickvonplaten commented 2 years ago

Hey @programmeddeath1,

I'm very sorry, but I don't have the time to schedule Google Meets for specific issues. We're trying to tackle hundreds of issues every day at HF and have to be as efficient as possible. Could you maybe take a look at this document explaining how to best ask for help: https://github.com/huggingface/transformers/blob/master/ISSUES.md ? :-)

Thanks!

programmeddeath1 commented 2 years ago

Hey, sorry I could not reply due to a few deliveries. I am trying to set up the Colab as per your previous comment. When I converted it to a simple Python script that loads the model and runs on a local file, it works properly, but in the same Colab or on AWS, if I run your Gradio app as is, it keeps giving junk responses. It seems like it's an issue with the forced_bos_token_id. I hardcoded the token id to English - 250004 - and it's working fine now. Thank you for your help!
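For anyone hitting the same symptom, the change amounts to pinning the target-language token instead of relying on the value coming from the UI (a sketch against the pipeline snippet earlier in this thread, which defines asr and audio_file):

# 250004 is the MBart-50 token id for English (see the MAPPING dict above).
translation = asr(audio_file, forced_bos_token_id=250004)
print(translation)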

programmeddeath1 commented 2 years ago

I uploaded these two files

https://github.com/programmeddeath1/webhost/blob/master/7_4.wav
https://github.com/programmeddeath1/webhost/blob/master/9_4.wav

The model gives quite weird results with these audio files. This is Indian English, and the model gives quite good results for other audio files with a similar accent. Can you listen to the audio and tell me whether this is an issue with the audio samples (to my ear they sound quite legible) or a model issue, and whether I need to fine-tune the model further? If so, can you guide me to the right resources to improve the model? Thanks!