ASR example doesn't save tokenizer settings

RobertBaruch commented 1 year ago

System Info

transformers version: 4.28.1
Platform: Windows-10-10.0.22621-SP0
Python version: 3.11.2
Huggingface_hub version: 0.14.1
Safetensors version: not installed
PyTorch version (GPU?): 2.0.1+cu117 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: NO
Using distributed or parallel set-up in script?: NO

Who can help?

@sgugger

Information

[X] The official example scripts
[ ] My own modified scripts

Tasks

[X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction

Run training using run_speech_recognition_ctc.py and the included json file.

train.json.zip

Next, attempt to infer using the trained model:

import os.path

from datasets import load_dataset
from datasets import Audio
from transformers import pipeline, AutomaticSpeechRecognitionPipeline

cv13 = load_dataset(
    "mozilla-foundation/common_voice_13_0",
    "eo",
    split="train[:10]",
    )
print(cv13[0])
cv13 = cv13.cast_column("audio", Audio(sampling_rate=16000))
sampling_rate = cv13.features["audio"].sampling_rate
audio_file = cv13[0]["audio"]["path"]
d, n = os.path.split(audio_file)
audio_file = os.path.join(d, "eo_train_0", n)
print(audio_file)

transcriber: AutomaticSpeechRecognitionPipeline = pipeline(
    "automatic-speech-recognition",
    model="xekri/wav2vec2-common_voice_13_0-eo-demo2",
)
print(transcriber(audio_file))

Output:

Found cached dataset common_voice_13_0 (C:/Users/rober/.cache/huggingface/datasets/mozilla-foundation___common_voice_13_0/eo/13.0.0/22809012aac1fc9803eaffc44122e4149043748e93933935d5ea19898587e4d7)
{'client_id': 'b8c51543fe043c8f27d0de0428e060e309d9d824ac9ad33e40aba7062dafd99e2e87bbedc671007e31973afb599b1c290dbd922637b79132727b5f37bc1ee88e', 'path': 'C:\\Users\\rober\\.cache\\huggingface\\datasets\\downloads\\extracted\\1dea8f044902d398c6cb09bfb5629dc2fbd80a6309ddd435c4554fa38f730472\\common_voice_eo_20453647.mp3', 'audio': {'path': 'C:\\Users\\rober\\.cache\\huggingface\\datasets\\downloads\\extracted\\1dea8f044902d398c6cb09bfb5629dc2fbd80a6309ddd435c4554fa38f730472\\common_voice_eo_20453647.mp3', 'array': array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
       -1.16407300e-11,  1.07661449e-12, -1.71219774e-11]), 'sampling_rate': 48000}, 'sentence': 'Ĉu ili tiel plaĉas al vi?', 'up_votes': 2, 'down_votes': 0, 'age': 'twenties', 'gender': 'male', 'accent': 'Internacia', 'locale': 'eo', 'segment': '', 'variant': ''}
C:\Users\rober\.cache\huggingface\datasets\downloads\extracted\1dea8f044902d398c6cb09bfb5629dc2fbd80a6309ddd435c4554fa38f730472\eo_train_0\common_voice_eo_20453647.mp3
Downloading (…)lve/main/config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.27k/2.27k [00:00<?, ?B/s]
F:\eo-reco\.env\Lib\site-packages\huggingface_hub\file_download.py:133: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\rober\.cache\huggingface\hub. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
  warnings.warn(message)
Downloading pytorch_model.bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.26G/1.26G [01:56<00:00, 10.8MB/s]
Traceback (most recent call last):
  File "F:\eo-reco\infer.py", line 20, in <module>
    transcriber: AutomaticSpeechRecognitionPipeline = pipeline(
                                                      ^^^^^^^^^
  File "F:\eo-reco\.env\Lib\site-packages\transformers\pipelines\__init__.py", line 876, in pipeline
    tokenizer = AutoTokenizer.from_pretrained(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\eo-reco\.env\Lib\site-packages\transformers\models\auto\tokenization_auto.py", line 723, in from_pretrained
    return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\eo-reco\.env\Lib\site-packages\transformers\tokenization_utils_base.py", line 1795, in from_pretrained
    raise EnvironmentError(
OSError: Can't load tokenizer for 'xekri/wav2vec2-common_voice_13_0-eo-demo2'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'xekri/wav2vec2-common_voice_13_0-eo-demo2' is the correct path to a directory containing all relevant files for a Wav2Vec2CTCTokenizer tokenizer.

Checking the uploaded repo, it seems that no tokenizer-related files (e.g. vocab.json, tokenizer_config.json, etc) were pushed.
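
For anyone who wants to check this locally before pushing, here is a minimal sketch using only the standard library. It is not part of the example script: the helper name `missing_tokenizer_files` is hypothetical, and treating exactly the two file names mentioned above as required is an assumption made for illustration.

```python
from pathlib import Path

# File names taken from the issue text; treating exactly these two as
# required is an assumption made for this sketch.
REQUIRED_TOKENIZER_FILES = ("vocab.json", "tokenizer_config.json")


def missing_tokenizer_files(output_dir, required=REQUIRED_TOKENIZER_FILES):
    """Return the required tokenizer files that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in required if not (root / name).is_file()]
```

Calling `missing_tokenizer_files(training_args.output_dir)` just before the final push should return an empty list when the tokenizer was saved, and a non-empty list in the failing case described here.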

I added some debug output to run_speech_recognition_ctc.py and found that these files were generated locally but were then deleted during step 7, when the Trainer is initialized (line 701).

The output from run_speech_recognition_ctc.py at that point was:

loading file vocab.json
loading file tokenizer_config.json
loading file added_tokens.json
loading file special_tokens_map.json
Adding <s> to the vocabulary
Adding </s> to the vocabulary
Cloning https://huggingface.co/xekri/wav2vec2-common_voice_13_0-eo-demo into local empty directory.
05/08/2023 15:06:23 - WARNING - huggingface_hub.repository - Cloning https://huggingface.co/xekri/wav2vec2-common_voice_13_0-eo-demo into local empty directory.
max_steps is given, it will override any value given in num_train_epochs

It seems that instantiating Trainer with push_to_hub=True creates a new repo and then empties the local directory so that it can clone the repo into it. This deletes any files already written to the local directory, including the tokenizer configs.

Expected behavior

No error.

RobertBaruch commented 1 year ago

The comment on Trainer.push_to_hub does say "Upload *self.model* and *self.tokenizer* to the 🤗 model hub", and it does in fact call save_pretrained on the trainer's tokenizer. However, in run_speech_recognition_ctc.py, the Trainer's tokenizer argument is set to feature_extractor at initialization, and Wav2Vec2FeatureExtractor.save_pretrained does not save tokenizer settings.

RobertBaruch commented 1 year ago

When I change these lines at the end of run_speech_recognition_ctc.py from this:

    if training_args.push_to_hub:
        trainer.push_to_hub(**kwargs)
    else:
        trainer.create_model_card(**kwargs)

to this:

    tokenizer.save_pretrained(training_args.output_dir)
    trainer.create_model_card(**kwargs)
    if training_args.push_to_hub:
        trainer.push_to_hub(**kwargs)

we do get the tokenizer files. Also, we may as well write the model card in either case.

amyeroberts commented 1 year ago

cc @sanchit-gandhi

hollance commented 1 year ago

The code in the run_speech_recognition_ctc.py script as well as the instructions from the ASR guide that you used in issue https://github.com/huggingface/transformers/issues/23188 do the following:

trainer = Trainer(
    ...
    tokenizer=processor.feature_extractor,
    ...
)

The "processor" combines the feature extractor and tokenizer into a single class, but because we only pass the feature extractor to the Trainer, the tokenizer doesn't get saved. So that's clearly a mistake on our end.

The following fix should work:

trainer = Trainer(
    ...
    tokenizer=processor,
    ...
)

We're updating the docs to fix this. (It's a bit confusing that this Trainer argument is called tokenizer, but it is what's responsible for saving the non-model files.)
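
To illustrate the mechanism described in this comment, here is a small sketch with stand-in classes. None of these are the real transformers classes: `StubFeatureExtractor`, `StubTokenizer`, `StubProcessor`, and `files_saved_by` are hypothetical names, and the file names they write are only illustrative. The sketch assumes nothing more than what the thread states: the object passed as `tokenizer` has its `save_pretrained` called, and a processor saves both of its parts.

```python
import tempfile
from pathlib import Path


class StubFeatureExtractor:
    """Stand-in for a feature extractor: writes only its own config."""

    def save_pretrained(self, save_directory):
        Path(save_directory, "preprocessor_config.json").write_text("{}")


class StubTokenizer:
    """Stand-in for a tokenizer: writes the vocabulary and tokenizer config."""

    def save_pretrained(self, save_directory):
        Path(save_directory, "vocab.json").write_text("{}")
        Path(save_directory, "tokenizer_config.json").write_text("{}")


class StubProcessor:
    """Stand-in for a processor that wraps both objects and saves both."""

    def __init__(self, feature_extractor, tokenizer):
        self.feature_extractor = feature_extractor
        self.tokenizer = tokenizer

    def save_pretrained(self, save_directory):
        self.feature_extractor.save_pretrained(save_directory)
        self.tokenizer.save_pretrained(save_directory)


def files_saved_by(tokenizer_arg):
    """Call save_pretrained on whatever was passed as `tokenizer` and list the files."""
    with tempfile.TemporaryDirectory() as tmp:
        tokenizer_arg.save_pretrained(tmp)
        return sorted(p.name for p in Path(tmp).iterdir())


processor = StubProcessor(StubFeatureExtractor(), StubTokenizer())
only_feature_extractor = files_saved_by(processor.feature_extractor)
whole_processor = files_saved_by(processor)
```

With the feature extractor alone, no tokenizer files are written; with the whole processor, they are. That is the difference between tokenizer=processor.feature_extractor and tokenizer=processor in this simplified picture.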

sanchit-gandhi commented 1 year ago

We could probably add a new processor argument to the Trainer directly, @hollance? This would remove the confusion, IMO:

trainer = Trainer(
    ...
    processor=processor,
    ...
)

Here we would expect the user to pass either tokenizer or processor to the Trainer. Within the Trainer, we only use the tokenizer to get the model input name, which, after #20117, we can now get directly from the processor.

RobertBaruch commented 1 year ago

Can confirm, setting tokenizer=processor in run_speech_recognition_ctc.py works. Agree that tokenizer is a bit of a misleading keyword then.

sanchit-gandhi commented 1 year ago

Keeping this open, since we really should update the Trainer to take processor as its own argument rather than relying on tokenizer=processor.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
