huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

AutoProcessor.from_pretrained doesn't support MCTCT Models #23853

Closed Ubadub closed 1 year ago

Ubadub commented 1 year ago

System Info

Not actually relevant, but included for completeness:

Who can help?

@sanchit-gandhi

Reproduction

from transformers import AutoProcessor, MCTCTProcessor
mctc_proc1 = AutoProcessor.from_pretrained("speechbrain/m-ctc-t-large")
mctc_proc2 = MCTCTProcessor.from_pretrained("speechbrain/m-ctc-t-large")
print(f"AutoProcessor: {mctc_proc1}")
print(f"MCTCTProcessor: {mctc_proc2}")

The first print shows just a Wav2Vec2CTCTokenizer instance, while the second shows an MCTCTProcessor instance (containing an MCTCTFeatureExtractor feature extractor and a Wav2Vec2CTCTokenizer tokenizer).
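A quick isinstance check makes the mismatch concrete (a minimal sketch building on the snippet above; the assertions encode the current behavior described here, not the expected one):

from transformers import AutoProcessor, MCTCTProcessor

auto_proc = AutoProcessor.from_pretrained("speechbrain/m-ctc-t-large")
direct_proc = MCTCTProcessor.from_pretrained("speechbrain/m-ctc-t-large")

# The dedicated class returns the full processor...
assert isinstance(direct_proc, MCTCTProcessor)
# ...but the auto class currently resolves to a bare tokenizer instead.
assert not isinstance(auto_proc, MCTCTProcessor)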

Expected behavior

AutoProcessor.from_pretrained should return an MCTCTProcessor instance when the provided model is an MCTCT model.

It does not do so currently because the PROCESSOR_MAPPING_NAMES mapping used by AutoProcessor has no entry for MCTCT:

PROCESSOR_MAPPING_NAMES = OrderedDict(
    [
        ("align", "AlignProcessor"),
        ("altclip", "AltCLIPProcessor"),
        ("blip", "BlipProcessor"),
        ("blip-2", "Blip2Processor"),
        ("bridgetower", "BridgeTowerProcessor"),
        ("chinese_clip", "ChineseCLIPProcessor"),
        ("clap", "ClapProcessor"),
        ("clip", "CLIPProcessor"),
        ("clipseg", "CLIPSegProcessor"),
        ("flava", "FlavaProcessor"),
        ("git", "GitProcessor"),
        ("groupvit", "CLIPProcessor"),
        ("hubert", "Wav2Vec2Processor"),
        ("layoutlmv2", "LayoutLMv2Processor"),
        ("layoutlmv3", "LayoutLMv3Processor"),
        ("markuplm", "MarkupLMProcessor"),
        ("mgp-str", "MgpstrProcessor"),
        ("oneformer", "OneFormerProcessor"),
        ("owlvit", "OwlViTProcessor"),
        ("pix2struct", "Pix2StructProcessor"),
        ("sam", "SamProcessor"),
        ("sew", "Wav2Vec2Processor"),
        ("sew-d", "Wav2Vec2Processor"),
        ("speech_to_text", "Speech2TextProcessor"),
        ("speech_to_text_2", "Speech2Text2Processor"),
        ("speecht5", "SpeechT5Processor"),
        ("trocr", "TrOCRProcessor"),
        ("tvlt", "TvltProcessor"),
        ("unispeech", "Wav2Vec2Processor"),
        ("unispeech-sat", "Wav2Vec2Processor"),
        ("vilt", "ViltProcessor"),
        ("vision-text-dual-encoder", "VisionTextDualEncoderProcessor"),
        ("wav2vec2", "Wav2Vec2Processor"),
        ("wav2vec2-conformer", "Wav2Vec2Processor"),
        ("wavlm", "Wav2Vec2Processor"),
        ("whisper", "WhisperProcessor"),
        ("xclip", "XCLIPProcessor"),
    ]
)

An MCTCTProcessor class already exists, and its from_pretrained method behaves appropriately. AutoProcessor should behave the same way, rather than falling back to the tokenizer.

The fix seems simple enough: add the entry below to PROCESSOR_MAPPING_NAMES, keeping its alphabetical order (though I am far from an expert):

("mctct", "MCTCTProcessor"),

For comparison, the AutoModel.from_pretrained method does support MCTCT, and thus behaves appropriately, because its mapping contains an entry for MCTCT.
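For reference, the analogous AutoModel entry looks like the following (quoted from my reading of the transformers source; confirm against your installed version):

("mctct", "MCTCTModel"),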

sgugger commented 1 year ago

cc @sanchit-gandhi