huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

AutoProcessor.from_pretrained doesn't support MCTCT Models #23853

Closed Ubadub closed 1 year ago

Ubadub commented 1 year ago

System Info

Not actually relevant, but included for completeness:

Who can help?

@sanchit-gandhi

Reproduction

from transformers import AutoProcessor, MCTCTProcessor
mctc_proc1 = AutoProcessor.from_pretrained("speechbrain/m-ctc-t-large")
mctc_proc2 = MCTCTProcessor.from_pretrained("speechbrain/m-ctc-t-large")
print(f"AutoProcessor: {mctc_proc1}")
print(f"MCTCTProcessor: {mctc_proc2}")

The first print shows just a Wav2Vec2CTCTokenizer instance, while the second shows an MCTCTProcessor instance (containing an MCTCTFeatureExtractor feature extractor and a Wav2Vec2CTCTokenizer tokenizer).
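A quick isinstance check makes the mismatch concrete (a minimal sketch building on the snippet above; the assertions encode the current behavior described here, not the expected one):

from transformers import AutoProcessor, MCTCTProcessor

auto_proc = AutoProcessor.from_pretrained("speechbrain/m-ctc-t-large")
direct_proc = MCTCTProcessor.from_pretrained("speechbrain/m-ctc-t-large")

# The dedicated class returns the full processor...
assert isinstance(direct_proc, MCTCTProcessor)
# ...but the auto class currently resolves to a bare tokenizer instead.
assert not isinstance(auto_proc, MCTCTProcessor)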

Expected behavior

AutoProcessor.from_pretrained should return an MCTCTProcessor instance when the provided model is an MCTCT model.

It does not do so currently because the PROCESSOR_MAPPING_NAMES mapping used by AutoProcessor has no entry for MCTCT:

PROCESSOR_MAPPING_NAMES = OrderedDict(
    [
        ("align", "AlignProcessor"),
        ("altclip", "AltCLIPProcessor"),
        ("blip", "BlipProcessor"),
        ("blip-2", "Blip2Processor"),
        ("bridgetower", "BridgeTowerProcessor"),
        ("chinese_clip", "ChineseCLIPProcessor"),
        ("clap", "ClapProcessor"),
        ("clip", "CLIPProcessor"),
        ("clipseg", "CLIPSegProcessor"),
        ("flava", "FlavaProcessor"),
        ("git", "GitProcessor"),
        ("groupvit", "CLIPProcessor"),
        ("hubert", "Wav2Vec2Processor"),
        ("layoutlmv2", "LayoutLMv2Processor"),
        ("layoutlmv3", "LayoutLMv3Processor"),
        ("markuplm", "MarkupLMProcessor"),
        ("mgp-str", "MgpstrProcessor"),
        ("oneformer", "OneFormerProcessor"),
        ("owlvit", "OwlViTProcessor"),
        ("pix2struct", "Pix2StructProcessor"),
        ("sam", "SamProcessor"),
        ("sew", "Wav2Vec2Processor"),
        ("sew-d", "Wav2Vec2Processor"),
        ("speech_to_text", "Speech2TextProcessor"),
        ("speech_to_text_2", "Speech2Text2Processor"),
        ("speecht5", "SpeechT5Processor"),
        ("trocr", "TrOCRProcessor"),
        ("tvlt", "TvltProcessor"),
        ("unispeech", "Wav2Vec2Processor"),
        ("unispeech-sat", "Wav2Vec2Processor"),
        ("vilt", "ViltProcessor"),
        ("vision-text-dual-encoder", "VisionTextDualEncoderProcessor"),
        ("wav2vec2", "Wav2Vec2Processor"),
        ("wav2vec2-conformer", "Wav2Vec2Processor"),
        ("wavlm", "Wav2Vec2Processor"),
        ("whisper", "WhisperProcessor"),
        ("xclip", "XCLIPProcessor"),
    ]
)

An MCTCTProcessor class already exists, and its from_pretrained method behaves appropriately. AutoProcessor should behave the same way, rather than falling back to the tokenizer.

The fix seems simple enough: add the entry below to PROCESSOR_MAPPING_NAMES, keeping its alphabetical order (though I am far from an expert):

("mctct", "MCTCTProcessor"),

For comparison, the AutoModel.from_pretrained method does support MCTCT, and thus behaves appropriately, because its mapping contains an entry for MCTCT.
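For reference, the analogous AutoModel entry looks like the following (quoted from my reading of the transformers source; confirm against your installed version):

("mctct", "MCTCTModel"),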

sgugger commented 1 year ago

cc @sanchit-gandhi