huggingface / optimum

πŸš€ Accelerate training and inference of πŸ€— Transformers and πŸ€— Diffusers with easy-to-use hardware optimization tools
https://huggingface.co/docs/optimum/main/
Apache License 2.0

audio-spectrogram-transformer support on Optimum export #1533

Closed · vymao closed this issue 1 month ago

vymao commented 1 year ago

Feature request

It would be great if we could export audio-spectrogram-transformer models to ONNX using Optimum. Right now, I get this error:

(transformers-v2) victor@Victors-MBP Desktop % optimum-cli export onnx --model MIT/ast-finetuned-audioset-10-10-0.4593 /tmp/onyx --optimize O1 --device cpu
Framework not specified. Using pt to export to ONNX.
Automatic task detection to audio-classification.
Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.
Using the export variant default. Available variants are:
    - default: The default ONNX variant.
Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.
Using framework PyTorch: 2.1.0
Traceback (most recent call last):
  File "/Users/victor/anaconda3/envs/transformers-v2/lib/python3.9/site-packages/optimum/onnxruntime/optimization.py", line 67, in __init__
    self.normalized_config = NormalizedConfigManager.get_normalized_config_class(self.model_type)(self.config)
  File "/Users/victor/anaconda3/envs/transformers-v2/lib/python3.9/site-packages/optimum/utils/normalized_config.py", line 271, in get_normalized_config_class
    cls.check_supported_model(model_type)
  File "/Users/victor/anaconda3/envs/transformers-v2/lib/python3.9/site-packages/optimum/utils/normalized_config.py", line 263, in check_supported_model
    raise KeyError(
KeyError: 'audio-spectrogram-transformer model type is not supported yet in NormalizedConfig. Only albert, bart, bert, blenderbot, blenderbot-small, bloom, falcon, camembert, codegen, cvt, deberta, deberta-v2, deit, distilbert, donut-swin, electra, encoder-decoder, gpt2, gpt-bigcode, gpt-neo, gpt-neox, llama, gptj, imagegpt, longt5, marian, mbart, mistral, mt5, m2m-100, nystromformer, opt, pegasus, pix2struct, poolformer, regnet, resnet, roberta, speech-to-text, splinter, t5, trocr, whisper, vision-encoder-decoder, vit, xlm-roberta, yolos, mpt are supported. If you want to support audio-spectrogram-transformer please propose a PR or open up an issue.'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/victor/anaconda3/envs/transformers-v2/bin/optimum-cli", line 8, in <module>
    sys.exit(main())
  File "/Users/victor/anaconda3/envs/transformers-v2/lib/python3.9/site-packages/optimum/commands/optimum_cli.py", line 163, in main
    service.run()
  File "/Users/victor/anaconda3/envs/transformers-v2/lib/python3.9/site-packages/optimum/commands/export/onnx.py", line 246, in run
    main_export(
  File "/Users/victor/anaconda3/envs/transformers-v2/lib/python3.9/site-packages/optimum/exporters/onnx/__main__.py", line 554, in main_export
    optimizer = ORTOptimizer.from_pretrained(output, file_names=onnx_files_subpaths)
  File "/Users/victor/anaconda3/envs/transformers-v2/lib/python3.9/site-packages/optimum/onnxruntime/optimization.py", line 119, in from_pretrained
    return cls(onnx_model_path, config=config, from_ortmodel=from_ortmodel)
  File "/Users/victor/anaconda3/envs/transformers-v2/lib/python3.9/site-packages/optimum/onnxruntime/optimization.py", line 69, in __init__
    raise NotImplementedError(
NotImplementedError: Tried to use ORTOptimizer for the model type audio-spectrogram-transformer, but it is not available yet. Please open an issue or submit a PR at https://github.com/huggingface/optimum.
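For context, the two chained errors above come from a plain registry lookup: the optimizer resolves the model type against a table of supported architectures, the lookup raises KeyError, and the optimizer wrapper re-raises it as NotImplementedError. A minimal self-contained sketch of that pattern (illustrative only, not Optimum's actual internals — names and the abbreviated support set are assumptions):

```python
# Illustrative sketch of the error chain seen in the traceback above.
# The real registry lives in optimum/utils/normalized_config.py.
SUPPORTED = {"bert", "vit", "whisper"}  # abbreviated; the real list is much longer


def get_normalized_config_class(model_type: str):
    """Look up the config class for a model type; raise KeyError if absent."""
    if model_type not in SUPPORTED:
        raise KeyError(
            f"{model_type} model type is not supported yet in NormalizedConfig."
        )
    return object  # placeholder for the real config class


def make_optimizer(model_type: str):
    """Mimic ORTOptimizer.__init__ converting the KeyError into a clearer error."""
    try:
        return get_normalized_config_class(model_type)
    except KeyError:
        raise NotImplementedError(
            f"Tried to use ORTOptimizer for the model type {model_type}, "
            "but it is not available yet."
        )


try:
    make_optimizer("audio-spectrogram-transformer")
except NotImplementedError as e:
    print(type(e).__name__)  # NotImplementedError
```

Adding support would therefore amount to registering a NormalizedConfig entry for audio-spectrogram-transformer, as the error message's last sentence suggests.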

Motivation

I would like to use this model in ONNX.

Your contribution

No

xenova commented 11 months ago

Hi there πŸ‘‹ You should be able to export the model without --optimize O1. Is the O1 optimization necessary for your use-case?

$ optimum-cli export onnx --model MIT/ast-finetuned-audioset-10-10-0.4593 out
Framework not specified. Using pt to export to ONNX.
Downloading config.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 26.8k/26.8k [00:00<00:00, 95.3MB/s]
Downloading model.safetensors: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 346M/346M [00:07<00:00, 48.1MB/s]
Automatic task detection to audio-classification.
Downloading (…)rocessor_config.json: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 297/297 [00:00<00:00, 1.44MB/s]
Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.
Using the export variant default. Available variants are:
    - default: The default ONNX variant.
Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.
Using framework PyTorch: 2.1.0+cu121
Post-processing the exported models...
Deduplicating shared (tied) weights...
Validating ONNX model out/model.onnx...
        -[βœ“] ONNX model output names match reference model (logits)
        - Validating ONNX Model output "logits":
                -[βœ“] (2, 527) matches (2, 527)
                -[βœ“] all values close (atol: 0.0001)
The ONNX export succeeded and the exported model was saved at: out
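For reference, the "Validating ONNX model" step in the log above boils down to a shape check plus an element-wise tolerance check against the reference (PyTorch) outputs. A minimal sketch of that logic (illustrative, not Optimum's actual code):

```python
import numpy as np


def outputs_match(reference: np.ndarray, exported: np.ndarray,
                  atol: float = 1e-4) -> bool:
    """Shapes must match and every value must agree within atol,
    mirroring the exporter's validation report."""
    return (reference.shape == exported.shape
            and np.allclose(reference, exported, atol=atol))


ref = np.zeros((2, 527), dtype=np.float32)  # the AudioSet head has 527 classes
assert outputs_match(ref, ref + 5e-5)       # within atol: passes
assert not outputs_match(ref, ref + 1e-2)   # outside atol: fails
```

A "(2, 527) matches (2, 527)" line plus "all values close (atol: 0.0001)" in the log corresponds to both checks passing.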

ivarflakstad commented 1 month ago

It seems to me that this issue has been solved. Feel free to reopen if you disagree :)