huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

[RFC] Updating pipeline models #26690

Open Rocketknight1 opened 9 months ago

Rocketknight1 commented 9 months ago

Feature request

We're considering updating the default models used in transformers pipelines. This has the potential to greatly improve performance and remove limitations caused by the existing models, but it may also break backward compatibility. Many of the default models have not been changed since their tasks were first added to pipelines, so users may assume they are 'permanent' and be surprised by an update.

When updating pipelines, we would aim for the following objectives:

Motivation

We have seen a number of user issues prompted by the default pipeline models in transformers being outdated. For example, the default sentiment-analysis pipeline uses a finetuned distilbert model with a maximum sequence length of 512 tokens. You can see the full list of default models here.

Performance on these tasks could be greatly improved with more modern models that have newer features like longer (potentially unlimited!) context lengths.
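To make the current behaviour concrete, here is a minimal sketch of how the default is picked up when no model is passed (the exact output below is illustrative):

```python
from transformers import pipeline

# Omitting `model` falls back to the task's default checkpoint,
# currently a fine-tuned DistilBERT for sentiment analysis.
classifier = pipeline("sentiment-analysis")
print(classifier("Transformers pipelines make this very easy!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.9998}]
```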

Your contribution

I'll make the PR and potentially train new models for some of these tasks.

julien-c commented 9 months ago

Maybe let's do it for one or two pipelines and we'll see if it breaks many things in the wild? (As long as the model outputs have the same "shape", I'm not sure it would break many things.)
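As long as the output keeps the same list-of-dicts shape, a default swap should be transparent for most callers; users who need full stability can also pin the checkpoint explicitly. A minimal sketch, using the current sentiment-analysis default as the example checkpoint:

```python
from transformers import pipeline

# Pinning an explicit checkpoint shields downstream code from any
# future change to the task's default model.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("Pinned models are unaffected by a default swap."))
```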

sanchit-gandhi commented 9 months ago

Nice idea! We have three audio pipelines in transformers:

  1. Text to Audio (aliased to Text to Speech)
  2. Audio Classification
  3. Automatic Speech Recognition

Text to audio is relatively new, so the default model used there is already up to date: https://huggingface.co/suno/bark-small
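For reference, a minimal sketch of that pipeline; it assumes the dict output with `audio` and `sampling_rate` keys that the text-to-audio pipeline returns today:

```python
from transformers import pipeline

# "text-to-speech" is an alias for the "text-to-audio" task; omitting
# `model` would load the same default (suno/bark-small).
tts = pipeline("text-to-speech", model="suno/bark-small")
speech = tts("Hello, this is the current default text-to-audio model!")
print(speech["sampling_rate"], speech["audio"].shape)
```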

Like text classification, audio classification requires a model specific to the classification task. E.g. for keyword spotting (KWS), you need an audio classification model trained on the KWS task; similarly, for language identification (LID), you need an audio classification model trained on the LID task. Therefore, changing the default model is probably not very useful, since users likely already pass a checkpoint specific to their task.
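To illustrate, a minimal sketch of a KWS setup; `superb/wav2vec2-base-superb-ks` is just one example of a checkpoint trained for keyword spotting, not a proposed default:

```python
from transformers import pipeline

# Audio classification only makes sense with a checkpoint trained on
# the task at hand, e.g. keyword spotting (KWS) here.
kws = pipeline("audio-classification", model="superb/wav2vec2-base-superb-ks")
print(kws("speech_command.wav", top_k=3))
```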

For speech recognition, we should definitely consider updating from Wav2Vec2 to Whisper. There are 5 checkpoint sizes to select from, so there should be one compatible with the hardware/performance constraints you've outlined: https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013
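A minimal sketch of what an updated ASR default could look like in practice (the checkpoint size here is only for illustration):

```python
from transformers import pipeline

# Whisper is a multilingual encoder-decoder model, unlike the current
# English-only Wav2Vec2 CTC default.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
result = asr("interview.flac")
print(result["text"])
```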

Rocketknight1 commented 8 months ago

Don't stale! We're still planning this

github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.