mozilla / DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Mozilla Public License 2.0

Proposal of DeepSpeech integration with HuggingFace model hub #3484

Open patrickvonplaten opened 3 years ago

patrickvonplaten commented 3 years ago

Hey DeepSpeech team!

At HuggingFace, we would like to propose an integration with the HuggingFace model hub: https://github.com/huggingface/huggingface_hub. The hub hosts models so that users can more easily share and use pre-trained models. It could make it easier for DeepSpeech users to load pre-trained models with a simple .from_pretrained(...) call directly in Python, e.g.:

from ds_ctcdecoder import Scorer 
asr_model = Scorer.from_pretrained("mozilla/deepspeech-0.6.0")
...

while we would host all relevant files / config params:

FLAGS.lm_alpha, FLAGS.lm_beta, FLAGS.lm_binary_path, FLAGS.lm_trie_path, Config.alphabet

online under an organization namespace (e.g. "mozilla") and a model name (e.g. "deepspeech-0.6.0").
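For concreteness, here is a minimal sketch of what such a from_pretrained helper might do under the hood. This is an assumption-laden illustration, not a design: the hf_hub_download helper, the hosted file names, the Alphabet import, and the default alpha/beta values are all placeholders:

# Hypothetical sketch only: resolve the hosted files from the hub,
# then build a Scorer from them (0.6-era constructor signature).
from huggingface_hub import hf_hub_download
from ds_ctcdecoder import Alphabet, Scorer

def scorer_from_pretrained(repo_id, lm_alpha=0.75, lm_beta=1.85):
    # Download (and locally cache) the LM binary, trie, and alphabet files
    lm_binary_path = hf_hub_download(repo_id, "lm.binary")
    lm_trie_path = hf_hub_download(repo_id, "trie")
    alphabet = Alphabet(hf_hub_download(repo_id, "alphabet.txt"))
    return Scorer(lm_alpha, lm_beta, lm_binary_path, lm_trie_path, alphabet)

scorer = scorer_from_pretrained("mozilla/deepspeech-0.6.0")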

In addition, DeepSpeech users could directly try out the model online: e.g. for text-to-speech models this currently looks as follows: https://huggingface.co/julien-c/ljspeech_tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space_train?text=Hello%2C+how+are+you+doing%3F and we're working on a speech-to-text inference API as well: https://huggingface.co/julien-c/mini_an4_asr_train_raw_bpe_valid so that audio files can be drag & dropped online to be transcribed directly.

Public models will always be hosted for free :-)

We would be very happy to discuss a possible integration if you guys are interested!

reuben commented 3 years ago

Hey @patrickvonplaten, thanks for reaching out. Our inference API is built around a C++ library (with a C API), which has bindings for several languages/runtimes, including Python, JS on Node.JS/Electron, C# on .NET, Rust, Java on Android, Swift on iOS, Raspberry Pi, etc.

We try to maintain parity between the language bindings, which means that in order to land in core, this integration would have to be implemented in the C++ library itself and exposed to the bindings. The model loading code works on all the platforms above and is tested in CI for a subset of them.

For a Python only solution, it would be best placed in a separate library that leverages the DeepSpeech Python bindings. It can either live in, or be linked from, the DeepSpeech-examples repository.
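To make that concrete, a separate helper library along those lines could stay entirely outside core. A minimal sketch, assuming the huggingface_hub download helper and the 0.9-era deepspeech Python API; the hosted file names are made up:

from huggingface_hub import hf_hub_download
from deepspeech import Model

def model_from_pretrained(repo_id):
    # Fetch (and cache) the acoustic model and external scorer from the
    # hub; the file names here are illustrative, not an agreed convention.
    model = Model(hf_hub_download(repo_id, "model.pbmm"))
    model.enableExternalScorer(hf_hub_download(repo_id, "kenlm.scorer"))
    return model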

> In addition, DeepSpeech users could directly try out the model online: e.g. for text-to-speech models this currently looks as follows: https://huggingface.co/julien-c/ljspeech_tts_train_tacotron2_raw_phn_tacotron_g2p_en_no_space_train?text=Hello%2C+how+are+you+doing%3F and we're working on a speech-to-text inference API as well: https://huggingface.co/julien-c/mini_an4_asr_train_raw_bpe_valid so that audio files can be drag & dropped online to be transcribed directly.
>
> Public models will always be hosted for free :-)

Will the online inference endpoint also always be free?

xloem commented 2 years ago

huggingface has a normative python model interface: each model architecture is wrapped with code in their codebase. Often this means porting work from other codebases or languages into their python transformers code.

The huggingface inference endpoints do not appear to be free, but they provide free demos of the models for web visitors to try them out.

The open source huggingface transformers codebase is very convenient and easy to use as a developer. It gives developers personal control of advanced technology shortly after it comes out of research, technology that can otherwise seem inaccessible, and transformers is actively being used to drive new research and products across the globe. It ties together models and work that would otherwise be an obscure mess spread across the web. The transformers codebase usually accesses copies of models uploaded to huggingface's cached public git repositories, but it can also load models locally.

I believe at present the only speech-to-text models available on huggingface's model hub are from facebook.

It's possible to set up models in huggingface transformers such that they can be used with multiple different machine learning frameworks: for example a user could theoretically choose between pytorch, jax, or straight tensorflow.
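As a concrete illustration of that choice, one of the facebook speech-to-text checkpoints mentioned above can be loaded through any of the three backends (class names as found in recent transformers releases; per-backend availability is an assumption on my part):

from transformers import Wav2Vec2ForCTC          # pytorch
# from transformers import TFWav2Vec2ForCTC      # tensorflow
# from transformers import FlaxWav2Vec2ForCTC    # jax/flax

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")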

The huggingface codebase does not read like it was written by a computer scientist, and some of its factoring choices can make implementing things more laborious than necessary.

Combining these two projects would likely mean adding to the transformers codebase a python interface to the deepspeech model that uses a normative python library, such as pytorch or jax, to load and run it. Then either a compatible model would be uploaded to huggingface, or the transformers codebase would be extended to retrieve the model directly from mozilla.

Alternatively, mozilla's deepspeech python bindings could be referenced directly from the transformers codebase. This could make the implementation comparatively easy.
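To sketch why the bindings route is attractive: inference already works end to end through the released python package, so no pytorch/jax port is strictly needed. A minimal example, assuming the 0.9-era API and a 16 kHz 16-bit mono wav file:

import wave
import numpy as np
from deepspeech import Model

# Release artifact names from the 0.9.3 release; a hub integration
# would download these instead of reading them from local disk.
model = Model("deepspeech-0.9.3-models.pbmm")
model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

# deepspeech expects 16 kHz, 16-bit, mono PCM samples as int16
with wave.open("audio.wav", "rb") as f:
    audio = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)

print(model.stt(audio))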

To huggingface: in community software we like to make sure that the user always has full control, so as to put protecting people above all other concerns. So there can be bumps when things like hardcoded centralised hosting or for-pay features turn up.

[EDITED with a couple more paragraphs throughout]