kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

[WIP] Proof of concept of integration with Hugging Face Hub #356

Closed patrickvonplaten closed 2 years ago

patrickvonplaten commented 3 years ago

Hi @kpu! I hereby propose a proof-of-concept integration with the Hugging Face Hub.

Your library is the go-to library for n-gram models in Python. We have seen multiple blog posts that combine neural network speech recognition models with kenlm for efficient automatic speech recognition, e.g.:

We wanted to ask whether you would be interested in adding huggingface_hub as an optional dependency so that users of your library can more easily share and download trained n-gram models. Instead of having to download the kenlm models manually (e.g. from https://goofy.zamia.org/zamia-speech/lm/), a simple API could be added so that users can load the model directly in Python:

import kenlm
- model = kenlm.Model('lm/test.arpa')
+ model = kenlm.Model.load_from_hub('kenlm/test.arpa')
print(model.score('this is a sentence .', bos = True, eos = True))

This way, the user doesn't have to download the file lm/test.arpa before running the Python script. Instead, the first time the script runs, the model is downloaded from the Hugging Face Hub (https://huggingface.co/) and cached locally, so users only have to download the model once.
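For illustration, the download-then-cache behaviour could be sketched roughly as below. This is only a minimal sketch, not the code in this PR: it assumes the existing `hf_hub_download` function from huggingface_hub (which returns the local path of the cached file), and the standalone `load_from_hub` helper, its `repo_id`/`filename` split, and the `download`/`load` hooks are hypothetical names chosen for the example.

```python
def load_from_hub(repo_id, filename, download=None, load=None):
    """Fetch `filename` from the Hub repo `repo_id` and load it as a model.

    `download` and `load` are hypothetical hooks (not part of kenlm); they
    default to huggingface_hub.hf_hub_download and kenlm.Model.
    """
    if download is None:
        # Imported lazily so that huggingface_hub stays an optional dependency.
        from huggingface_hub import hf_hub_download
        download = hf_hub_download
    if load is None:
        import kenlm
        load = kenlm.Model
    # hf_hub_download returns the path of the file in the local cache;
    # later calls reuse the cached copy instead of downloading again.
    local_path = download(repo_id=repo_id, filename=filename)
    return load(local_path)


# Usage (the repo name is a placeholder, not an existing Hub repository):
# model = load_from_hub("some-user/some-lm", "test.arpa")
# print(model.score("this is a sentence .", bos=True, eos=True))
```

Keeping the huggingface_hub import inside the function means that `import kenlm` would keep working for users who never install the optional dependency.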

Many other NLP libraries have already been integrated with the Hub, e.g. https://huggingface.co/spacy/ca_core_news_lg, which lets their users download models directly, share models with each other, and see download statistics.

We would love to integrate kenlm with the Hub as well so that the community can easily download and share fast n-gram models. We are planning on showcasing some nice Wav2Vec2 + kenlm CTC decoding (similar to what pyctcdecode is doing here: https://github.com/kensho-technologies/pyctcdecode/blob/main/tutorials/02_pipeline_huggingface.ipynb).

Does this sound interesting to you? If so, I would be more than happy to make a nicer pull request and add an example to the README :-)

Some follow-ups could be:

cc @osanseviero @LysandreJik, @julien-c, @anton-l

rezatakhshid commented 2 years ago

Would love to see this happening.