kensho-technologies / pyctcdecode

A fast and lightweight python-based CTC beam search decoder for speech recognition.
Apache License 2.0

[Integration with 🤗 Hugging Face] Add load_from_hub to BeamSearchDecoder #32

Closed patrickvonplaten closed 2 years ago

patrickvonplaten commented 2 years ago

Hey PyCTCDecode team,

Update: Add load_from_hf_hub to BeamSearchDecoder instead

This PR proposes adding the ability to load KenLM models directly from the Hugging Face hub. I've uploaded an example KenLM model at https://huggingface.co/kensho/beamsearch_decoder_dummy so that you can try loading a beam search decoder from the hub as follows:

from pyctcdecode import BeamSearchDecoderCTC

decoder = BeamSearchDecoderCTC.load_from_hf_hub("kensho/beamsearch_decoder_dummy")

Models are hosted for free on the Hugging Face hub, which aims to make sharing and versioning models easier. The user does not need to download the raw model manually (e.g. via wget) and can instead load the model with a single line of code in a Python script: decoder = BeamSearchDecoderCTC.load_from_hf_hub("kensho/beamsearch_decoder_dummy"). The loading method automatically caches the downloaded file, so the model only has to be downloaded once.
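To illustrate the download-once behaviour described above, here is a minimal sketch using only the standard library. It is not how huggingface_hub or pyctcdecode implement caching; `cached_fetch`, `fake_fetch`, and the cache layout are hypothetical names chosen for illustration.

```python
import os
import tempfile
from typing import Callable, List


def cached_fetch(model_id: str, cache_dir: str, fetch: Callable[[str, str], None]) -> str:
    """Return the cached file for `model_id`, calling `fetch` only on a cache miss."""
    # Hypothetical cache layout: one file per model id, with "/" made filesystem-safe.
    target = os.path.join(cache_dir, model_id.replace("/", "--"))
    if not os.path.exists(target):
        os.makedirs(cache_dir, exist_ok=True)
        fetch(model_id, target)
    return target


calls: List[str] = []


def fake_fetch(model_id: str, target: str) -> None:
    # Stand-in for a real network download; records the call and writes placeholder bytes.
    calls.append(model_id)
    with open(target, "wb") as f:
        f.write(b"placeholder model bytes")


cache_dir = tempfile.mkdtemp()
first = cached_fetch("kensho/beamsearch_decoder_dummy", cache_dir, fake_fetch)
second = cached_fetch("kensho/beamsearch_decoder_dummy", cache_dir, fake_fetch)
```

The second call finds the file already in the cache directory and returns the same path without invoking the fetch function again.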

@mikeyshulman @gkucsko @poneill - please let me know what you think about the integration and whether anything can be improved :-)

patrickvonplaten commented 2 years ago

This looks great!

My only question is about testing. While we don't want our tests to actually call out to hf hub, maybe we can still add a test. Even if it's thin and mocks out the load_from_hf_hub to just return the binary contents of the little arpa file checked into the tests directory, it will make sure the code actually runs and returns a kosher LanguageModel. I'd also be in favor of putting huggingface_hub in the dev requirements in setup.py.
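A thin test of the kind suggested here could look roughly like the sketch below. It patches out the download step so that no network call is made and checks that the loader still returns the file contents. `HubClient`, `load_language_model`, and `check_load_with_mocked_hub` are hypothetical stand-ins, not part of pyctcdecode or huggingface_hub.

```python
import os
import tempfile
from unittest import mock


class HubClient:
    """Hypothetical stand-in for the hub download layer."""

    @staticmethod
    def download(model_id: str) -> str:
        raise RuntimeError("network access is not allowed in tests")


def load_language_model(model_id: str) -> bytes:
    """Hypothetical loader: download the file, then read it from disk."""
    path = HubClient.download(model_id)
    with open(path, "rb") as f:
        return f.read()


def check_load_with_mocked_hub() -> bytes:
    with tempfile.TemporaryDirectory() as tmp:
        # Placeholder file standing in for the small ARPA file in the tests directory.
        arpa_path = os.path.join(tmp, "toy.arpa")
        with open(arpa_path, "wb") as f:
            f.write(b"\\data\\\nngram 1=1\n")
        # Patch the download step so it returns the local file instead of calling out.
        with mock.patch.object(HubClient, "download", return_value=arpa_path) as fake:
            contents = load_language_model("some-org/some-model")
        fake.assert_called_once_with("some-org/some-model")
    return contents
```

In the real test suite, the assertion would presumably be made on the resulting language model object rather than on raw bytes.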

As an aside, are there any other guidelines/best practices HF recommends to package developers to make sure hub integration works?

Awesome!

Yeah, that's a good question about testing! Actually, it would be nice to add some functionality so that if pretrained_path is a local path, the method simply passes the file name to __init__(). This could also be tested very easily - I'll update the PR :-)
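The local-path fallback described here might be sketched as follows. This is an assumption about the shape of the logic, not the code from the PR; `resolve_model_dir` and `fake_download` are hypothetical names, and the download function is a stand-in for the real hub client.

```python
import os
import tempfile
from typing import Callable


def resolve_model_dir(pretrained_path: str, download: Callable[[str], str]) -> str:
    """Return a local directory for `pretrained_path`.

    If the argument is an existing local directory it is used directly;
    otherwise it is treated as a hub model id and `download` is asked to fetch it.
    """
    if os.path.isdir(pretrained_path):
        return pretrained_path
    return download(pretrained_path)


def fake_download(model_id: str) -> str:
    # Stand-in for a network download; just creates an empty temporary directory.
    return tempfile.mkdtemp(prefix=model_id.replace("/", "--") + "-")


local_dir = tempfile.mkdtemp()
from_local = resolve_model_dir(local_dir, fake_download)
from_hub = resolve_model_dir("kensho/beamsearch_decoder_dummy", fake_download)
```

Because the local branch never touches the network, a test can exercise it with a directory checked into the repository.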

patrickvonplaten commented 2 years ago

@mikeyshulman @gkucsko - thanks a lot for the review! I applied the proposed changes. All tests except pyctcdecode/tests/test_decoder.py::TestSerialization::test_load_from_hub_offline are now passing. That test does pass locally for me, but we'll need to wait until https://github.com/huggingface/huggingface_hub/pull/505 is merged and a patch is released.

So it's on us now to finish this ;-) I'll ping you here again once the PR is merged!

patrickvonplaten commented 2 years ago

https://github.com/huggingface/huggingface_hub/pull/505 is merged and released on pip. All tests are now passing locally. If this PR looks OK to you, I think it's good to go 🚀 @mikeyshulman @gkucsko