linto-ai / linto-stt

An automatic speech recognition API

GNU Affero General Public License v3.0


[WIP] Add language selection through config with whisper + Improve tests #48

Open AudranBert opened 1 week ago

AudranBert commented 1 week ago

Add language selection for streaming with Whisper. By default, it takes the language found in the env settings, but you can pass a language in the config when starting streaming.

It also makes it possible to pass a language in the config for offline decoding, as requested in #53. This lets a single model instance serve multiple languages instead of launching another Docker container.
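The fallback described above (language from the config if given, otherwise from the env settings) could be sketched as follows. This is only an illustration, not the code of this PR: the function name `resolve_language`, the `"language"` config key, and the `LANGUAGE` environment variable name are assumptions made for the example.

```python
import os


def resolve_language(config, environ=os.environ, env_key="LANGUAGE"):
    """Pick the language from the request config, else from the environment.

    Hypothetical helper: the "language" key and the LANGUAGE variable
    name are assumptions for this sketch, not taken from the PR.
    """
    # A language passed in the config takes precedence.
    language = (config or {}).get("language")
    if language:
        return language
    # Otherwise fall back to the environment setting (None if unset).
    return environ.get(env_key)


# The same process can then handle requests in different languages.
print(resolve_language({"language": "en"}, {"LANGUAGE": "fr"}))
print(resolve_language({}, {"LANGUAGE": "fr"}))
```

With this kind of lookup, one running instance can serve an English request and a French request back to back, which is the point of the offline use case.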

The PR also improves the tests by adding language tests and removing some useless ones to reduce testing duration.

damienlaine commented 1 day ago

Could you clarify the list of supported languages? For example, does it include "en", "fr", etc.? On the LinTO side, we consistently use BCP-47 codes for language representation. Parsers (env, API directives...) shall at least support BCP-47 codes as inputs.

Jeronymous commented 1 day ago

> Could you clarify the list of supported languages? For example, does it include "en", "fr", etc.? On the LinTO side, we consistently use BCP-47 codes for language representation. Parsers (env, API directives...) shall at least support BCP-47 codes as inputs.

That did not change in this PR. Several formats are supported: "fr" and "fr-FR". This holds for the whole LinTO speech toolkit.

Supported languages are listed here: https://github.com/linto-ai/linto-stt/blob/master/whisper/README.md#language

Also, if the user gives an unsupported language, an explicit error message lists the possible ones (in the format "fr").
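As a rough illustration of that behaviour, a check of this kind might look like the sketch below. The name `check_language`, the error wording, and the short `SUPPORTED_LANGUAGES` tuple are all hypothetical; the real list is the one in the README linked above.

```python
# Illustrative subset only; the actual list lives in the whisper README.
SUPPORTED_LANGUAGES = ("de", "en", "es", "fr")


def check_language(language):
    """Return the language if supported, else raise an explicit error."""
    if language not in SUPPORTED_LANGUAGES:
        # The message lists every accepted code in the short "fr" format.
        raise ValueError(
            f"Unsupported language '{language}'. "
            f"Supported languages: {', '.join(SUPPORTED_LANGUAGES)}"
        )
    return language


try:
    check_language("xx")
except ValueError as err:
    print(err)
```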

Why this question? Do you think something is missing in the code or the documentation?

damienlaine commented 1 day ago

I haven't reviewed the code; I relied on the docs:

- The docs mention "two or three-letter codes" for languages but not BCP-47 tags. Should this be clarified?
- The PR focuses on streaming (?), but what about Celery (task) and HTTP service modes? Are specification updates planned for these?
- For Celery, should we open an issue in https://github.com/linto-ai/linto-transcription to handle the target language correctly?

AudranBert commented 23 hours ago

> The PR focuses on streaming (?), but what about Celery (task) and HTTP service modes? Are specification updates planned for these?

The PR was created to fix language selection in streaming, but I also added the possibility to send the language through the config for both streaming and offline (HTTP and task) modes. That's why I linked this PR to issue #53.

AudranBert commented 21 hours ago

> The docs mention "two or three-letter codes" for languages but not BCP-47 tags. Should this be clarified?

It should work with tags like "fr-FR" because the code splits on the "-" and keeps the first part (here "fr") as the language.
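A minimal sketch of that split, assuming a helper named `primary_subtag` (a hypothetical name, not the function used in the repository):

```python
def primary_subtag(tag):
    """Keep only the part of a language tag before the first "-"."""
    # "fr-FR" -> ["fr", "FR"] -> "fr"; a bare "fr" is returned unchanged.
    return tag.split("-")[0]


for tag in ("fr-FR", "fr", "en-US"):
    print(tag, "->", primary_subtag(tag))
```

Note that this only drops the region part; it does not check that the remaining code is a supported language.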

Jeronymous commented 21 hours ago

> The docs mention "two or three-letter codes" for languages but not BCP-47 tags. Should this be clarified?

Yes, we should mention that they are supported, but that the second part ("FR" in "fr-FR") is ignored (the model's results do not depend on it).

> The PR focuses on streaming (?), but what about Celery (task) and HTTP service modes? Are specification updates planned for these?

Yes. The PR is not finished yet (hence "WIP" in the title).

> For Celery, should we open an issue in https://github.com/linto-ai/linto-transcription to handle the target language correctly?

Yes. There will be an issue for that feature request. Worst case, I will open it when I commit the related changes (mentioning the issue in the commit message, a practice we agreed to use as much as possible). Our plan is to split the work: Audran here on the core STT, and me on the transcription service API evolution.