kantord / LibreLingo

🐢 🌎 📚 a community-owned language-learning platform
https://librelingo.app
GNU Affero General Public License v3.0

Import audio data from other free/open-source projects #714

Open ftyers opened 3 years ago

ftyers commented 3 years ago

One of the issues with the "dictation" feature is that it isn't always easy to find data for it.

It would be great to be able to import data from either Common Voice or Tatoeba, or both.

issue-label-bot[bot] commented 3 years ago

Issue-Label Bot is automatically applying the label feature_request to this issue, with a confidence of 0.93. Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback!


kantord commented 3 years ago

One possible challenge I see with importing audio from Tatoeba is that it has fixed audio recordings, instead of TTS.

With TTS, the person editing the course content has complete freedom in choosing the sentence, and the audio magically appears. In order to use Tatoeba, we'd have to come up with some solution for it also in the course editor. (Or maybe TTS could be used as a fallback when there's no exact match?)

kantord commented 3 years ago

I think Common Voice might have the same problem, but I see that there's an existing TTS project for it: https://github.com/mozilla/TTS

Perhaps this could be used to replace Amazon Polly as a TTS engine.

Ideally we'd have support for a set of TTS engines in order to maximize the number of supported languages.

kantord commented 3 years ago

There is also a browser API for speech synthesis: https://developer.mozilla.org/en-US/docs/Web/API/SpeechSynthesis

I imagine that in Firefox it will eventually use Common Voice. That would have the amazing benefit of not having to download the audio files, but ensuring quality might be challenging.

ftyers commented 3 years ago

I don't think that using TTS is a good idea if it is going to be used for language learning. TTS, unless there is a carefully curated corpus, is unlikely to produce good enough output. There would be issues with pronunciation, and especially with prosody and stress. I don't think it's a big problem that the sentences have to be selected from a pre-existing list of sentences. In fact, I think that this is good for learners doing dictation.

Unfortunately, Mozilla has basically dumped their work on speech in their recent round of layoffs. So from here on, their speech projects will, I suppose, be typical free/open-source projects, rather than something with a lot of backing.

Note that, if course designers want to include their own sentences, then one way would be for them to submit them to Tatoeba and to Common Voice. The Common Voice sentence collector is an online and pretty much automatic way of getting sentences into Common Voice. I believe Tatoeba has a similar system.

In my experience with language learning, listening to and dictating real sentences produced by real people in real environmental conditions is much more valuable than trying to listen to a synthetic voice. Basically, I think that using CV and Tatoeba, which keys us into two inclusive communities, is a better plan than relying on TTS, a technology which is available for only a tiny fraction of the world's languages.

kantord commented 3 years ago

> I don't think that using TTS is a good idea if it is going to be used for language learning. TTS, unless there is a carefully curated corpus, is unlikely to produce good enough output.

That's a good point, yeah. For sure, there are a few TTS options that are good enough. Amazon Polly is the one currently used; it's actually the same one Duolingo uses.

For many languages, however, TTS will be out of the question, because the quality will be nowhere near usable. Let alone endangered languages or constructed languages. So I think we can conclude that at some point in the future, being able to use pre-recorded audio will be essential.

kantord commented 3 years ago

> Note that, if course designers want to include their own sentences, then one way would be for them to submit them to Tatoeba and to Common Voice. The Common Voice sentence collector is an online and pretty much automatic way of getting sentences into Common Voice. I believe Tatoeba has a similar system.

This sounds like a wonderful idea.

ftyers commented 3 years ago

> > I don't think that using TTS is a good idea if it is going to be used for language learning. TTS, unless there is a carefully curated corpus, is unlikely to produce good enough output.
>
> That's a good point, yeah. For sure, there are a few TTS options that are good enough. Amazon Polly is the one currently used; it's actually the same one Duolingo uses.

Yeah, unfortunately it isn't free/open-source either, in terms of data or code, so it isn't really usable or extensible for most languages. I would hesitate before introducing features that necessarily result in two tiers of languages: the "with commercial support" and the "without commercial support".

kantord commented 3 years ago

> I would hesitate before introducing features that necessarily result in two tiers of languages: the "with commercial support" and the "without commercial support".

I would also say that our main focus should be supporting features that can be shipped for all languages.

But at the same time, if we are too selective, and really only support things that can provide equal quality for all languages, then we might severely limit the overall feature set :thinking:

ftyers commented 3 years ago

I agree that is a concern, but I think it is OK if we ship free/open-source solutions for, let's say, "advanced" features, because in principle they can be extended to all languages; I just wouldn't want to rely on anything proprietary. And I would build on existing projects that are language-inclusive rather than rely on language-exclusive projects.

davidak commented 3 years ago

https://www.dict.cc/ has recordings, but they are not open.

Maybe you can convince Paul to publish them under an open license. It is made by the community after all.

kantord commented 3 years ago

I am starting to look into this now, because I think we are quite close to opening up course contribution on GitHub.

I am seeing a couple of issues now:

Common Voice

Common Voice is designed to be a training dataset. For each language, there are multiple GBs of data that need to be downloaded as one package. That's far from trivial to deal with.

For starters, storing all of that data in git would not be practical. So we'd need a way to get audio files one-by-one. Then, we'd need some way for course authors to search this database as well.
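To make that concrete, here is a minimal Python sketch (with hypothetical paths, not LibreLingo code) of what fetching clips one-by-one could look like against a locally downloaded Common Voice release, which ships a `validated.tsv` index alongside a `clips/` directory:

```python
import csv
import shutil
from pathlib import Path
from typing import Optional

# Hypothetical location of a downloaded Common Voice release for one language.
CV_ROOT = Path("cv-corpus/es")

def find_clip(sentence: str) -> Optional[Path]:
    """Scan validated.tsv for an exact sentence match and return its clip path."""
    with open(CV_ROOT / "validated.tsv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if row["sentence"].strip() == sentence:
                return CV_ROOT / "clips" / row["path"]
    return None

clip = find_clip("Hola, ¿cómo estás?")
if clip is not None:
    # Copy only the single matching clip into the course, not the whole dataset.
    shutil.copy(clip, "hola.mp3")
```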

Tatoeba

They don't really have the issues of Common Voice, but they also don't have an official API. It's unclear if it would be ethical to automate fetching sentences from their website.

Here's a little guide: https://en.wiki.tatoeba.org/articles/show/faq#does-tatoeba-provide-an-api?

So they have a somewhat messy situation with licensing audio, but they publish a file that makes it possible to automate license verification. It's a roughly 40 MB file, which means it could even be shipped with a linting tool.

Attributions are required for most audio recordings (as well as sentences). This could be automated too, because the license file also contains usernames and attribution URLs.
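For illustration, a rough Python sketch of such automated verification, driven by Tatoeba's published audio metadata export; the tab-separated column order assumed below should be double-checked against the current export format:

```python
import csv

# Licenses a course is willing to accept (illustrative set, not a policy).
ALLOWED_LICENSES = {"CC BY 4.0", "CC BY-NC 4.0"}

def load_usable_audio(path: str = "sentences_with_audio.csv") -> dict:
    """Map sentence id -> (username, license, attribution URL) for reusable audio."""
    usable = {}
    with open(path, newline="", encoding="utf-8") as f:
        # Assumed column order: sentence id, username, license, attribution URL.
        for sentence_id, username, license_name, attribution_url in csv.reader(f, delimiter="\t"):
            if license_name in ALLOWED_LICENSES:
                usable[int(sentence_id)] = (username, license_name, attribution_url)
    return usable
```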

However, the question is: what is the best way to show those attributions in a way that actually serves the author of the audio? Where would these attributions be displayed? It would potentially mean thousands of attributions. And such a list of attributions would also "trickle down" to any other project that reuses LibreLingo.

kantord commented 3 years ago

@davidak thanks! This seems like a good idea too!

ftyers commented 3 years ago

For Common Voice, the data would need to be stored somewhere. In terms of having course developers search, the text of the sentences is in a GitHub repository. Also note that for many of the languages there are not multiple gigabytes of data; that's just the big ones, which potentially have different or better solutions.

About the attributions, I think it is fine to have them accessible via an icon, e.g. you can either click or hover over a small icon at the side of the recording to get the attribution.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

sinhalaquiz commented 2 years ago

> > Note that, if course designers want to include their own sentences, then one way would be for them to submit them to Tatoeba and to Common Voice. The Common Voice sentence collector is an online and pretty much automatic way of getting sentences into Common Voice. I believe Tatoeba has a similar system.
>
> This sounds like a wonderful idea.

Can I suggest another alternative? How about allowing course creators to publish pre-recorded audio (in a ZIP file) in their preferred location?

I see the following advantages:

  1. The course creators gain full control of their course content
  2. Languages that need to teach an alphabet/pronunciation guide will require sound fragments and 'bogus words' to properly teach the phoneme system. This would be impossible with a TTS engine and not desirable for Common Voice.

The contents of the zip file could be something like this:

```
./manifest.toml                      -- specifies info such as language id, content version and content layout version
./<truncated sha256>/voice/<audio file in whatever format>
...
```

The purpose of the truncated SHA-256 is to limit the number of files in the "root directory", and it may or may not be needed. This file can be fetched and unzipped locally. Missing phrases can generate an error (after all, it's the course creator's data).
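A quick Python sketch of that sharding idea (the two-character truncation and the file naming are illustrative choices, not part of the proposal):

```python
import hashlib
from pathlib import Path

# Shard audio files by a truncated SHA-256 of the phrase so that no single
# directory grows too large. Two hex characters gives up to 256 shards.
def audio_path(phrase: str, root: Path = Path(".")) -> Path:
    digest = hashlib.sha256(phrase.encode("utf-8")).hexdigest()
    return root / digest[:2] / "voice" / f"{digest}.mp3"

# e.g. audio_path("ධෛ") -> ./<2-char shard>/voice/<full digest>.mp3
```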

kantord commented 2 years ago

> Languages that need to teach an alphabet/pronunciation guide will require sound fragments and 'bogus words' to properly teach the phoneme system. This would be impossible with a TTS engine and not desirable for Common Voice.

Thanks for pointing this out. This is very important input and a very important point. Are we sure, though, that Common Voice doesn't accept this use case? Afaik one of the main uses for Common Voice is to demonstrate pronunciation in Wikipedia 🤔

kantord commented 2 years ago

> The course creators gain full control of their course content

I agree with the sentiment that course creators should have full control over the course content; however, needing to manually deal with audio files should ideally be a rare exception, rather than the rule.

I'd like to point out that it's already possible to manually "fix" audio files that are generated by TTS or sourced from elsewhere.

I think multiple sources should be allowed for each course, such as:

* different TTS providers
* tatoeba
* common voice
* any other project that has a lot of voice files

In addition to this, I think there's the possibility to enable users to save audio files in their own repos.

I don't think that ZIP files are the solution for this, as they would be large, and I think ZIP files cannot be diffed by git. Meaning, if you have a 200 MB ZIP file and you change it 200 times, then you have 40 GB of different versions 🤔 Plus, having to deal with the ZIP file is an additional difficulty for people who work on the course.

In addition, ZIP is probably not very efficient for storing audio files that are already well compressed.

Rather, I think the solution could be something like this.

in the skill file:

```yaml
# ... etc ...
- Character: "ධෛ"
  Audio: characters/dhai.mp3
```

and then you can create a file in the course repo at audios/characters/dhai.mp3.

What do you think?
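A minimal sketch of the lookup order this implies, assuming an optional Audio field in the skill file; `resolve_audio()` and `fetch_tts()` are hypothetical names, not existing LibreLingo functions:

```python
from pathlib import Path
from typing import Optional

def fetch_tts(phrase: str) -> Path:
    """Placeholder for the existing TTS generation step (Amazon Polly today)."""
    raise NotImplementedError

def resolve_audio(course_dir: Path, phrase: str, audio_field: Optional[str]) -> Path:
    # Prefer a hand-made recording shipped in the course repo under audios/...
    if audio_field:
        override = course_dir / "audios" / audio_field
        if override.exists():
            return override
    # ...and fall back to generated TTS when no file is provided.
    return fetch_tts(phrase)
```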

davidak commented 2 years ago

> Afaik one of the main uses for Common Voice is to demonstrate pronunciation in Wikipedia 🤔

The main use for Common Voice is training speech recognition software like https://github.com/coqui-ai/STT.

It's kind of a dirty dataset, since background noise is encouraged. That helps to train recognition in such scenarios, but it is not great for other uses.

As said earlier, dict.cc has recordings made specifically for learning pronunciation.

sinhalaquiz commented 2 years ago

> I think multiple sources should be allowed for each course, such as:
>
> * different TTS providers
> * tatoeba
> * common voice
> * any other project that has a lot of voice files
>
> In addition to this, I think there's the possibility to enable users to save audio files in their own repos.

I was suggesting a design that fits right into the existing system. The ZIP file (which is an implementation detail, BTW) is stored on a completely different server of the course creator's choosing (Google Drive, a static page on GitHub, whatever). It's in effect like a cached TTS provider.

```
                                   +-------------------+
                                   |                   |
                                   |    Some server    |
                                   |                   |
                                   |    data.zip       |
                                   |                   |
                                   +--------+----------+
                                            ^
                                            |
                                            |
                                            |
+--------------------+             +--------+---------+
|                    |             |                  |
|  Librelingo repo   +------------>+  Language Course |
|                    |             |                  |
+--------------------+             +------------------+
```

> I don't think that ZIP files are the solution for this, as they would be large, and I think ZIP files cannot be diffed by git. Meaning, if you have a 200 MB ZIP file and you change it 200 times, then you have 40 GB of different versions 🤔 Plus, having to deal with the ZIP file is an additional difficulty for people who work on the course.
>
> In addition, ZIP is probably not very efficient for storing audio files that are already well compressed.

It doesn't have to be a ZIP. It can also be a site hosting individual audio files that LibreLingo can fetch if required. Actually, that might be better, because downloading 40 GB every time a phrase changes is bad. So yes, you're right: a ZIP is not the solution.

> Rather, I think the solution could be something like this.
>
> in the skill file:
>
> ```yaml
> # ... etc ...
> - Character: "ධෛ"
>   Audio: characters/dhai.mp3
> ```
>
> and then you can create a file in the course repo at audios/characters/dhai.mp3.
>
> What do you think?

So the way I look at it, the Skill file doesn't change. The course file now specifies a different preferred audio provider.

Something like this (my YAML may be wrong):

```yaml
Settings:
  Audio:
    Enabled: True
    Providers:
      - Static:
          url: drive.google.com/characters/dhai.mp3
```

And we have another provider in apps/librelingo_audios/librelingo_audios/update_audios.py.
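A rough sketch of what such a Static provider could look like; the class name and `fetch()` interface are hypothetical, and the real integration point would be whatever contract update_audios.py actually expects:

```python
import urllib.request
from pathlib import Path

class StaticAudioProvider:
    """Fetch pre-recorded audio files from a base URL configured in the course settings."""

    def __init__(self, base_url: str):
        self.base_url = base_url.rstrip("/")

    def fetch(self, relative_path: str, destination: Path) -> Path:
        # Download a single file on demand, mirroring the course's layout locally.
        destination.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(f"{self.base_url}/{relative_path}", destination)
        return destination

# Usage: pull one clip from the course creator's static host (URL is made up).
provider = StaticAudioProvider("https://example.org/course-audio")
provider.fetch("characters/dhai.mp3", Path("audios/characters/dhai.mp3"))
```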

Bouaziz-aitd commented 8 months ago

I am trying to bring this discussion back and move toward implementation in LibreLingo. I have considered two options that I could effectively test. In summary, I tested i) using the Tatoeba database and ii) using the MMS TTS model (as you may know, this is a Meta initiative).

i) Tatoeba dataset

I worked with the Tatoeba team and started contributing by creating a set of sentences that may be of interest to my KAB course. It is important to understand (and I did not know this before I contributed) that it is not permitted to upload single-word sentences unless the one-word sentence makes sense on its own, such as "Hello!" and the like. Every sentence needs to have punctuation. In any case, sentences can be saved there and worked with through an API call from Python. I did several tests on existing sentences that have audio included, as well as on my own sentences, and it works perfectly!

It is important to register the contributions with an adequate license to allow LibreLingo to use them without any breach of rights. The contributed sentences are put under CC BY 2.0 FR, while the license for the audio is left to the contributor's discretion. There is a possibility to contribute under LibreLingo's chosen license, CC BY-SA 4.0 (this is what I have seen in the LL repository files).

In my testing I tried to work around the single-word restriction. I wrote a Python script that splits the words out of an audio sentence. The result is not perfect and not consistent, because it depends a lot on the length of the silence between the words. To ease the work a bit, I tried to record sentences with the first word followed by a little silence. This may work, but it is extremely important that the narrator (audio contributor) makes the silence long enough for the split to succeed. The downside is that it is sometimes difficult to produce a natural audio sentence with a decent silence after the first word. I started thinking about having a kind of exclamation tone after the first word to mark the gap with the following words, but that is not straightforward. This can be worked out later if we decide to go this way. The easiest alternative is to avoid one-word dictation in LibreLingo, at least in the beginning, until we proof-test the solution.

The advantage of using Tatoeba is that the audio is natural (human dictation) and can be produced for any language that does not necessarily have a TTS developed. In anticipation, I started building a register of those sentences with their registration IDs in the dataset; the ID number is used to read the audio from Tatoeba. For the LibreLingo implementation, I think kantord already had some ideas based on text mapping: when the user hits the speak button, the corresponding sentence is looked up in the list and its ID is used to connect to Tatoeba and download the mp3 file.
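A small Python sketch of that speak-button flow (the register contents are made up, and the per-sentence audio URL pattern is an assumption about Tatoeba's public audio host that should be verified):

```python
import urllib.request

# Hypothetical register mapping course sentences to Tatoeba sentence IDs.
REGISTER = {"Azul fell-awen!": 1234567}

def fetch_tatoeba_audio(sentence: str, lang: str = "kab") -> str:
    """Look up a sentence's Tatoeba ID and download its mp3 on demand."""
    sentence_id = REGISTER[sentence]
    # Assumed URL pattern for Tatoeba's audio host — verify before shipping.
    url = f"https://audio.tatoeba.org/sentences/{lang}/{sentence_id}.mp3"
    filename = f"{sentence_id}.mp3"
    urllib.request.urlretrieve(url, filename)
    return filename
```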

ii) Massively Multilingual Speech (MMS TTS) of Meta

This project has targeted more than 4,000 languages around the world. They have done a good deal of work for some languages, but not for all of them. At this time they have trained more than a thousand languages, among them KAB, with the equivalent of 32 hours of data each. The latter has been trained both for text-to-speech conversion and for speech recognition; I focused on the TTS. They have a tutorial on GitHub here: https://github.com/facebookresearch/fairseq/blob/main/examples/mms/tts/tutorial/MMS_TTS_Inference_Colab.ipynb

I was curious and looked at some of the languages that are in LibreLingo, such as Occitan and Basque. The former is not trained yet, but the latter is. I first tried the TTS on Google Colab for KAB, and the model produced decent audio in WAV format (I copied a script, shared on the web, that works for all the languages). I then wanted to prove the concept by including this model in my Python script outside Colab, and I managed to make it work. At this point I can write a sentence and let the program create a WAV file that I can play. The result is not perfect, but it works with good precision in about 95% of cases. I noticed that composed sentences perform better than single words. There is some lag, though, as the conversion takes time. It should be noted that this solution requires downloading the model, a zipped file that has to be unzipped to some location, but that is done only once, when the model is installed.
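For reference, a short sketch of running MMS TTS outside Colab via the Hugging Face port of the checkpoints; the `facebook/mms-tts-kab` model id follows the published naming pattern but should be treated as an assumption:

```python
import torch
import scipy.io.wavfile
from transformers import VitsModel, AutoTokenizer

# Load the MMS TTS checkpoint for one language (downloaded once, then cached).
model = VitsModel.from_pretrained("facebook/mms-tts-kab")
tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-kab")

# Synthesize a sentence and write it out as a WAV file.
inputs = tokenizer("Azul fell-awen!", return_tensors="pt")
with torch.no_grad():
    waveform = model(**inputs).waveform[0]

scipy.io.wavfile.write("output.wav", rate=model.config.sampling_rate, data=waveform.numpy())
```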

This summary is an excerpt from the discussion that took place in Element, on the Matrix platform.

kantord commented 6 months ago

> it is not permitted to upload single-word sentences unless the one-word sentence makes sense on its own, such as "Hello!" and the like

OK, it is a very important piece of information that Tatoeba does not allow single-word sentences, as those would still be required for courses, I think. Maybe it's not essential though, and it's still better than what we have now.

> The easiest alternative is to avoid one-word dictation in LibreLingo, at least in the beginning, until we proof-test the solution.

I think this is totally OK.

> The advantage of using Tatoeba is that the audio is natural (human dictation) and can be produced for any language that does not necessarily have a TTS developed.

I agree with this, and I think it would still have an advantage even when used in addition to TTS.

> https://github.com/facebookresearch/fairseq/blob/main/examples/mms/tts/tutorial/MMS_TTS_Inference_Colab.ipynb

This is amazing! I wonder about the training process: how costly could it be to train it for a new language? Anyway, at this point I think this could be a good replacement for the current TTS. Could we include the code from your notebook file in the audio fetcher? That would mean we could generate audio files without an AWS account, which I think would also make it easier for people to try it locally.

Bouaziz-aitd commented 6 months ago

"how costly it could be to train it for a new language." From the documents I read, they used about 32 hours of speech to train the model. I don't know how but, I think it should not be too complicated. The most important item is that someone needs clean, quality dataset for a given language in order to get a good TTS. Anyhow, I can provide the Python script I used for my standalone tests. Let me know how I can provide a copy of it.