kantord / LibreLingo

🐢 🌎 📚 a community-owned language-learning platform
https://librelingo.app

Sourcing audio from Lingua Libre #1464

kantord commented 3 years ago

Lingua Libre is an online collaborative project and tool by the Wikimedia France association, which aims to build a collaborative, multilingual, audiovisual corpus under free license.

Technical details

How to download audio files from Lingua Libre?

There are several options:

  1. Using the datasets: https://lingualibre.org/datasets/
  2. Querying Lingua Libre: https://lingualibre.org/wiki/Help:Querying_Lingua_Libre#Audio_recordings
  3. Extending the Lingua Libre bot: https://github.com/lingua-libre/Lingua-Libre-Bot

3.) is probably not a good option, because it would require introducing a lot of LibreLingo-specific logic into that bot, which could be difficult to maintain. However, if it's compatible with Lingua Libre's features, the bot could be used to fetch lists of missing audio files from LibreLingo. This should be simpler to implement if LibreLingo exposes the list in a publicly hosted CSV file. (cc. @Poslovitch)

I'm wary of option 1.): although it could be the simplest solution, it doesn't feel robust enough, and decoding metadata from the directory structure is a bit quirky. Some of the metadata we need might also be missing there. Besides, these files don't seem to be fully up to date at all times, and this approach would require frequently downloading large amounts of data, which could make things slower and also create extra costs for Lingua Libre. (cc. @Poslovitch)

2.) should be, in my opinion, the main way LibreLingo's pipeline accesses data from Lingua Libre. This endpoint permits querying the data with SPARQL, for example:

select ?record ?recordLabel ?locutorLabel ?languageLabel ?languageLevelLabel
where {
  ?record prop:P2 entity:Q2 .                   # only audio records (instance of Q2)
  ?record prop:P5 ?locutor .                    # the record's speaker
  ?record prop:P4 ?language .                   # the record's language

  ?locutor llp:P4 ?languageStatement .          # the speaker's "spoken languages" statements,
  ?languageStatement llv:P4 ?language .         # restricted to the record's language,
  ?languageStatement llq:P16 ?languageLevel .   # with the speaker's level as a qualifier

  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" .
  }
} ORDER BY ?languageLabel ?languageLevelLabel

though I tried to execute that query and it timed out 😸

I imagine that this could be used to:

But there are 2 questions:

  1. Would it be possible to query a list of files that need to be deleted? Essentially a list of audio files that have been deleted relatively recently (in the last month or so?) so that they are purged from LibreLingo as well.
  2. How to handle files that need to be updated because they were updated in Lingua Libre? Perhaps they could also be included in the purge list? Thus, if we execute the purge operation before the download, the updated files will be automatically re-downloaded.

kantord commented 3 years ago

Hi @Poslovitch, sorry it took me so long to elaborate on the issue. I think I have a rough outline now; let me know what you think about it. I tagged you in the places where I have more specific questions or statements that need your feedback.

If it makes sense from your perspective, then perhaps a next step would be to try out some specific SPARQL queries. I have no experience with SPARQL, so I haven't tried anything specific yet, but I'll look into it when I have time. 😉

Poslovitch commented 3 years ago

> Consider normalizing audio loudness in the pipeline?

It's been discussed on our end in this Phabricator ticket two years ago. From what I can tell, this has just been a suggestion and not much has been done yet. Anyway, yes, that should (and hopefully will) be considered.

> Lingua Libre has its own language codes, hence we'll need to support those in our YAML files. Question: can/should multiple Lingua Libre language codes belong to a single LibreLingo course? I think LibreLingo courses can sometimes be more generic and thus encompass multiple dialects in a single course

Technically, they're not our language codes. We use elements in our Wikibase to represent languages or dialects, so they have their own Qids (like Q21 for French), but in our file naming scheme we actually rely on the language's Wikidata Qid and ISO 639-3 code, if it exists (e.g. Q150 and fra for French).

For forward-compatibility purposes, I'd actually recommend storing the Wikidata Qid. You can then either query Lingua Libre to get the corresponding Qid on Lingua Libre (see the query below), or even query Wikidata to get the BCP 47 code or whatever.

SELECT ?lang WHERE {
  ?lang prop:P2 entity:Q4. # We're querying a language
  ?lang prop:P12 "Q671198". # Wikidata Qid
}
# Returns <https://lingualibre.org/entity/Q521069>, which is the right language I was looking for

Also: keep in mind that not all languages have an IETF BCP 47 code. However, they'd most certainly already have a Wikidata item 😉 (that's what's happening with lorrain).
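
For the "query Wikidata" route, a minimal Python sketch (untested; it uses Wikidata's P305 "IETF language tag" property and the public Wikidata Query Service endpoint):

import requests

# Look up the IETF BCP 47 tag for a language, given its Wikidata Qid.
# P305 is Wikidata's "IETF language tag" property.
def bcp47_for_qid(qid):
    query = f"SELECT ?tag WHERE {{ wd:{qid} wdt:P305 ?tag . }}"
    response = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers={"User-Agent": "librelingo-audio-example/0.1"},
    )
    response.raise_for_status()
    bindings = response.json()["results"]["bindings"]
    return bindings[0]["tag"]["value"] if bindings else None  # None = no tag

print(bcp47_for_qid("Q150"))  # Q150 = French on Wikidata -> "fr"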

> Lingua Libre has its own language codes, hence we'll need to support those in our YAML files. Question: can/should multiple Lingua Libre language codes belong to a single LibreLingo course? I think LibreLingo courses can sometimes be more generic and thus encompass multiple dialects in a single course

Should it occur? I don't know. Can it? Yes, indeed! I don't have a real-world example that comes to mind, but some course writers might indeed consider two "dialects" that have distinct Lingua Libre elements to be variations of the language they're writing a course for 🤷‍♂️.

> 3.) is probably not a good option, because it would require introducing a lot of LibreLingo-specific logic into that bot, which could be difficult to maintain. However, if it's compatible with Lingua Libre's features, the bot could be used to fetch lists of missing audio files from LibreLingo. This should be simpler to implement if LibreLingo exposes the list in a publicly hosted CSV file.

The bot is mainly focused on providing the files to Wikimedia wikis. But even among these wikis, we sometimes have a lot of logic dedicated to just one of them.

You're right though, we could use that bot (or another one) to fetch lists of missing audio files.

Or, if you plan to expose these lists in publicly available files, we could avoid using a bot entirely and implement the list-fetching in our RecordWizard. It's already able to query external services.
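
For illustration, the kind of publicly hosted file we could consume might be as simple as the sketch below (the column names and helper are completely made up, just to show the shape such a list could take):

import csv

# Hypothetical export: one row per phrase that still lacks audio.
# Columns are invented for this sketch: course id, Wikidata Qid of the
# language, and the exact text that needs to be recorded.
def write_missing_audio_csv(rows, path="missing_audio.csv"):
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["course", "language_wikidata_qid", "text"])
        writer.writerows(rows)

write_missing_audio_csv([
    ("french-from-english", "Q150", "L'enfant mange du pain."),
])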

> I'm wary of option 1.): although it could be the simplest solution, it doesn't feel robust enough, and decoding metadata from the directory structure is a bit quirky. Some of the metadata we need might also be missing there. Besides, these files don't seem to be fully up to date at all times, and this approach would require frequently downloading large amounts of data, which could make things slower and also create extra costs for Lingua Libre.

Imho, it's not robust enough on our end, and you'll have to download GBs of data at some point. That's not viable for us (we don't have mirrors), and probably not for you either.

> though I tried to execute that query and it timed out

You were asking the server for data about 500k recordings; of course that's going to time out 😁. To avoid that, we usually filter the recordings by language, locutor and/or (slightly trickier) date of recording.
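
For example, pinning the query to a single language keeps the result set small. A rough Python sketch, reusing the prefixes from the query above and Q21 (French on our side) as the language; the endpoint URL is an assumption, double-check it on Help:Querying_Lingua_Libre:

import requests

# Assumed endpoint path (verify on Help:Querying_Lingua_Libre).
LINGUA_LIBRE_SPARQL = "https://lingualibre.org/bigdata/namespace/wdq/sparql"

# Same shape as the query earlier in the thread, but pinned to one
# language (Q21 = French on Lingua Libre) and capped with a LIMIT.
QUERY = """
SELECT ?record ?recordLabel ?locutorLabel WHERE {
  ?record prop:P2 entity:Q2 .    # only audio records
  ?record prop:P4 entity:Q21 .   # only this one language
  ?record prop:P5 ?locutor .     # the speaker
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 100
"""

response = requests.get(LINGUA_LIBRE_SPARQL, params={"query": QUERY, "format": "json"})
response.raise_for_status()
for row in response.json()["results"]["bindings"]:
    print(row["recordLabel"]["value"], "-", row["locutorLabel"]["value"])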

> Would it be possible to query a list of files that need to be deleted? Essentially a list of audio files that have been deleted relatively recently (in the last month or so?) so that they are purged from LibreLingo as well

That's an interesting question; I'll need to ask our team before I can answer you (please remind me if I forget 😅).

> How to handle files that need to be updated because they were updated in Lingua Libre? Perhaps they could also be included in the purge list? Thus, if we execute the purge operation before the download, the updated files will be automatically re-downloaded.

If a file is updated (i.e. re-recorded), its recording date (P6) is updated too. So if you're querying the audio files by date (which I really recommend doing!), it'll pop up in your "audio files that need to be downloaded" list 😉.
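
So an incremental "what changed since my last sync" query could look roughly like this (untested; same endpoint caveats as above, and the exact xsd:dateTime literal format is worth verifying against real data):

# Build a query for records in a given language whose recording date (P6)
# is on or after the last sync. The date literal format is an assumption
# to verify against the live data.
def new_records_query(language_qid, since_iso):
    return f"""
SELECT ?record ?recordLabel ?date WHERE {{
  ?record prop:P2 entity:Q2 .
  ?record prop:P4 entity:{language_qid} .
  ?record prop:P6 ?date .
  FILTER(?date >= "{since_iso}"^^xsd:dateTime)
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
"""

print(new_records_query("Q21", "2021-01-01T00:00:00Z"))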


Hopefully I've properly answered most of your questions.

I'm really excited to see how this will end up. 🚀

kantord commented 3 years ago

Regarding audio normalization, I think that should probably be part of LibreLingo's own pipeline, because audio files might come from different sources that are pre-normalized at different levels.
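
For example, the normalization step could be a single ffmpeg invocation per downloaded file using ffmpeg's loudnorm filter (EBU R128); the target levels below are placeholder values, not a decision:

import subprocess

# Normalize a file to a common loudness target with ffmpeg's loudnorm
# filter (single-pass). I=-16 LUFS, TP=-1.5 dBTP and LRA=11 are just
# example targets.
def normalize_loudness(src, dst):
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-af", "loudnorm=I=-16:TP=-1.5:LRA=11", dst],
        check=True,
    )

normalize_loudness("raw/example.ogg", "normalized/example.ogg")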

kantord commented 3 years ago

@allcontributors please add @Poslovitch for ideas

allcontributors[bot] commented 3 years ago

@kantord

I've put up a pull request to add @Poslovitch! 🎉

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.