kantord / LibreLingo

🐢 🌎 📚 a community-owned language-learning platform
https://librelingo.app

Sourcing audio from Lingua Libre #1464

kantord commented 3 years ago

Lingua Libre is an online collaborative project and tool by the Wikimedia France association, which aims to build a collaborative, multilingual, audiovisual corpus under free license.

Technical details

How to download audio files from Lingua Libre?

There are several options:

  1. Using the datasets: https://lingualibre.org/datasets/
  2. Querying Lingua Libre: https://lingualibre.org/wiki/Help:Querying_Lingua_Libre#Audio_recordings
  3. Extending the Lingua Libre bot: https://github.com/lingua-libre/Lingua-Libre-Bot

3.) is probably not a good option, because it would require introducing a lot of LibreLingo-specific logic into that bot, which could be difficult to maintain. However, if it's compatible with Lingua Libre's features, the bot could be used to fetch lists of missing audio files from LibreLingo. This should be simpler to implement if LibreLingo exposes the list in a publicly hosted CSV file. (cc. @Poslovitch)

I'm wary of option 1.): although it could be the simplest solution, it doesn't feel robust enough, and decoding metadata from the directory structure is a bit quirky. Some of the metadata we need might also be missing there. Besides, these files don't seem to be fully up to date at all times, and this approach would require frequently downloading large amounts of data, which could make things slower and also create extra costs for Lingua Libre. (cc. @Poslovitch)

2.) should be, in my opinion, the main way LibreLingo's pipeline accesses data from Lingua Libre. This endpoint permits querying the data with SPARQL, for example:

select ?record ?recordLabel ?locutorLabel ?languageLabel ?languageLevelLabel
where {
  ?record prop:P2 entity:Q2 .                   # only audio records (instance of Q2)
  ?record prop:P5 ?locutor .                    # the record's speaker
  ?record prop:P4 ?language .                   # the record's language

  ?locutor llp:P4 ?languageStatement .          # the speaker's "spoken languages" statements,
  ?languageStatement llv:P4 ?language .         # restricted to the record's language,
  ?languageStatement llq:P16 ?languageLevel .   # with the speaker's level as a qualifier

  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en" .
  }
} ORDER BY ?languageLabel ?languageLevelLabel

though I tried to execute that query and it timed out 😸

I imagine that this could be used to:

But there are 2 questions:

  1. Would it be possible to query a list of files that need to be deleted? Essentially a list of audio files that have been deleted relatively recently (in the last month or so?) so that they are purged from LibreLingo as well.
  2. How to handle files that need to be updated because they were updated in Lingua Libre? Perhaps they could also be included in the purge list? Thus, if we execute the purge operation before the download, the updated files will be automatically re-downloaded.

kantord commented 3 years ago

Hi @Poslovitch, sorry it took me so long to elaborate on the issue. I think I have a rough outline now; let me know what you think about it. I tagged you in the places where I have more specific questions or statements that need your feedback.

If it makes sense from your perspective, then perhaps a next step would be to try out some specific SPARQL queries. I have no experience with SPARQL, so I haven't tried anything specific yet, but I'll look into it when I have time. 😉

Poslovitch commented 3 years ago

> Consider normalizing audio loudness in the pipeline?

It's been discussed on our end in this Phabricator ticket two years ago. From what I can tell, this has just been a suggestion and not much has been done yet. Anyway, yes, that should (and hopefully will) be considered.

> Lingua Libre has its own language codes, hence we'll need to support those in our YAML files. Question: can/should multiple Lingua Libre language codes belong to a single LibreLingo course? I think LibreLingo courses can sometimes be more generic and thus encompass multiple dialects in a single course

Technically, they're not our language codes. We use elements in our Wikibase to represent languages or dialects, so they have their own Qids (like Q21 for French), but in our file naming scheme we actually rely on the language's Wikidata Qid and ISO 639-3 code, if it exists (e.g. Q150 and fra for French).

For forward-compatibility purposes, I'd actually recommend storing the Wikidata Qid. You can then either query Lingua Libre to get the corresponding Qid on Lingua Libre (see the query below), or even query Wikidata to get the BCP 47 code or whatever.

SELECT ?lang WHERE {
  ?lang prop:P2 entity:Q4. # We're querying a language
  ?lang prop:P12 "Q671198". # Wikidata Qid
}
# Returns <https://lingualibre.org/entity/Q521069>, which is the right language I was looking for

Also: keep in mind that not all languages have an IETF BCP 47 code. However, they'd most certainly already have a Wikidata item 😉 (that's what's happening with lorrain).
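
For the "query Wikidata" route, a minimal Python sketch (untested; it uses Wikidata's P305 "IETF language tag" property and the public Wikidata Query Service endpoint):

import requests

# Look up the IETF BCP 47 tag for a language, given its Wikidata Qid.
# P305 is Wikidata's "IETF language tag" property.
def bcp47_for_qid(qid):
    query = f"SELECT ?tag WHERE {{ wd:{qid} wdt:P305 ?tag . }}"
    response = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers={"User-Agent": "librelingo-audio-example/0.1"},
    )
    response.raise_for_status()
    bindings = response.json()["results"]["bindings"]
    return bindings[0]["tag"]["value"] if bindings else None  # None = no tag

print(bcp47_for_qid("Q150"))  # Q150 = French on Wikidata -> "fr"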

> Lingua Libre has its own language codes, hence we'll need to support those in our YAML files. Question: can/should multiple Lingua Libre language codes belong to a single LibreLingo course? I think LibreLingo courses can sometimes be more generic and thus encompass multiple dialects in a single course

Should it occur? I don't know. Can it? Yes, indeed! I don't have a real-world example that comes to mind, but some course writers might indeed consider two "dialects" that have distinct Lingua Libre elements to be variations of the language they're writing a course for 🤷‍♂️.

> 3.) is probably not a good option, because it would require introducing a lot of LibreLingo-specific logic into that bot, which could be difficult to maintain. However, if it's compatible with Lingua Libre's features, the bot could be used to fetch lists of missing audio files from LibreLingo. This should be simpler to implement if LibreLingo exposes the list in a publicly hosted CSV file.

The bot is mainly focused on providing the files to Wikimedia wikis. But even among these wikis, we sometimes have a lot of logic dedicated to just one of them.

You're right though, we could use that bot (or another one) to fetch lists of missing audio files.

Or, if you plan to expose these lists in publicly available files, we could avoid using a bot entirely and implement the list-fetching in our RecordWizard. It's already able to query external services.
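
For illustration, the kind of publicly hosted file we could consume might be as simple as the sketch below (the column names and helper are completely made up, just to show the shape such a list could take):

import csv

# Hypothetical export: one row per phrase that still lacks audio.
# Columns are invented for this sketch: course id, Wikidata Qid of the
# language, and the exact text that needs to be recorded.
def write_missing_audio_csv(rows, path="missing_audio.csv"):
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["course", "language_wikidata_qid", "text"])
        writer.writerows(rows)

write_missing_audio_csv([
    ("french-from-english", "Q150", "L'enfant mange du pain."),
])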

> I'm wary of option 1.): although it could be the simplest solution, it doesn't feel robust enough, and decoding metadata from the directory structure is a bit quirky. Some of the metadata we need might also be missing there. Besides, these files don't seem to be fully up to date at all times, and this approach would require frequently downloading large amounts of data, which could make things slower and also create extra costs for Lingua Libre.

Imho, it's not robust enough on our end, and you'll have to download GBs of data at some point. That's not viable for us (we don't have mirrors), and probably not for you either.

> though I tried to execute that query and it timed out

You were asking the server for data about 500k recordings; of course that's going to time out 😁. To avoid that, we usually filter the recordings by language, locutor and/or (slightly trickier) date of recording.
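
For example, pinning the query to a single language keeps the result set small. A rough Python sketch, reusing the prefixes from the query above and Q21 (French on our side) as the language; the endpoint URL is an assumption, double-check it on Help:Querying_Lingua_Libre:

import requests

# Assumed endpoint path (verify on Help:Querying_Lingua_Libre).
LINGUA_LIBRE_SPARQL = "https://lingualibre.org/bigdata/namespace/wdq/sparql"

# Same shape as the query earlier in the thread, but pinned to one
# language (Q21 = French on Lingua Libre) and capped with a LIMIT.
QUERY = """
SELECT ?record ?recordLabel ?locutorLabel WHERE {
  ?record prop:P2 entity:Q2 .    # only audio records
  ?record prop:P4 entity:Q21 .   # only this one language
  ?record prop:P5 ?locutor .     # the speaker
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 100
"""

response = requests.get(LINGUA_LIBRE_SPARQL, params={"query": QUERY, "format": "json"})
response.raise_for_status()
for row in response.json()["results"]["bindings"]:
    print(row["recordLabel"]["value"], "-", row["locutorLabel"]["value"])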

> Would it be possible to query a list of files that need to be deleted? Essentially a list of audio files that have been deleted relatively recently (in the last month or so?) so that they are purged from LibreLingo as well

That's an interesting question; I'll need to ask our team before I can answer you (please remind me if I forget 😅).

> How to handle files that need to be updated because they were updated in Lingua Libre? Perhaps they could also be included in the purge list? Thus, if we execute the purge operation before the download, the updated files will be automatically re-downloaded.

If a file is updated (i.e. re-recorded), its recording date (P6) is updated too. So if you're querying the audio files by date (which I really recommend doing!), it'll pop up in your "audio files that need to be downloaded" list 😉.
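
So an incremental "what changed since my last sync" query could look roughly like this (untested; same endpoint caveats as above, and the exact xsd:dateTime literal format is worth verifying against real data):

# Build a query for records in a given language whose recording date (P6)
# is on or after the last sync. The date literal format is an assumption
# to verify against the live data.
def new_records_query(language_qid, since_iso):
    return f"""
SELECT ?record ?recordLabel ?date WHERE {{
  ?record prop:P2 entity:Q2 .
  ?record prop:P4 entity:{language_qid} .
  ?record prop:P6 ?date .
  FILTER(?date >= "{since_iso}"^^xsd:dateTime)
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
"""

print(new_records_query("Q21", "2021-01-01T00:00:00Z"))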


Hopefully I've properly answered most of your questions.

I'm really excited to see how this will end up. 🚀

kantord commented 3 years ago

Regarding audio normalization, I think that should probably be part of LibreLingo's own pipeline, because audio files might come from different sources that are pre-normalized at different levels.
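
For example, the normalization step could be a single ffmpeg invocation per downloaded file using ffmpeg's loudnorm filter (EBU R128); the target levels below are placeholder values, not a decision:

import subprocess

# Normalize a file to a common loudness target with ffmpeg's loudnorm
# filter (single-pass). I=-16 LUFS, TP=-1.5 dBTP and LRA=11 are just
# example targets.
def normalize_loudness(src, dst):
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-af", "loudnorm=I=-16:TP=-1.5:LRA=11", dst],
        check=True,
    )

normalize_loudness("raw/example.ogg", "normalized/example.ogg")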

kantord commented 3 years ago

@allcontributors please add @Poslovitch for ideas

allcontributors[bot] commented 3 years ago

@kantord

I've put up a pull request to add @Poslovitch! 🎉

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.