add support for audio language filter

rgaudin commented 4 years ago

After discussing with @kelson42 it seems that there is eventually a need for having a language filter on the audio (despite what we said on #34).

So as suggested there, we should also have the ability to filter by audio language.

@satyamtg, in the light of youtube#38 and to not keep the params nb too high/overly complex, could you propose a new set of params around the language and subtitles filtering ?

satyamtg commented 4 years ago

@rgaudin, in my opinion, we should combine the params --language and --only-videos-in as passing a language code 2 times in 2 formats (which would mostly be the use case) doesn't make much sense (although this needs to have a conversion from iso639-3 to iso639-1 and this may be complex considering we have country codes required wherever necessary). Also, after that, we can have a common argument say --filterbylang which may take in a list of the following constants -

SUBS - Filter out subtitles based on supplied language
VID - Filter out videos based on supplied language (the thing that we do now)
AUD - Further filter out videos based on the native audio

I think one of VID and AUD would do. Maybe we get better names for these.

Apart from these names sounding weird, what do you feel about this?

rgaudin commented 4 years ago

Hum, I don't think we can do that. If I don't want to filter by language, I can't with this system as I have to supply a language for the ZIM file (or maybe we default it to eng if not supplied).

We already have iso_639_3 to iso_639_1 in zimscraperlib. That should be OK as TED uses 639-1, right?

Also I think this does not account for multiple languages (beside default to unfiltered) and I think we already have tickets for this. Should --language accept a list? If so, the ZIM lang would be mul.

That said, I think your --filterbylang system is not user-friendly enough ; this is not ffmpeg :)

Here's another proposal:

--languages fr,es (optional), otherwise no filtering.
default to only videos with audio matching languages.
--subtitles-enough (boolean) to include videos with matching subtitles
--subtitles all or matching (default) or ar,pt

It fulfills the main requirements:

[x] include all subtitles or only those of LANG?
[x] include all videos or only those that have either a subtitle or audio in LANG?
[x] include all videos that have the audio of LANG

what do you all think?

satyamtg commented 4 years ago

Hum, I don't think we can do that. If I don't want to filter by language, I can't with this system as I have to supply a language for the ZIM file (or maybe we default it to eng if not supplied).

@rgaudin I think we can. just omit passing the --filterbylang . Also, if we don't want to include all videos but we want all subtitles, we could go with --filterbylang=[VID], or if we wanted like filtered subtitles and filtered videos, we could go with --filterbylang=[SUBS, VID]. To be clear it's the presence/absence of these constants that'd define the behavior in what I proposed.

We already have iso_639_3 to iso_639_1 in zimscraperlib. That should be OK as TED uses 639-1, right?

That would work; I meant to go with this scraperlib thing when talking of the conversion. However, to clarify, this won't be enough as there are some languages such as Chinese which have variations, I mean ted supports zh, zh-cn, zh-tw. If we see carefully, zh has fewer videos translated than zh-tw for instance.

Also I think this does not account for multiple languages (beside default to unfiltered) and I think we already have tickets for this. Should --language accept a list? If so, the ZIM lang would be mul.

I was actually in favour of --language taking in a list and zim lang being mul. In my approach I actually wanted to go the ffmpeg way (you got it !)

That said, I think your --filterbylang system is not user-friendly enough ; this is not ffmpeg :)

I agree. It's not that intuitive. Your approach certainly feels better.

So, should we go with combining --languages and what's currently done by --only-videos-in or keep this behaviour separate between --language and --languages?

rgaudin commented 4 years ago

Hum yeah I think we’ll have to introduce some magic as TED is not just iso 639-1. We could be flexible and accept either iso or locale and attempt to manage that. In anyway this would happen on start and fail early if we can’t convert so that ok. If the user request zh-cn we can find what he wants and if he just supplies zh we’d include all chinese variants.

I agree with should definitely merge into a single —language ... given we supply the features mentioned above.

satyamtg commented 4 years ago

Hum yeah I think we’ll have to introduce some magic as TED is not just iso 639-1. We could be flexible and accept either iso or locale and attempt to manage that. In anyway this would happen on start and fail early if we can’t convert so that ok.

If the user request zh-cn we can find what he wants and if he just supplies zh we’d include all chinese variants.

I agree with should definitely merge into a single —language ... given we supply the features mentioned above.

After going through all the languages TED uses (it took some time though :sweat_smile: ), here is a list of languages that ted uses and are not fully ISO639-1 .

Algerian Arabic - arq - ISO639-3
Asturian - ast - ISO639-3
Cebuano - ceb - ISO639-3
Chinese, Simplified - zh-cn - Not fully ISO639
Chinese, Traditional - zh-tw - Not fully ISO639
Filipino - fil - ISO639-3
French (Canada) - fr-ca - Not fully ISO639
Hakha Chin - cnh - ISO639-3 but not present in the iso639 python library that we use
Hupa - hup - ISO639-3
Ingush - inh - ISO639-3
Klingon - tlh - ISO639-3
Latgalian - ltg - ISO639-3 but not present in the iso639 python library that we use
Mauritian Creole - mfe - ISO639-3 but not present in the iso639 python library that we use
Macedo - rup - ISO639-3 - TED has no videos in this language
Montenegrin - srp - ISO639-3
Português brasileiro - pt-br - Not fully ISO639
Rusyn - ry - ISO639-3 is rue but ry is not officially assigned
Serbo-Croatian - sh - Deprecated ISO639-1 (more here)
Silesian - szl - ISO639-3 but not present in the iso639 python library that we use
Swedish Chef - art-x-bork - Really? Is TED serious? - TED has no videos in this language
Tunisian Arabic - aeb - Member of macrolanguage Arabic. ISO639-3

From this list we can conclude that if we allow users to enter any language TED uses, we can find the ISO639-3 equivalent in the following way -

If it is a 2 letter code - convert to ISO639-3 using library
If it is 2 letter then a '-' - convert the first 2 letters into ISO639-3
For the 3 lettered languages, we can use them as is. Maybe a better library would be this in scraperlib as it has all the missing ISO639 languages when setting up localization. However this doesn't have the native script.

In my opinion we can harmonize this behaviour across scrapers by two methods -

--languages takes in arguments from the languages in which the content is available (like what I suggested just above this line)
--languages takes ISO639-3 codes and we have a list of variations (which is scraper specific) in scrapers and get the content in all variations. Eg. - If user enters chi, we have the converted ISO639-1 as zh and we include all the 3 variants zh, zh-tw and zh-cn. We need to modify the data.js a bit to store all the variants of the title and descriptions. In this the user won't be able to get only zh-tw or zh-cn though.

I am in favour of the 2nd method as we have variations in only subtitles. Audio language is never translated by TED. Anyways, for now we can support this using the aforementioned way.

Lastly, we can do some easter egg for art-x-bork but only after we have finished all the higher priority tasks.

What do you suggest @rgaudin ?

rgaudin commented 4 years ago

Thanks for all this.

I don't think we want people to input TED language codes, as much as possible of course. I think we should have a solid mechanism in zimscraperlib to convert user input to 639-3, 639-1 and let the scrapers deal with the remaining details.

That said, it's not possible to convert from ISO to all TED languages as you clearly exposed. I think the --languages help should mention that it accepts locale names, 639-3 and 639-1 and that the user should input what's most relevant. User of languages with important variants like chinese or arabic would know what to input while others would be OK with a simple 639-1.

Having a strong conversion mechanism at the lib level should ease the scraper-specific code of getting what we need. We should thus look into using the iso-639 lib you mentioned for this. I've look it up and despite being old, it seems like a better fit.

My question now is, I understand we are getting localized versions of titles and descriptions. That's great for a single language but what about multiple languages? We could have a language select in the UI if we have multiple languages and that would trigger the change but as you mentioned, that would require a change to the data.js and stuff. Maybe we can just use the first listed language for now and open a ticket for implementing this separately.

satyamtg commented 4 years ago

My question now is, I understand we are getting localized versions of titles and descriptions. That's great for a single language but what about multiple languages? We could have a language select in the UI if we have multiple languages and that would trigger the change but as you mentioned, that would require a change to the data.js and stuff. Maybe we can just use the first listed language for now and open a ticket for implementing this separately.

Actually in the audio_filtering branch that I have pushed, I have taken care of this. Although it's not ready right now but I have done some changes in data.js. For now what I am trying to achieve is as follows -

Previously, the language selection in the UI was based upon subtitle language. Now, since we are changing the subtitle behaviour with much flexible options, I'm changing it to take language passed by the param or subtitle language if no language is passed. Maybe we can later have a subtitle selection dropdown in the UI as the title and description would be taken in the required language and it would be dynamic based on the language dropdown.
For now the --only-videos-in param and --language params are not combined. I think it'd be better if we combine once we have a robust system in scraperlib for languages

rgaudin commented 4 years ago

Ah OK that's great.

I think having a single selector is enough. If I understand correctly, it would filter the content within the UI to only the videos that have audio or subtitle in the given language. That seems like the most reasonable option but the label should be explicit on what the selector does.

As for the second point, I agree we should build that directly on zimscraperlib. We already have the tests use case above in this issue.

openzim / ted

add support for audio language filter #57