openzim / ted

Provide the best of TED.com for offline usage!
https://download.kiwix.org/zim/ted/
GNU General Public License v3.0

Add support for grabbing all videos, no matter the language #171

Closed by benoit74 6 months ago

benoit74 commented 6 months ago

The scraper should support the case where a user wants "all" languages.

For now, this is not possible: the user has to pass the precise list of languages needed.

JeremiahHerring commented 6 months ago

@benoit74 I have an idea of how we can approach this issue. In entrypoint.py we can add a parser argument that scrapes videos in all languages (shown in screenshot), and in the error handling we make sure that the user doesn't pass both the languages argument and the all-languages argument at the same time. Then in scraper.py we would need to initialize all_languages in the class and make sure we convert all language queries into TED language codes. Please let me know if I'm on the right track here.
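A minimal sketch of the proposal above, assuming the entrypoint uses argparse. The `--all-languages` flag name and the `build_parser` helper are hypothetical, chosen only for illustration; the sketch shows how a mutually exclusive group rejects a command line that passes both options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser where a language list and 'all languages' cannot be combined."""
    parser = argparse.ArgumentParser(prog="ted2zim")
    # argparse exits with an error if both options of the group are given.
    group = parser.add_mutually_exclusive_group()
    group.add_argument(
        "--languages",
        help="Comma-separated list of languages to scrape",
    )
    group.add_argument(
        "--all-languages",  # hypothetical flag name, not an existing option
        action="store_true",
        help="Scrape videos in every available language",
    )
    return parser
```

With this layout, passing both options makes argparse print a usage error and exit, so no extra validation code is needed for that case.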

image

benoit74 commented 6 months ago

@JeremiahHerring

I don't think we need to add a new parser argument. We can probably just make the language argument optional: if it is not set, it means we do not want a specific language but all videos available (in the selected topic(s) or playlist(s)).

Next, when the language argument is not set, we have to adapt the queries that use this argument (or the derived source_languages attribute, or any other attribute) so that they no longer filter by language. Adding all language codes is too cumbersome and risky (what if a new language appears that we do not yet support in our list of TED codes?). It is important to adapt both run modes: by playlist and by topic. We also need to ensure that TED multi (where we create one ZIM per playlist or topic) is adapted as well.

Is that clearer? WDYT about it?
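The idea above could be sketched roughly as follows, assuming argparse. The helper names `parse_languages` and `build_query`, and the `topic` / `language` query keys, are hypothetical and do not reflect the scraper's real queries; the point is only that an unset argument becomes `None` and the language filter is then left out of the query entirely.

```python
import argparse
from typing import Dict, List, Optional


def parse_languages(argv: List[str]) -> Optional[List[str]]:
    """Return the requested languages, or None when the argument is not set."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--languages", default=None)
    args = parser.parse_args(argv)
    if not args.languages:
        return None  # not set: the caller wants every language
    return [name.strip() for name in args.languages.split(",") if name.strip()]


def build_query(topic: str, languages: Optional[List[str]]) -> Dict[str, str]:
    """Build query parameters (hypothetical keys), filtering by language only if asked."""
    query = {"topic": topic}
    if languages is not None:
        query["language"] = ",".join(languages)
    return query
```

Because nothing is enumerated when the argument is unset, a language newly added on the TED side would be picked up without any change to a list of codes.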

elfkuzco commented 6 months ago

@benoit74, I have read your reply, and here is what I understand: when no language is set, automatically download any video that is found. In essence, one should use the length of the source_language attribute as a flag before deciding whether to ignore a video or not. Am I correct?

benoit74 commented 6 months ago

> When no language is set, automatically download any video that is found. In essence, one should use the length of the source_language attribute as a flag before deciding to ignore or not.

I would prefer to base the decision on the language value rather than on source_language, since the former is the real trigger, but yes, you've got the point.
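A small sketch of why the language value is the safer trigger, under stated assumptions: the `LanguageFilter` class, its `wants` method, and the `KNOWN_CODES` table are hypothetical and only illustrate the idea. If a requested language maps to no known code, the derived list is empty even though the user did ask for a filter, so keying on the derived list's length would wrongly select everything.

```python
from typing import List, Optional

# Hypothetical name-to-code table, for illustration only.
KNOWN_CODES = {"english": "en", "french": "fr", "german": "de"}


class LanguageFilter:
    def __init__(self, languages: Optional[List[str]] = None):
        # What the user asked for: the real trigger.
        self.languages = languages
        # Derived attribute: may be empty even when languages were requested.
        self.source_languages = [
            KNOWN_CODES[name.lower()]
            for name in (languages or [])
            if name.lower() in KNOWN_CODES
        ]

    def wants(self, video_languages: List[str]) -> bool:
        """Decide on the user-supplied value, not on the derived list."""
        if not self.languages:
            return True  # no language requested: keep every video
        return any(code in self.source_languages for code in video_languages)
```

For example, `LanguageFilter(["Klingon"])` has an empty `source_languages`, yet `wants(["en"])` still returns `False` because a language was explicitly requested.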

elfkuzco commented 6 months ago

@benoit74, I have been digging through the code and have made some changes that should address the issue, but I need a bit of clarification. As there is no flag for a dry run, I had to save the output of self.videos with and without languages specified, using the commands ted2zim --playlist=134 --name="the_most_popular_ted_talks_of_all_time" --debug --languages="English,French,German" and ted2zim --playlist=134 --name="the_most_popular_ted_talks_of_all_time" --debug respectively.

videos_with_lang.json videos_without_langs.json

Where I need clarification: looking at the output, there is one video link regardless of whether the --languages option is specified. However, the two outputs differ in the languages and subtitles attributes. Is this the expected behaviour?
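A comparison like the one described above could be automated with a short script. This is only a sketch: it assumes each dump is a JSON list of objects carrying an `id` key alongside `languages` and `subtitles`, which may not match the actual structure of self.videos, and the helper names `load_dump` and `diff_video_dumps` are hypothetical.

```python
import json
from typing import Any, Dict, List, Sequence


def load_dump(path: str) -> List[Dict[str, Any]]:
    """Load one JSON dump (assumed to be a list of video objects)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def diff_video_dumps(
    with_langs: List[Dict[str, Any]],
    without_langs: List[Dict[str, Any]],
    keys: Sequence[str] = ("languages", "subtitles"),
) -> Dict[Any, List[str]]:
    """Map each video id present in both dumps to the keys whose values differ."""
    by_id = {video["id"]: video for video in without_langs}
    differences: Dict[Any, List[str]] = {}
    for video in with_langs:
        other = by_id.get(video["id"])
        if other is None:
            continue  # video only present in one dump
        changed = [key for key in keys if video.get(key) != other.get(key)]
        if changed:
            differences[video["id"]] = changed
    return differences
```

It could then be called as `diff_video_dumps(load_dump("videos_with_lang.json"), load_dump("videos_without_langs.json"))`.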

I would also like to propose disabling the --subtitles flag, or overriding it to ALL, when no language is specified.

benoit74 commented 6 months ago

> Where I would require clarification is looking at the output, there is one video link irrespective of if the --language attribute is specified or not. However, they differ in the languages and subtitles attributes. Is this the expected behavior?

It looks normal, yes:

> Also, I would also like to propose disabling the --subtitles flag or override to ALL when no language is specified

What is the issue if we do not do this? I don't see the problem.

elfkuzco commented 6 months ago

Okay, I think I misunderstood it a little. Still wrapping my head around all the options.

benoit74 commented 6 months ago

Do not hesitate to keep asking questions, or to speak up if what I'm saying makes no sense: you have the code in front of you, while I only have memories.

elfkuzco commented 6 months ago

Okay. Thanks for your assistance.