Closed benoit74 closed 6 months ago
@benoit74 I have an idea of how we can approach this issue. In entrypoint.py we can add a parser argument that scrapes videos in all languages (shown in screenshot), and in the error handling we make sure that the user doesn't pass both languages and all-languages at the same time. Then in scraper.py we would need to initialize all_languages in the class and make sure we convert all language queries into TED language codes. Please let me know if I'm on the right track here.
@JeremiahHerring
I don't think we need to add a new parser argument. We can probably just make the language argument optional; if it is not set, it means we do not want a specific language but all videos available (in the selected topic(s) or playlist(s)).
Next, when the language argument is not set, we have to adapt the queries that use this argument (or the derived source_languages attribute, or any other attribute) so that they no longer filter by language. Adding all language codes is too cumbersome and risky (what if a new language appears and we do not support it yet in our list of TED codes?). It is important to adapt both run modes: by playlist and by topic. We also need to ensure that TED multi (where we create one ZIM per playlist or topic) is adapted as well.
Is that clearer? WDYT about it?
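To make the suggestion above concrete, here is a minimal sketch of an optional languages argument and a query that only filters by language when one was given. This is not the actual ted2zim code: build_parser, build_query and the keys of the query dict are hypothetical names used for illustration, and only the --languages and --playlist options come from this thread.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical, trimmed-down parser; the real entrypoint.py has more options.
    parser = argparse.ArgumentParser(prog="ted2zim-sketch")
    parser.add_argument(
        "--languages",
        default=None,  # not set means "do not filter by language"
        help="Comma-separated list of languages; omit to get all languages",
    )
    parser.add_argument("--playlist", default=None)
    return parser


def build_query(playlist, languages):
    """Build hypothetical query parameters, filtering by language only on request."""
    query = {"playlist": playlist}
    # Split the raw option value, ignoring empty entries and stray spaces.
    requested = [part.strip() for part in (languages or "").split(",") if part.strip()]
    if requested:
        query["languages"] = requested
    return query


if __name__ == "__main__":
    args = build_parser().parse_args(["--playlist", "134"])
    print(build_query(args.playlist, args.languages))
```

The point of the design is that no list of "all" language codes is ever built: the filter is simply left out, so a language that TED adds later is picked up without any change.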
@benoit74, I have read your reply to the question and here's what I understand:
When no language is set, automatically download any video that is found. In essence, one should use the length of the source_language attribute as a flag before deciding to ignore or not.
Am I correct?
> When no language is set, automatically download any video that is found. In essence, one should use the length of the source_language attribute as a flag before deciding to ignore or not.
I would prefer to base the decision on the language value rather than source_language, since the former is the real trigger, but you've got the point, yes.
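As an illustration of basing the decision on the language value itself, a video-level check could look like the sketch below. The function name should_keep and the shape of its arguments are assumptions made for this example and are not taken from scraper.py.

```python
from typing import Iterable, List, Optional


def should_keep(video_languages: Iterable[str], requested: Optional[List[str]]) -> bool:
    """Decide whether a video is kept, based on the user-supplied language value."""
    if requested is None:
        # No language was asked for: keep every video that is found.
        return True
    # Otherwise keep the video only if it is available in a requested language.
    return any(lang in requested for lang in video_languages)


if __name__ == "__main__":
    print(should_keep(["en", "fr"], None))
    print(should_keep(["en", "fr"], ["de"]))
```

Testing for None, rather than for the length of a derived list, keeps "nothing requested" distinct from "requested, but nothing matched".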
@benoit74, I have been digging through the code and have made some changes that should address the issue, but I need a bit of clarification. As there is no flag for a dry run, I saved the contents of self.videos with and without languages specified, using the commands ted2zim --playlist=134 --name="the_most_popular_ted_talks_of_all_time" --debug --languages="English,French,German" and ted2zim --playlist=134 --name="the_most_popular_ted_talks_of_all_time" --debug respectively.
videos_with_lang.json videos_without_langs.json
Where I need clarification: looking at the output, there is one video link regardless of whether the --languages option is specified. However, the two outputs differ in the languages and subtitles attributes. Is this the expected behaviour?
I would also like to propose disabling the --subtitles flag, or overriding it to ALL, when no language is specified.
> Where I would require clarification is looking at the output, there is one video link irrespective of if the --language attribute is specified or not. However, they differ in the languages and subtitles attributes. Is this the expected behavior?
It looks normal, yes: getting different values when --languages is not set seems pretty logical.
> Also, I would also like to propose disabling the --subtitles flag or override to ALL when no language is specified
What is the issue if we do not do this? I don't see the problem.
Okay, I think I misunderstood it a little. Still wrapping my head around all the options.
Do not hesitate to keep asking questions, or to speak up if what I'm saying makes no sense: you have the code in front of you, I only have memories.
Okay. Thanks for your assistance
The scraper should support the case where a user wants "all" languages.
For now, this is not possible: the user has to pass the precise list of languages needed.