openzim / ted

Provide the best of TED.com for offline usage!
https://download.kiwix.org/zim/ted/
GNU General Public License v3.0

TED scraper seems to be broken #3

Closed by kelson42 6 years ago

kelson42 commented 6 years ago

... due to changes to the TED website?!

$ ./ted.py 
Traceback (most recent call last):
  File "./ted.py", line 89, in <module>
    App()
  File "./ted.py", line 18, in __init__
    self.run()
  File "./ted.py", line 50, in run
    scraper.extract_all_video_links()
  File "/media/kelson/SOTOKI/ted/scraper/webscraper.py", line 81, in extract_all_video_links
    self.extract_videos()
  File "/media/kelson/SOTOKI/ted/scraper/webscraper.py", line 94, in extract_videos
    self.extract_video_info(url)
  File "/media/kelson/SOTOKI/ted/scraper/webscraper.py", line 117, in extract_video_info
    speaker = json_data['talks'][0]['speaker']
KeyError: 'talks'
kelson42 commented 6 years ago

@rashiq Would you have time to have a look?

rashiq commented 6 years ago

@kelson will do over the weekend!

kelson42 commented 6 years ago

@rashiq Great, let me know when I can test the patch.

rashiq commented 6 years ago

@kelson42 You can test now. I believe it's fixed.

kelson42 commented 6 years ago

@rashiq Thx, it seems to work better but dies later now:

Downloading video thumbnail... A pro wrestler's guide to confidence
Downloading video... A precise, three-word address for every place on earth
Downloading speaker image... A precise, three-word address for every place on earth
Downloading video thumbnail... A precise, three-word address for every place on earth
Downloading video... Portraits that transform people into whatever they want to be
Downloading speaker image... Portraits that transform people into whatever they want to be
Downloading video thumbnail... Portraits that transform people into whatever they want to be
Downloading video... The new age of corporate monopolies
Downloading speaker image... The new age of corporate monopolies
Downloading video thumbnail... The new age of corporate monopolies
Downloading video... We can hack our immune cells to fight cancer
Downloading speaker image... We can hack our immune cells to fight cancer
Downloading video thumbnail... We can hack our immune cells to fight cancer
Traceback (most recent call last):
  File "./scraper/ted.py", line 89, in <module>
    App()
  File "./scraper/ted.py", line 18, in __init__
    self.run()
  File "./scraper/ted.py", line 54, in run
    scraper.render_welcome_page()
  File "/media/kelson/SOTOKI/ted/scraper/webscraper.py", line 270, in render_welcome_page
    template = env.get_template('welcome.html')
  File "/media/kelson/SOTOKI/ted/venv/local/lib/python2.7/site-packages/jinja2/environment.py", line 791, in get_template
    return self._load_template(name, self.make_globals(globals))
  File "/media/kelson/SOTOKI/ted/venv/local/lib/python2.7/site-packages/jinja2/environment.py", line 765, in _load_template
    template = self.loader.load(self, name, globals)
  File "/media/kelson/SOTOKI/ted/venv/local/lib/python2.7/site-packages/jinja2/loaders.py", line 113, in load
    source, filename, uptodate = self.get_source(environment, name)
  File "/media/kelson/SOTOKI/ted/venv/local/lib/python2.7/site-packages/jinja2/loaders.py", line 178, in get_source
    raise TemplateNotFound(template)
jinja2.exceptions.TemplateNotFound: welcome.html
rashiq commented 6 years ago

@kelson42 as discussed, please call the script from the scraper directory
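
The `TemplateNotFound` error above is consistent with the template search path being resolved relative to the current working directory, though the thread does not show how the scraper configures Jinja2. A minimal sketch of one way to make the lookup independent of where the script is called from, assuming a hypothetical `templates` directory next to the script:

```python
import os


def templates_dir(script_file, subdir="templates"):
    """Return the absolute path of `subdir` located next to `script_file`.

    `subdir="templates"` is a hypothetical directory name; the real
    layout of the scraper is not shown in this thread.
    """
    script_dir = os.path.dirname(os.path.abspath(script_file))
    return os.path.join(script_dir, subdir)


# Possible usage inside the script (assumption, not the project's code):
#   loader = jinja2.FileSystemLoader(templates_dir(__file__))
#   env = jinja2.Environment(loader=loader)
```

With a path built from `__file__`, running `./scraper/ted.py` from the repository root and `./ted.py` from the scraper directory would resolve to the same template directory.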

kelson42 commented 6 years ago

Unfortunately this does not work properly. As you can see, we have almost no video files... and for some reason the ZIM files are not created (but if I call the zimwriterfs command line by hand, it works):

Output #0, webm, to '/media/kelson/SOTOKI/ted/scraper/../build/TED/html/technology/3366/video.webm':
  Metadata:
    major_brand     : isom
    minor_version   : 1
    compatible_brands: isom
    category        : Higher Education
    podcast         : 1
    media_type      : 0
    title           : TED: Chris Sheldrick (2017 Global)
    artist          : TED
    date            : 2017
    album           : TEDTalks
    comment         : To learn more about this speaker, find other TEDTalks, and subscribe to this Podcast series, visit www.TED.com
                    : Feedback: contact@ted.com
    genre           : Podcast
    encoder         : Lavf57.71.100
    Stream #0:0(und): Video: vp8 (libvpx), yuv420p(progressive), 480x270 [SAR 1:1 DAR 16:9], q=30-42, 300 kb/s, 25 fps, 1k tbn, 25 tbc (default)
    Metadata:
      creation_time   : 2016-10-24T21:04:59.000000Z
      handler_name    : VideoHandler
      encoder         : Lavc57.89.100 libvpx
    Side data:
      cpb: bitrate max/min/avg: 0/0/0 buffer size: 1000000 vbv_delay: -1
    Stream #0:1(und): Audio: vorbis (libvorbis), 44100 Hz, mono, fltp, 128 kb/s (default)
    Metadata:
      creation_time   : 2017-10-16T18:47:29.000000Z
      handler_name    : GPAC ISO Audio Handler
      encoder         : Lavc57.89.100 libvorbis
frame= 8064 fps= 97 q=0.0 Lsize=    9422kB time=00:05:22.65 bitrate= 239.2kbits/s speed=3.89x    
video:4151kB audio:5037kB subtitle:0kB other streams:0kB global headers:4kB muxing overhead: 2.541558%
Converting Video... A precise, three-word address for every place on earth
Resizing /media/kelson/SOTOKI/ted/scraper/../build/TED/html/technology/3579/thumbnail.jpg
Resizing /media/kelson/SOTOKI/ted/scraper/../build/TED/html/technology/3595/thumbnail.jpg
Resizing /media/kelson/SOTOKI/ted/scraper/../build/TED/html/technology/3573/thumbnail.jpg
Resizing /media/kelson/SOTOKI/ted/scraper/../build/TED/html/technology/3366/thumbnail.jpg
Resizing /media/kelson/SOTOKI/ted/scraper/../build/TED/html/business/3595/thumbnail.jpg
Resizing /media/kelson/SOTOKI/ted/scraper/../build/TED/html/business/3585/thumbnail.jpg
Resizing /media/kelson/SOTOKI/ted/scraper/../build/TED/html/business/2890/thumbnail.jpg
Resizing /media/kelson/SOTOKI/ted/scraper/../build/TED/html/science/3583/thumbnail.jpg
Resizing /media/kelson/SOTOKI/ted/scraper/../build/TED/html/science/3587/thumbnail.jpg
Resizing /media/kelson/SOTOKI/ted/scraper/../build/TED/html/science/3579/thumbnail.jpg
Creating ZIM files
    Writting ZIM for TED talks - Technology
zimwriterfs --welcome="index.html" --favicon="favicon.png" --language="eng" --title="TED talks - Technology" --description="Ideas worth spreading" --creator="TED" --publisher="Kiwix" "/media/kelson/SOTOKI/ted/scraper/../build/TED/html/technology" "/media/kelson/SOTOKI/ted/scraper/../build/TED/zim/ted_en_technology_2017-11.zim"
Successfuly created ZIM file at /media/kelson/SOTOKI/ted/scraper/../build/TED/zim/ted_en_technology_2017-11.zim
    Writting ZIM for TED talks - Entertainment
zimwriterfs --welcome="index.html" --favicon="favicon.png" --language="eng" --title="TED talks - Entertainment" --description="Ideas worth spreading" --creator="TED" --publisher="Kiwix" "/media/kelson/SOTOKI/ted/scraper/../build/TED/html/entertainment" "/media/kelson/SOTOKI/ted/scraper/../build/TED/zim/ted_en_entertainment_2017-11.zim"
Successfuly created ZIM file at /media/kelson/SOTOKI/ted/scraper/../build/TED/zim/ted_en_entertainment_2017-11.zim
    Writting ZIM for TED talks - Design
zimwriterfs --welcome="index.html" --favicon="favicon.png" --language="eng" --title="TED talks - Design" --description="Ideas worth spreading" --creator="TED" --publisher="Kiwix" "/media/kelson/SOTOKI/ted/scraper/../build/TED/html/design" "/media/kelson/SOTOKI/ted/scraper/../build/TED/zim/ted_en_design_2017-11.zim"
Successfuly created ZIM file at /media/kelson/SOTOKI/ted/scraper/../build/TED/zim/ted_en_design_2017-11.zim
    Writting ZIM for TED talks - Business
zimwriterfs --welcome="index.html" --favicon="favicon.png" --language="eng" --title="TED talks - Business" --description="Ideas worth spreading" --creator="TED" --publisher="Kiwix" "/media/kelson/SOTOKI/ted/scraper/../build/TED/html/business" "/media/kelson/SOTOKI/ted/scraper/../build/TED/zim/ted_en_business_2017-11.zim"
Successfuly created ZIM file at /media/kelson/SOTOKI/ted/scraper/../build/TED/zim/ted_en_business_2017-11.zim
    Writting ZIM for TED talks - Science
zimwriterfs --welcome="index.html" --favicon="favicon.png" --language="eng" --title="TED talks - Science" --description="Ideas worth spreading" --creator="TED" --publisher="Kiwix" "/media/kelson/SOTOKI/ted/scraper/../build/TED/html/science" "/media/kelson/SOTOKI/ted/scraper/../build/TED/zim/ted_en_science_2017-11.zim"
Successfuly created ZIM file at /media/kelson/SOTOKI/ted/scraper/../build/TED/zim/ted_en_science_2017-11.zim
    Writting ZIM for TED talks - Global issues
zimwriterfs --welcome="index.html" --favicon="favicon.png" --language="eng" --title="TED talks - Global issues" --description="Ideas worth spreading" --creator="TED" --publisher="Kiwix" "/media/kelson/SOTOKI/ted/scraper/../build/TED/html/global issues" "/media/kelson/SOTOKI/ted/scraper/../build/TED/zim/ted_en_global_issues_2017-11.zim"
Successfuly created ZIM file at /media/kelson/SOTOKI/ted/scraper/../build/TED/zim/ted_en_global_issues_2017-11.zim
(venv) kelson@camber:/media/kelson/SOTOKI/ted/scraper$ find video.webm
find: ‘video.webm’: No such file or directory
(venv) kelson@camber:/media/kelson/SOTOKI/ted/scraper$
rashiq commented 6 years ago

@kelson42 hmm interesting, will take a look

kelson42 commented 6 years ago

@rashiq Do you think you would have time to have a look at this bug this weekend?

rashiq commented 6 years ago

@kelson42 I'm on the issue, looks like TED retagged their videos with a lot more categories than we have hardcoded, which in turn messes with our encoding

kelson42 commented 6 years ago

@rashiq Thank you! Not so surprising over the years.

kelson42 commented 6 years ago
$ find . -name "*webm"
./build/TED/html/technology/3579/video.webm
./build/TED/html/technology/3595/video.webm
./build/TED/html/technology/3573/video.webm
./build/TED/html/technology/3366/video.webm
./build/TED/html/business/3595/video.webm
./build/TED/html/business/3585/video.webm
./build/TED/html/business/2890/video.webm
./build/TED/html/science/3583/video.webm
./build/TED/html/science/3587/video.webm
./build/TED/html/science/3579/video.webm
dattaz commented 6 years ago

We get rate limited when scraping the video pages.
Solution 1: add a small delay between requests, something like 5 seconds, but it's slow (5 * nb_video + 5 * nb_page ~= 4h just to download all metadata).
Solution 2: read the response status code and sleep for a short time when the HTTP code is 429.
Solution 3: find out how many requests we can make without sleeping, then keep a counter and sleep when this limit is reached.

kelson42 commented 6 years ago

Solution 2 seems to me the best one, because we have no guarantee that the rate limit will not change again in the future.
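
Solution 2 could be sketched roughly as follows. This is not the code from any branch mentioned here: `fetch_with_backoff`, its parameters and the 5-second base wait are assumptions chosen for illustration. `fetch` is any callable that returns an object with `status_code` and `headers`, for example `requests.get`.

```python
import time


def fetch_with_backoff(fetch, url, max_attempts=5, base_wait=5.0,
                       sleep=time.sleep):
    """Call fetch(url) and retry while the server answers HTTP 429.

    Hypothetical helper: waits for the number of seconds given in the
    Retry-After header when it is numeric, otherwise backs off
    exponentially starting from base_wait seconds.
    """
    for attempt in range(max_attempts):
        response = fetch(url)
        if response.status_code != 429:
            return response
        if attempt == max_attempts - 1:
            break  # no point sleeping after the last attempt
        try:
            wait = float(response.headers.get("Retry-After", ""))
        except (TypeError, ValueError):
            wait = base_wait * (2 ** attempt)
        sleep(wait)
    raise RuntimeError("still rate limited after %d attempts: %s"
                       % (max_attempts, url))
```

Something like `fetch_with_backoff(requests.get, url)` would then replace a direct `requests.get(url)` call. Because the wait is driven by the server's response rather than a fixed counter, it keeps working if the rate limit changes.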

kelson42 commented 6 years ago

@dattaz Your branch seemed to work fine... but after 24h the script died with the following error. Maybe your fix is still not 100% perfect ;)

Downloading video... Beats that defy boxes
Downloading speaker image... Beats that defy boxes
Downloading video thumbnail... Beats that defy boxes
Downloading video... HIV -- how to fight an epidemic of bad laws
Downloading speaker image... HIV -- how to fight an epidemic of bad laws
Downloading video thumbnail... HIV -- how to fight an epidemic of bad laws
Downloading video... The journey across the high wire
Traceback (most recent call last):
  File "./scraper/ted.py", line 89, in <module>
    App()
  File "./scraper/ted.py", line 18, in __init__
    self.run()
  File "./scraper/ted.py", line 53, in run
    scraper.download_video_data()
  File "/media/kelson/SOTOKI/ted/scraper/webscraper.py", line 460, in download_video_data
    raise e
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='download.ted.com', port=443): Max retries exceeded with url: /talks/PhilippePetit_2012.mp4?apikey=489b859150fc58263f17110eeb44ed5fba4a3b22 (Caused by <class 'socket.error'>: [Errno 111] Connection refused)
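
One possible way to keep a single refused connection from ending a 24h run is to record the failure and move on to the next talk. The sketch below only illustrates that idea; `download_all` and `download_one` are hypothetical names, not functions from the scraper.

```python
def download_all(talks, download_one, errors=(IOError,)):
    """Run download_one(talk) for every talk and collect failures.

    Hypothetical helper. The default catches IOError, which is the base
    class of requests.exceptions.RequestException (and therefore of its
    ConnectionError).
    """
    failed = []
    for talk in talks:
        try:
            download_one(talk)
        except errors as error:
            # Remember what went wrong instead of aborting the whole run
            failed.append((talk, error))
    return failed
```

The returned list could be retried at the end of the run or simply logged, so the remaining talks are still processed.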
kelson42 commented 6 years ago

Now it dies with

Downloading speaker image... A hospital tour in Nigeria
Downloading video thumbnail... A hospital tour in Nigeria
Downloading video... Moving sculpture
Downloading speaker image... Moving sculpture
Downloading video thumbnail... Moving sculpture
Downloading video... Designing objects that tell stories
Downloading speaker image... Designing objects that tell stories
Downloading video thumbnail... Designing objects that tell stories
Downloading video... The astonishing hidden world of the deep ocean
Downloading speaker image... The astonishing hidden world of the deep ocean
Traceback (most recent call last):
  File "./scraper/ted.py", line 89, in <module>
    App()
  File "./scraper/ted.py", line 18, in __init__
    self.run()
  File "./scraper/ted.py", line 53, in run
    scraper.download_video_data()
  File "/media/kelson/SOTOKI/ted/scraper/webscraper.py", line 472, in download_video_data
    r = utils.download_from_site(video_speaker)
  File "/media/kelson/SOTOKI/ted/scraper/utils.py", line 52, in download_from_site
    r = requests.get(url, headers = headers)
  File "/media/kelson/SOTOKI/ted/venv/local/lib/python2.7/site-packages/requests/api.py", line 55, in get
    return request('get', url, **kwargs)
  File "/media/kelson/SOTOKI/ted/venv/local/lib/python2.7/site-packages/requests/api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
  File "/media/kelson/SOTOKI/ted/venv/local/lib/python2.7/site-packages/requests/sessions.py", line 349, in request
    prep = self.prepare_request(req)
  File "/media/kelson/SOTOKI/ted/venv/local/lib/python2.7/site-packages/requests/sessions.py", line 287, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "/media/kelson/SOTOKI/ted/venv/local/lib/python2.7/site-packages/requests/models.py", line 287, in prepare
    self.prepare_url(url, params)
  File "/media/kelson/SOTOKI/ted/venv/local/lib/python2.7/site-packages/requests/models.py", line 338, in prepare_url
    "Perhaps you meant http://{0}?".format(url))
requests.exceptions.MissingSchema: Invalid URL u'/images/default_254x191.jpg': No schema supplied. Perhaps you meant http:///images/default_254x191.jpg?
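
The `MissingSchema` error comes from passing the relative path `/images/default_254x191.jpg` straight to `requests.get`. A minimal sketch of resolving such paths before the download, assuming they are relative to `https://www.ted.com/` (the thread does not state the base URL) and using a hypothetical helper name:

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

# Assumption: relative image paths belong to the main TED site.
BASE_URL = "https://www.ted.com/"


def absolute_url(url, base=BASE_URL):
    """Return url unchanged if it is already absolute, else join it to base."""
    return urljoin(base, url)
```

Absolute URLs pass through untouched, so the helper could be applied to every speaker image URL before it reaches `download_from_site`.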
rashiq commented 6 years ago

@kelson42 does it die right away for the first speaker image?