mjbright / futurelearn-dl

A script to download materials from the FutureLearn website (for enrolled courses)
GNU General Public License v3.0

Failing to Download PDFs Ends Download: Long outstanding issue! #14

Closed zenny closed 3 years ago

zenny commented 4 years ago

Trying to download https://www.futurelearn.com/courses/research-question fails when the script tries to download a PDF (which may not exist). Is there a way to skip that specific PDF and continue downloading? Thanks!

$ ./TEST_futurelearn-dl.py_research-question.sh 
Downloading 5-week course 'research-question'
Downloading url<https://view.vzaar.com/11288441/video>
    to file <research-question/week1/5.2-What-our-current-students-say_11288441.mp4> ...
type=mp4, content.len=36893991
Downloading url<https://view.vzaar.com/13650997/video>
    to file <research-question/week1/5.3-Studying-for-a-PhD-by-distance-learning_13650997.mp3> ...
type=mp3, content.len=6218010
Downloading url<https://view.vzaar.com/11288166/video>
    to file <research-question/week1/5.4-Professor-Martin-Parker,-Director-of-Research-in-the-School-of-Business_11288166.mp4> ...
type=mp4, content.len=78980655
Downloading url<https://view.vzaar.com/11369808/video>
    to file <research-question/week1/5.5-Professor-Kirsten-Malmkjaer,-Professor-of-Translation-Studies_11369808.mp3> ...
type=mp3, content.len=14569992
Downloading url<"http://fass.open.ac.uk/sites/fass.open.ac.uk/files/files/research/sample-research-proposal.pdf>
    to file <research-question/week1/5.15-Bringing-it-all-together_sample-research-proposal.pdf> ...
Traceback (most recent call last):
  File "./futurelearn-dl.py", line 625, in <module>
    getCourseWeekStepPage(course_id, week_id, step_id, week_num, title)
  File "./futurelearn-dl.py", line 232, in getCourseWeekStepPage
    downloadURLsInPage(course_id, week_id, step_id, week_num, content, DOWNLOAD_TYPE, page_title)
  File "./futurelearn-dl.py", line 386, in downloadURLsInPage
    downloadURLInPage(url, download_dir, DOWNLOAD_TYPE, page_title)
  File "./futurelearn-dl.py", line 452, in downloadURLInPage
    downloadURLToFile(url, ofile, DOWNLOAD_TYPE)
  File "./futurelearn-dl.py", line 405, in downloadURLToFile
    response = session.get(url, headers=headers)
  File "/home/zenny/.local/lib/python3.4/site-packages/requests/sessions.py", line 546, in get
    return self.request('GET', url, **kwargs)
  File "/home/zenny/.local/lib/python3.4/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/zenny/.local/lib/python3.4/site-packages/requests/sessions.py", line 640, in send
    adapter = self.get_adapter(url=request.url)
  File "/home/zenny/.local/lib/python3.4/site-packages/requests/sessions.py", line 731, in get_adapter
    raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for '"http://fass.open.ac.uk/sites/fass.open.ac.uk/files/files/research/sample-research-proposal.pdf'
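
On the question of skipping a failing file: the traceback shows the exception from `session.get` propagating all the way up to the module level, which ends the run. A minimal sketch of a skip-and-continue loop follows. The names `download_all`, `fetch` and `skip_on` are hypothetical and not taken from futurelearn-dl.py; `fetch` stands in for whatever performs the actual request. The default `skip_on` relies on the fact that the exceptions raised by requests derive from `IOError`/`OSError` (and `InvalidSchema` additionally from `ValueError`).

```python
def download_all(urls, fetch, skip_on=(ValueError, OSError)):
    """Try every URL; record failures instead of aborting the whole run.

    fetch   -- hypothetical callable that downloads one URL and raises on error
    skip_on -- exception types treated as "skip this file and carry on"
    """
    done, failed = [], []
    for url in urls:
        try:
            fetch(url)
        except skip_on as exc:
            # Report the problem and move on to the next URL.
            print("Skipping <%s>: %s" % (url, exc))
            failed.append((url, str(exc)))
            continue
        done.append(url)
    return done, failed


def fake_fetch(url):
    """Stand-in for a real download; rejects URLs without an http(s) scheme."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("No connection adapters were found for %r" % url)


# Illustrative URLs only; the second one has a stray leading quote.
demo_urls = [
    "https://example.org/a.mp4",
    '"http://example.org/b.pdf',
    "https://example.org/c.pdf",
]
done, failed = download_all(demo_urls, fake_fetch)
```

With this shape, the file after the bad URL is still attempted, and `failed` can be printed at the end of the run so nothing is silently lost.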

Same here: https://www.futurelearn.com/courses/mindfulness-wellbeing-performance/

Downloading 4-week course 'mindfulness-wellbeing-performance'
Downloading url<http://www.danielgilbert.com/KILLINGSWORTH%20&amp;%20GILBERT%20(2010).pdf>
    to file <mindfulness-wellbeing-performance/week1/1.4-What-is-mindfulness-and-why-does-it-matter_KILLINGSWORTH_&amp;_GILBERT_(2010).pdf> ...
downloadURLToFile: Failed to download url <http://www.danielgilbert.com/KILLINGSWORTH%20&amp;%20GILBERT%20(2010).pdf> => 403
Downloading url<"http://www.danielgilbert.com/KILLINGSWORTH%20&amp;amp;%20GILBERT%20(2010).pdf>
    to file <mindfulness-wellbeing-performance/week1/1.4-What-is-mindfulness-and-why-does-it-matter_KILLINGSWORTH_&amp;amp;_GILBERT_(2010).pdf> ...
Traceback (most recent call last):
  File "./futurelearn-dl.py", line 625, in <module>
    getCourseWeekStepPage(course_id, week_id, step_id, week_num, title)
  File "./futurelearn-dl.py", line 232, in getCourseWeekStepPage
    downloadURLsInPage(course_id, week_id, step_id, week_num, content, DOWNLOAD_TYPE, page_title)
  File "./futurelearn-dl.py", line 386, in downloadURLsInPage
    downloadURLInPage(url, download_dir, DOWNLOAD_TYPE, page_title)
  File "./futurelearn-dl.py", line 452, in downloadURLInPage
    downloadURLToFile(url, ofile, DOWNLOAD_TYPE)
  File "./futurelearn-dl.py", line 405, in downloadURLToFile
    response = session.get(url, headers=headers)
  File "/home/zenny/.local/lib/python3.4/site-packages/requests/sessions.py", line 546, in get
    return self.request('GET', url, **kwargs)
  File "/home/zenny/.local/lib/python3.4/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/zenny/.local/lib/python3.4/site-packages/requests/sessions.py", line 640, in send
    adapter = self.get_adapter(url=request.url)
  File "/home/zenny/.local/lib/python3.4/site-packages/requests/sessions.py", line 731, in get_adapter
    raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for '"http://www.danielgilbert.com/KILLINGSWORTH%20&amp;amp;%20GILBERT%20(2010).pdf'
Look for new files with - find /home/zenny/Downloads//Education/FUTURELEARN/mindfulness-wellbeing-performance -type f -exec ls -altr {} \;

zenny commented 4 years ago

Downloading PDF components appears to break the download from that point onwards:

Downloading 3-week course 'covid19-novel-coronavirus'
Downloading url<https://view.vzaar.com/21357702/video>
    to file <covid19-novel-coronavirus/week1/1.1-Welcome-to-the-course_21357702.mp3> ...
type=mp3, content.len=2361874
Downloading url<https://view.vzaar.com/21368871/download/sd_lower>
    to file <covid19-novel-coronavirus/week1/1.5-Overview-of-the-coronavirus-and-COVID-19_sd_lowe.mp4> ...
type=mp4, content.len=11155802
Downloading url<https://www.who.int/docs/default-source/coronaviruse/who-china-joint-mission-on-covid-19-final-report.pdf>
    to file <covid19-novel-coronavirus/week1/1.6-What-are-the-key-points-in-the-outbreak-to-date_who-china-joint-mission-on-covid-19-final-report.pdf> ...
type=pdf, content.len=1562547
Downloading url<"https://www.cell.com/cell-host-microbe/pdf/S1931-3128(20)30072-X.pdf>
    to file <covid19-novel-coronavirus/week1/1.7-How-was-the-novel-coronavirus-identified_S1931-3128(20)30072-X.pdf> ...
Traceback (most recent call last):
  File "./futurelearn-dl.py", line 625, in <module>
    getCourseWeekStepPage(course_id, week_id, step_id, week_num, title)
  File "./futurelearn-dl.py", line 232, in getCourseWeekStepPage
    downloadURLsInPage(course_id, week_id, step_id, week_num, content, DOWNLOAD_TYPE, page_title)
  File "./futurelearn-dl.py", line 386, in downloadURLsInPage
    downloadURLInPage(url, download_dir, DOWNLOAD_TYPE, page_title)
  File "./futurelearn-dl.py", line 452, in downloadURLInPage
    downloadURLToFile(url, ofile, DOWNLOAD_TYPE)
  File "./futurelearn-dl.py", line 405, in downloadURLToFile
    response = session.get(url, headers=headers)
  File "/usr/lib/python3.8/site-packages/requests/sessions.py", line 543, in get
    return self.request('GET', url, **kwargs)
  File "/usr/lib/python3.8/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3.8/site-packages/requests/sessions.py", line 637, in send
    adapter = self.get_adapter(url=request.url)
  File "/usr/lib/python3.8/site-packages/requests/sessions.py", line 728, in get_adapter
    raise InvalidSchema("No connection adapters were found for {!r}".format(url))
requests.exceptions.InvalidSchema: No connection adapters were found for '"https://www.cell.com/cell-host-microbe/pdf/S1931-3128(20)30072-X.pdf'

Any clues?
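
One clue from the logs: in all three tracebacks the failing URL starts with a stray `"` character (for example `'"https://www.cell.com/...`), so requests finds no adapter matching the `http://` or `https://` prefix and raises `InvalidSchema`. One of the URLs also carries a double-encoded `&amp;amp;`. A minimal sketch of a cleanup step that could run before the request is shown below. The helper name `clean_url` is hypothetical, and where the quote actually comes from in the page parsing is an assumption I have not verified.

```python
import html


def clean_url(raw):
    """Strip stray quotes/whitespace and undo HTML entity encoding in a URL."""
    url = raw.strip().strip("\"'")
    # Unescape repeatedly so double-encoded entities such as "&amp;amp;"
    # collapse to a plain "&"; the loop stops once nothing changes.
    previous = None
    while previous != url:
        previous = url
        url = html.unescape(url)
    return url


# The two malformed URLs are copied from the logs above.
cell_url = clean_url(
    '"https://www.cell.com/cell-host-microbe/pdf/S1931-3128(20)30072-X.pdf'
)
gilbert_url = clean_url(
    '"http://www.danielgilbert.com/KILLINGSWORTH%20&amp;amp;%20GILBERT%20(2010).pdf'
)
print(cell_url)
print(gilbert_url)
```

This only repairs the URL string. A server-side refusal such as the `403` seen for the danielgilbert.com link would still need to be handled as a skipped download.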