raffaem / cs-dlp

Script for downloading Coursera.org videos and naming them.
GNU Lesser General Public License v3.0

cs-dlp

This is a fork of coursera-dl that works with modern Python and modern coursera.org, with added features and patches.

Introduction

This script makes it easier to batch download lecture resources (e.g., videos, ppt, etc.) for Coursera classes. Given one or more class names, it obtains week and class names from the lectures page, and then downloads the related materials into appropriately named files and directories.

This work was originally inspired in part by youtube-dl, with which I've downloaded many other good videos, such as those from Khan Academy.

Features

Disclaimer

cs-dlp is meant to be used only for material that Coursera gives you access to download.

We do not encourage any use that violates their Terms Of Use. A relevant excerpt:

"[...] Coursera grants you a personal, non-exclusive, non-transferable license to access and use the Sites. You may download material from the Sites only for your own personal, non-commercial use. You may not otherwise copy, reproduce, retransmit, distribute, publish, commercially exploit or otherwise transfer any material, nor may you modify or create derivatives works of the material."

Installation instructions

cs-dlp requires Python 3 and a Coursera account enrolled in the class of interest.

Note: cs-dlp is not compatible with Python 2.

On any operating system, make sure the location of the Python executable is in your PATH environment variable. Once the dependencies are installed (see the next section), basic usage consists of invoking the script from the main directory of the project, prefixed with the word python. More advanced features are described in the "Running the script" section of this document.

Note: You must already have (manually) agreed to the Honor Code of the particular courses that you want to use with cs-dlp.

Installing from source

From a command line (preferably, from a virtual environment), issue the following commands:

git clone https://github.com/raffaem/cs-dlp
cd cs-dlp
python -m pip install --user .

Note: We strongly recommend that you don't install the package globally on your machine (i.e., with root/administrator privileges), as the installed modules may conflict with other Python applications installed on your system. Prefer the --user option to pip install. Inside a virtual environment, drop --user: pip normally refuses user installs there, and the environment already isolates the package.

ArchLinux

cs-dlp does not currently have an AUR package. Help welcome!

Create an account with Coursera

If you don't already have one, create a Coursera account and enroll in a class. See https://www.coursera.org/courses for the list of classes.

Authenticating

To authenticate with Coursera, you need a CAUTH cookie.

There are currently two supported ways to do so: you can have cs-dlp get it automatically from your browser, or you can pass one manually.

  1. Automatic way

    1. Open your favorite browser and log in to Coursera.
    2. Call cs-dlp with the --cauth-auto browser option.

      Valid values for browser are:

      • chrome for Google Chrome
      • chromium
      • opera
      • opera_gx
      • brave
      • edge
      • vivaldi
      • firefox
      • librewolf
      • safari
  2. Manual way

    Pass a CAUTH cookie to the --cauth option.
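
For the manual way, one option is to keep the cookie value in a file and read it at call time, so that it does not end up in your shell history. The sketch below is only an illustration: the function name cs_dlp_with_cookie, the cookie file path, and the CS_DLP override variable (handy for dry runs) are assumptions made for this example and are not part of cs-dlp. Only the --cauth option comes from the text above.

```shell
# cs_dlp_with_cookie COOKIE_FILE [ARGS...]
# Hypothetical wrapper (not part of cs-dlp): reads the CAUTH value from
# COOKIE_FILE and passes it to the --cauth option, followed by ARGS.
cs_dlp_with_cookie() {
  cookie_file=$1
  shift
  # Strip line endings so the cookie is passed as a single clean argument.
  cauth=$(tr -d '\r\n' < "$cookie_file")
  if [ -z "$cauth" ]; then
    echo "cookie file is empty: $cookie_file" >&2
    return 1
  fi
  # CS_DLP is an assumed override for the executable name (e.g. echo for a dry run).
  "${CS_DLP:-cs-dlp}" --cauth "$cauth" "$@"
}
```

Usage, assuming the cookie was saved to ~/.coursera-cauth with restrictive permissions (chmod 600): cs_dlp_with_cookie ~/.coursera-cauth modelthinking-004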

Running the script

Refer to cs-dlp --help for a complete, up-to-date reference on the runtime options supported by this utility.

Run the script to download the materials by providing your Coursera CAUTH cookie, the class names, as well as any additional parameters:

cs-dlp --cauth-auto chrome modelthinking-004

Here are some examples of how to invoke cs-dlp from the command line:

    Multiple classes:            cs-dlp --cauth-auto chrome saas historyofrock1-001 algo-2012-002
    Filter by section name:      cs-dlp --cauth-auto chrome -sf "Chapter_Four" crypto-004
    Filter by lecture name:      cs-dlp --cauth-auto chrome -lf "3.1_" ml-2012-002
    Download only ppt files:     cs-dlp --cauth-auto chrome -f "ppt" qcomp-2012-001
    Get the preview classes:     cs-dlp --cauth-auto chrome -b ni-001
    Download videos at 720p:     cs-dlp --cauth-auto chrome --video-resolution 720p ni-001
    Specify download path:       cs-dlp --cauth-auto chrome --path=C:\Coursera\Classes\ comnetworks-002
    Display help:                cs-dlp --help

    Maintain a list of classes in a dir:
      Initialize:              mkdir -p CURRENT/{class1,class2,..classN}
      Update:                  cs-dlp -n --path CURRENT `\ls CURRENT`

Note: If your ls command is aliased to display colorized output, you may experience problems. Be sure to escape the ls command (use \ls) to ensure that no special characters get sent to the script.
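
An alternative that sidesteps ls aliases entirely is to build the class list from a shell glob. The following is a minimal sketch; list_classes is a hypothetical helper name introduced here, not something shipped with cs-dlp.

```shell
# list_classes DIR
# Hypothetical helper: prints the name of every subdirectory of DIR,
# one per line, using a glob instead of parsing ls output.
list_classes() {
  for entry in "$1"/*/; do
    # When DIR has no subdirectories the pattern stays unexpanded; skip it.
    [ -d "$entry" ] || continue
    basename "$entry"
  done
}
```

Usage, relying on word splitting (class names contain no spaces): cs-dlp -n --path CURRENT $(list_classes CURRENT)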

Note that we do support the New Platform ("on-demand") courses.

By default, videos are downloaded at 540p resolution. For on-demand courses, the --video-resolution flag accepts 360p, 540p, and 720p values.

To download just the .txt and/or .srt subtitle files instead of the videos, use --ignore-formats mp4 --subtitle-language en, substituting the format your videos are encoded in and the subtitle languages you want.

If you want to store your preferred parameters, create a file named coursera-dl.conf in the directory from which you run the script, with the following format:

    --subtitle-language en,zh-CN|zh-TW
    --download-quizzes
    #--mathjax-cdn https://cdn.bootcss.com/mathjax/2.7.1/MathJax.js
    # other parameters

Parameters specified in the file are overridden if they are provided again on the command line.

Note: In coursera-dl.conf, parameters must not be wrapped in quotes.
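
The quoting difference matters for values such as en,zh-CN|zh-TW: typed directly on a command line, the | character would have to be quoted to keep the shell from treating it as a pipe, whereas in the file it is written bare. As a rough sketch, the file could be created from a shell like this; write_sample_conf is a hypothetical helper name used only for this example.

```shell
# write_sample_conf DIR
# Hypothetical helper: writes DIR/coursera-dl.conf with two of the options
# shown above.
write_sample_conf() {
  # The quoted 'EOF' delimiter stops the shell from expanding anything,
  # so the file receives the options exactly as typed, without quotes.
  cat > "$1/coursera-dl.conf" <<'EOF'
--subtitle-language en,zh-CN|zh-TW
--download-quizzes
EOF
}
```

Usage: write_sample_conf . creates the file in the current directory; run cs-dlp from that same directory so the file is picked up.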

Resuming downloads

By default, when you interrupt a download by pressing CTRL+C, partially downloaded files are deleted from your disk and the download has to start again from the beginning. If the download is interrupted by something other than KeyboardInterrupt (CTRL+C), such as a sudden system crash, the partially downloaded files remain on your disk. The next time you start the process, these files are dropped from the download list (that is, skipped), so it is your job to delete them manually before the next start. To avoid this, use the --resume option, which continues your downloads from where they stopped:

cs-dlp --cauth-auto chrome --resume sdn1-001

This option can also be used with external downloaders:

cs-dlp --cauth-auto chrome --wget --resume sdn1-001

Note 1: Some external downloaders use their own built-in resume feature which may not be compatible with others, so use them at your own risk.

Note 2: Remember that in resume mode, interrupted files WON'T be deleted from your disk.
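
On a flaky connection, --resume can be combined with a small retry loop. The sketch below assumes that cs-dlp exits with a non-zero status when a run fails, which this document does not confirm. The retry_resume function and the RETRY_DELAY variable are hypothetical names introduced for illustration.

```shell
# retry_resume ATTEMPTS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND up to ATTEMPTS times and stops at the
# first success. ASSUMPTION: the command signals failure via its exit status.
retry_resume() {
  attempts=$1
  shift
  n=1
  while :; do
    "$@" && return 0
    # Give up once the allowed number of attempts has been used.
    [ "$n" -ge "$attempts" ] && return 1
    n=$((n + 1))
    # RETRY_DELAY (seconds between attempts) is an assumed knob; default 5.
    sleep "${RETRY_DELAY:-5}"
  done
}
```

Usage: retry_resume 3 cs-dlp --cauth-auto chrome --resume sdn1-001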

Troubleshooting

If you have problems downloading class materials, check whether one of the following actions solves your problem:

China issues

If you are in China and have problems downloading videos, adding the line "52.84.167.78 d3c33hcgiwev3.cloudfront.net" to your hosts file (/etc/hosts on Linux and macOS) and then flushing the DNS cache (ipconfig /flushdns on Windows) may help (see https://github.com/googlehosts/hosts for more info).

Found 0 sections and 0 lectures on this page

First of all, make sure you are enrolled in the course you want to download.

Many old courses have already closed enrollment, so enrolling is often not an option. In this case, try downloading with the --preview option. Some courses allow downloading lecture materials without enrolling, but this is not common and is not guaranteed to work for every course.

Finally, you can download the videos if you have, at least, the index file that lists all the course materials. Maybe a friend who is enrolled could save that course page for you. In that case, use the --process_local_page option.

Alternatively, you may want to try one of the various browser extensions designed for this problem.

If none of the above works for you, there is nothing we can do.

Download timeouts

cs-dlp supports external downloaders, but note that they are only used to download materials after the syllabus has been parsed, e.g. videos, PDFs, some handouts and additional files (the syllabus itself is always downloaded using the internal downloader). If you experience problems downloading such materials, you may want to switch to an external downloader and configure its timeout values. For example, you can use the aria2c downloader by passing the --aria2 option:

cs-dlp --cauth-auto chrome --path . --aria2 <course-name>

And put this into aria2c's configuration file ~/.aria2/aria2.conf to reduce timeouts:

connect-timeout=2
timeout=2
bt-stop-timeout=1

Timeout configuration for the internal downloader is not supported.
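
If you prefer to script the aria2c change, the following sketch appends the three settings shown above to the configuration file only when the corresponding key is not already present. set_aria2_timeouts is a hypothetical helper name. The default path is the one mentioned above.

```shell
# set_aria2_timeouts [CONF]
# Hypothetical helper: adds the timeout settings to CONF (default
# ~/.aria2/aria2.conf) without duplicating or overwriting existing keys.
set_aria2_timeouts() {
  conf=${1:-"$HOME/.aria2/aria2.conf"}
  mkdir -p "$(dirname "$conf")"
  touch "$conf"
  for setting in connect-timeout=2 timeout=2 bt-stop-timeout=1; do
    key=${setting%%=*}
    # Anchor at line start so "timeout=" does not match "connect-timeout=".
    grep -q "^$key=" "$conf" || printf '%s\n' "$setting" >> "$conf"
  done
}
```

Calling set_aria2_timeouts more than once leaves the file unchanged after the first run, and a value you already set by hand is kept as-is.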

Windows: proxy support

If you're on Windows behind a proxy, set up the environment variables before running the script as follows:

set HTTP_PROXY=http://host:port
set HTTPS_PROXY=http://host:port

Related discussion: #205

Alternative CDN for MathJax.js

When saving a course page, cs-dlp enables MathJax rendering of math equations by injecting MathJax.js into the page header. By default, the script uses a CDN service provided by mathjax.org. That URL is not accessible in some countries/regions; in that case, pass the --mathjax-cdn <MATHJAX_CDN> parameter to point to a MathJax.js file that is accessible in your region.

Reporting issues

Before reporting any issue please follow the steps below:

  1. Verify that you are running the latest version of the script

  2. If the problem persists, feel free to open an issue in our bug tracker. Please fill in the issue template with as much information as possible.

Donations

You can support the project by sponsoring me: