marked / yahoo-group-archiver

Scrapes and archives a Yahoo groups email archives, photo galleries and file contents using the non-public API
MIT License
2 stars 2 forks source link

yahoo-group-archiver
pipeline mode

git clone https://github.com/ArchiveTeam/yahoo-group-archiver.git
cd yahoo-group-archiver
Usage:

pip3 install -r requirements.txt
run-pipeline3 ./pipeline.py $USERNAME

Watch:
http://tracker-test.ddns.net/yahoo-groups-api/


yahoo-group-archiver
manual mode

git clone https://github.com/ArchiveTeam/yahoo-group-archiver.git
cd yahoo-group-archiver

Usage:

pip install -r requirements.txt
./yahoo.py -ct '<T_cookie>' -cy '<Y_cookie>' '<groupid>'

You will need to get the T and Y cookie values from an authenticated browser session.

The cookies should look like this, where someLongText and someShortText are arbitrary strings:

In Google Chrome these steps are required:

  1. Go to Yahoo Groups.
  2. Click the 🔒 padlock, or ⓘ (circled letter i) in the address bar to the left of the website address.
  3. Click "Cookies".
  4. On the Allowed tab select "Yahoo.com" followed by "Cookies" in the tree listing.
  5. Select the T cookie, right click the value in the Content field and Select All. Then copy the value and paste in place of the <T_cookie> in the above command line.
  6. Select the Y cookie, right click the value in the Content field and Select All. Then copy the value and paste in place of the <Y_cookie> in the above command line.

In Firefox:

  1. Go to Yahoo Groups (make sure you're signed in with your account).
  2. Press Shift-F9 or select the menu Tools/Web Developer/Storage Inspector.
  3. Double click on the T cookie's value and copy the content in place of <T_cookie> in the above command line.
  4. Double click on the Y cookie's value and copy the content in place of <Y_cookie> in the above command line.

Note: the string you paste must be surrounded by quotes.

Using the --cookie-file (or -cf) option allows you to specify a file in which the authentication cookies will be loaded and saved in.

Files will be placed into the directory structure groupname/{email,files,photos,databases}

Command Line Options


usage: yahoo.py [-h] [-ct COOKIE_T] [-cy COOKIE_Y] [-ce COOKIE_E]
                [-cf COOKIE_FILE] [-e] [-at] [-f] [-i] [-t] [-r] [-d] [-l]
                [-c] [-p] [-a] [-m] [-o] [--user-agent USER_AGENT]
                [--start START] [--stop STOP] [--ids IDS [IDS ...]] [-w] [-v]
                [--colour] [--delay DELAY]
                group

positional arguments:
  group

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose
  --colour, --color     Colour log output to terminal
  --delay DELAY         Minimum delay between requests (default 0.2s)

Authentication Options:
  -ct COOKIE_T, --cookie_t COOKIE_T
                        T authentication cookie from yahoo.com
  -cy COOKIE_Y, --cookie_y COOKIE_Y
                        Y authentication cookie from yahoo.com
  -ce COOKIE_E, --cookie_e COOKIE_E
                        Additional EuConsent cookie is required in EU
  -cf COOKIE_FILE, --cookie-file COOKIE_FILE
                        File to store authentication cookies to. Cookies
                        passed on the command line will overwrite any already
                        in the file.

What to archive:
  By default, all the below.

  -e, --email           Only archive html and raw email and attachments (from email) 
                        through the messages API
  -at, --attachments    Only archive attachments (from attachments list)
  -f, --files           Only archive files
  -i, --photos          Only archive photo galleries
  -t, --topics          Only archive HTML email and attachments through 
                        the topics API
  -r, --raw             Only archive raw email without attachments through 
                        the messages API
  -d, --database        Only archive database
  -l, --links           Only archive links
  -c, --calendar        Only archive events
  -p, --polls           Only archive polls
  -a, --about           Only archive general info about the group
  -m, --members         Only archive members
  -o, --overwrite       Overwrite existing files such as email and database
                        records
  -na, --noattachments  Skip attachment downloading as part of topics and e-mails

Request Options:
  --user-agent USER_AGENT
                        Override the default user agent used to make requests

Message Range Options:
  Options to specify which messages to download. Use of multiple options
  will be combined. Note: These options will also try to fetch message IDs
  that may not exist in the group.

  --start START         Email message id to start from (specifying this will
                        cause only specified message contents to be
                        downloaded, and not message indexes). Default to 1, if
                        end option provided.
  --stop STOP           Email message id to stop at (inclusive), defaults to
                        last message ID available, if start option provided.
  --ids IDS [IDS ...]   Get email message by ID(s). Space separated,
                        terminated by another flag or --

Output Options:
  -w, --warc            Output WARC file of raw network requests. [Requires
                        warcio package installed]