This is a personal project for downloading images from the Pixiv website. The program grabs images at their original resolution, including manga and ugoira images, from specified users and saves them to a specified download directory.
Viewing restriction in your Pixiv account User settings
install Python 3.6+
install requests library
pip install --user requests
edit config.json file in data folder manually or via command line interface
save directory: the save directory path
users: the user ID shown in URL
display help message
$ python main.py -h
usage: main.py [-h] [-f FILE] [-l] [-u USERNAME] [-p PASSWORD] [-s SAVE_DIR]
               [-t THREADS]
               {artwork,bookmark,ranking} ...

positional arguments:
  {artwork,bookmark,ranking}
    artwork     download artworks from user IDs specified in "users" field
    bookmark    download bookmark artworks from user IDs specified in
                "bookmarks" field
    ranking     download top N ranking artworks based on given conditions

optional arguments:
  -h, --help    show this help message and exit
  -f FILE       load file for this instance (default: data/config.json)
  -l            list current settings
  -u USERNAME   set username
  -p PASSWORD   set password
  -s SAVE_DIR   set save directory path
  -t THREADS    set number of threads for this instance
display artwork help message
$ python main.py artwork -h
usage: main.py artwork [-h] [-a [ID ...]] [-d all [ID ...]] [-c all [ID ...]]

optional arguments:
  -h, --help       show this help message and exit
  -a [ID ...]      add user IDs
  -d all [ID ...]  delete user IDs and their directories
  -c all [ID ...]  clear directories of user IDs
display ranking help message
$ python main.py ranking -h
usage: main.py ranking [-h] -m MODE -c CONTENT -d YYYYMMDD [-n N]

optional arguments:
  -h, --help   show this help message and exit
  -m MODE      modes: {daily, weekly, monthly, rookie, original, male, female,
               daily_r18, weekly_r18, male_r18, female_r18}
  -c CONTENT   contents: {all, illust, ugoira, manga}
  -d YYYYMMDD  date
  -n N         get top N artworks (default: 20)
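For readers curious how a sub-command with this usage line could be declared, here is a minimal argparse sketch. It is a reconstruction from the help text only, not the project's actual code: the names build_parser and yyyymmdd, the dest names, and the strict date check are all assumptions made for illustration.

```python
import argparse
from datetime import datetime

# Values copied from the help text above.
MODES = ["daily", "weekly", "monthly", "rookie", "original", "male", "female",
         "daily_r18", "weekly_r18", "male_r18", "female_r18"]
CONTENTS = ["all", "illust", "ugoira", "manga"]


def yyyymmdd(value):
    """Hypothetical validator: accept only an 8-digit calendar date."""
    if len(value) != 8 or not value.isdigit():
        raise argparse.ArgumentTypeError(f"expected YYYYMMDD, got {value!r}")
    try:
        datetime.strptime(value, "%Y%m%d")
    except ValueError:
        raise argparse.ArgumentTypeError(f"not a valid date: {value!r}")
    return value


def build_parser():
    """Build a parser that only covers the ranking sub-command."""
    parser = argparse.ArgumentParser(prog="main.py")
    subparsers = parser.add_subparsers(dest="option")
    ranking = subparsers.add_parser("ranking")
    ranking.add_argument("-m", dest="mode", required=True,
                         choices=MODES, metavar="MODE")
    ranking.add_argument("-c", dest="content", required=True,
                         choices=CONTENTS, metavar="CONTENT")
    ranking.add_argument("-d", dest="date", required=True,
                         type=yyyymmdd, metavar="YYYYMMDD")
    ranking.add_argument("-n", dest="n", type=int, default=20, metavar="N")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["ranking", "-m", "daily", "-c", "illust", "-d", "20200101"])
    print(args.mode, args.content, args.date, args.n)
```

Marking -m, -c and -d as required matches the usage line, where only -n appears in square brackets.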
download artworks from user IDs stored in config file; update users' artworks if directories already exist
python main.py artwork
download bookmark artworks from user IDs stored in config file; update users' artworks if directories already exist
python main.py bookmark
delete user IDs and their directories (IDs in users field + artwork directories), then download artworks for remaining IDs in config file
python main.py artwork -d 63924 408459
add user IDs (IDs in bookmarks field), then download bookmark artworks for newly added IDs + IDs in config file
python main.py bookmark -a 63924 408459 2188232
load temp.json file in data folder (only for this instance), add user IDs to that file, then download artworks from IDs in that file
python main.py artwork -f data/temp.json -a 63924 408459 2188232
clear directories for all user IDs in config file, set threads to 24, then download artworks (i.e. re-download artworks)
python main.py artwork -c all -t 24
Pixiv uses AJAX requests to generate content dynamically, so parsing plain HTML will not work
Preserve log to have better observation
sometimes the requests module will close the program with the error Remote end closed connection without response. I am not sure of the exact cause, but it is most likely due to a high number of requests sent from the same IP address in a short period of time, which makes the server close the connection
use a session to download images and allow session.get to retry in case of a ConnectionError exception using HTTPAdapter and Retry
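The retry setup described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name make_session, the retry count, the backoff factor, and the status codes are assumptions chosen for the example.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(total_retries=3, backoff_factor=0.5):
    """Return a requests.Session whose requests are retried on failure.

    All parameter values here are illustrative assumptions.
    """
    retry = Retry(
        total=total_retries,            # overall cap on retries per request
        connect=total_retries,          # retry failed connection attempts
        read=total_retries,             # retry when the server drops the connection
        backoff_factor=backoff_factor,  # sleep progressively longer between tries
        status_forcelist=(500, 502, 503, 504),
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    # Every URL with these prefixes now goes through the retrying adapter.
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


if __name__ == "__main__":
    session = make_session()
    adapter = session.get_adapter("https://www.pixiv.net")
    print(adapter.max_retries.total)
```

Mounting the adapter on the session means every session.get call inherits the retry policy, so the download code itself does not need its own retry loop.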
update mechanism
Attempt 1: download artworks from newest to oldest until an existing file is found on disk. This does not work well with the multi-threaded implementation, because handling the thread stopping condition makes the program much more complicated
Attempt 2: record the last visited artwork information for each user to check whether an update is needed. This does not work if the newest upload was deleted by the user, as the stored information cannot be found in the retrieved HTML. One solution is to record a list of all downloaded artwork information for each user, then compare it with the parsed data, but this wastes a lot of space and memory
Solution: find the file names while parsing the artwork IDs, so that they can be compared with the existing files on disk. If there is a match, the function returns the list of artworks from the newest up to the point at which the match was found
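The comparison step in the solution can be sketched as below. This is only an illustration under stated assumptions: the function name artworks_to_update, the (artwork ID, file names) pair layout, and the sample file names are hypothetical and not taken from the project.

```python
def artworks_to_update(artworks, existing_files):
    """Return the artwork IDs that are newer than anything already on disk.

    artworks: list of (artwork_id, file_names) pairs ordered newest first
              (hypothetical shape for the parsed data).
    existing_files: set of file names found in the user's directory.
    """
    pending = []
    for artwork_id, file_names in artworks:
        # Stop at the first artwork that already has a file on disk:
        # everything older is assumed to have been downloaded earlier.
        if any(name in existing_files for name in file_names):
            break
        pending.append(artwork_id)
    return pending


if __name__ == "__main__":
    parsed = [
        (103, ["103_p0.png"]),
        (102, ["102_p0.jpg", "102_p1.jpg"]),
        (101, ["101_p0.png"]),
    ]
    on_disk = {"102_p0.jpg", "102_p1.jpg", "101_p0.png"}
    print(artworks_to_update(parsed, on_disk))
```

Because the check relies on files that are actually on disk rather than on a stored "last visited" record, an artwork deleted on the site simply never appears in the parsed list and does not affect where the scan stops. The list is computed before any download starts, so worker threads need no stopping condition.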
folder name inconsistency. I originally planned to use user names as the subdirectory names, but there are two problems with this approach: (1) some names are invalid (e.g. they contain special characters), and (2) if users change their names on Pixiv, the program will re-download all of their content and leave two directories pointing to the same user
login verification. The program runs fine most of the time, but on a few occasions it has failed to run due to an authentication error. This is caused by reCAPTCHA v3 verification, and I have yet to figure out a way to bypass it