sevensins / Wallbase-Downloader

Shell script to leech wallpapers from wallbase.cc
http://www.4geeksfromnet.com/2011/01/wallbase-download-wallpapers-the-easy-way.html

Question: Download ALL the wallpapers from wallbase #13

Open dominique120 opened 10 years ago

dominique120 commented 10 years ago

Since wallbase is on its deathbed, I want to download all the wallpapers. How can I do that with this script? Could I set WP_RANGE_START and WP_RANGE_STOP from 1 to 99999999 or something? (I'm assuming the script will skip pages that return 404/403 errors.)

Thanks,

@sevensins @numn @macearl

dominique120 commented 10 years ago

Also, could wget run in parallel, say 10 instances at a time, so that it does not have to download one image at a time?

I could do this, or execute the script 10 times and play with the ranges I mentioned above.

macearl commented 10 years ago

The script does skip images that are not found, and it should work with the WP range. Running it multiple times works if you run each instance in a different directory (see the sketch below); if you run them all in the same directory, the temp files would overwrite each other or pile up as tmp.1 and so on.

But I'm not sure you really want to download all the wallpapers, there are over 3 million ;)
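For illustration, a minimal sketch of launching several copies in separate directories, each with its own 50k-ID slice. The script name wallbase.sh and the idea of passing WP_RANGE_START/WP_RANGE_STOP through the environment are assumptions for the sketch; the script itself hardcodes those values.

#!/bin/bash
# Hypothetical launcher: 10 copies of the downloader, each in its own
# directory so the temp files never collide, each with a 50k-ID slice.
for i in $(seq 0 9); do
    dir="run_$i"
    mkdir -p "$dir"
    ( cd "$dir" && \
      WP_RANGE_START=$(( i * 50000 + 1 )) \
      WP_RANGE_STOP=$(( (i + 1) * 50000 )) \
      ../wallbase.sh ) &   # assumes the script reads the range from the environment
done
wait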

dominique120 commented 10 years ago

@macearl Yeah, I optimized the script a bit (brought it down to less than 200 lines) and I'm running it with GNU parallel. I split the 3 million image range into 60 scripts (each script downloads 50k images) and they are all executing at once.

There are extra optimizations that could be done: drop wget in favour of a faster downloader (like axel or aria2); optimize the URL-finding section (I would like to have it reversed so I can run another 60 scripts going backwards); and use parallel to run wget several times at once so downloads go quicker without rerunning the entire URL-generating section (see the sketch below).
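As a rough sketch of that last point (not part of the script above), the extracted image URLs could be handed to several wget processes at once; urls.txt is a hypothetical file with one full-size image URL per line:

# run up to 10 wget processes at a time, one URL each
xargs -P 10 -n 1 \
    wget --tries=2 --load-cookies=cookies.txt --referer=http://wallbase.cc/ \
    < urls.txt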

Also, a shared cache file (the downloaded.txt file) would aid this substantially if we are going to use parallel and hand the fetching off to wget, although the check-and-append on that file would then need to be serialized; a sketch follows.
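This is only a suggestion for how that could look, assuming flock(1) is available; the posted script does not do this:

LIST=/home/wallbase/download_list

claim_id() {
    # Atomically check the shared list and record the ID, so two parallel
    # jobs cannot both decide the same wallpaper is new.
    local id="$1"
    (
        flock -x 200
        grep -qx "$id" "$LIST" && exit 1   # already claimed by another job
        echo "$id" >> "$LIST"              # record the claim before downloading
    ) 200>"$LIST.lock"
}

# usage inside the download loop:
# claim_id "$count" && wget ... http://wallbase.cc/wallpaper/$count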

macearl commented 10 years ago

Well, it certainly sounds interesting; I never used GNU parallel or any other download managers.

I would like to take a look at your version, if you don't mind sharing it ;)

dominique120 commented 10 years ago

Sure thing, I'll write up a little text explaining what I did also.

dominique120 commented 10 years ago

@macearl Here is the script:

#!/bin/bash

# See Section 13
# Enter your Username
USER="user"
# Enter your password
PASS="pass"

PURITY=111
# For accepted values of topic see Section 4
TOPIC=23
# For download location see Section 7
LOCATION=/home/wallbase/dl/10
# For Types see Section 9
TYPE=1
# See Section 15
CATEGORIZE=1
# See Section 16
WP_RANGE_START=1800001
WP_RANGE_STOP=1850000

# if wished categorize the downloads
# by their PURITY(nsfw,sfw,sketchy) 
# and TOPIC (manga, hd, general)
if [ $CATEGORIZE -gt 0 ]; then
    LOCATION="$LOCATION/$PURITY/$TOPIC"
fi

if [ ! -d "$LOCATION" ]; then
    mkdir -p "$LOCATION"
fi

cd "$LOCATION"

login() {
    # checking parameters -> if not ok print error and exit script
    if [ $# -lt 2 ] || [ -z "$1" ] || [ -z "$2" ]; then
        echo "Please check the needed Options for NSFW/New Content (username and password)"
        echo ""
        echo "For further Information see Section 13"
        echo ""
        echo "Press any key to exit"
        read
        exit
    fi
    # fetch the login page (saved as "login") to get a session cookie and the hidden form fields
    nice -n -20 wget --keep-session-cookies --save-cookies=cookies.txt --referer=http://wallbase.cc/home http://wallbase.cc/user/login
    # pull the csrf token and ref value out of the hidden inputs (the sed calls strip the fixed-length markup around the values)
    csrf="$(cat login | grep 'name="csrf"' | sed  's .\{44\}  ' | sed 's/.\{2\}$//')"
    ref="$(rawurlencode $(cat login | grep 'name="ref"' | sed  's .\{43\}  ' | sed 's/.\{2\}$//'))"
    # post the login form; the resulting session cookie in cookies.txt is reused by every later request
    nice -n -20 wget --load-cookies=cookies.txt --keep-session-cookies --save-cookies=cookies.txt --referer=http://wallbase.cc/user/login --post-data="csrf=$csrf&ref=$ref&username=$USER&password=$PASS" http://wallbase.cc/user/do_login
} 

rawurlencode() {
    local string="${1}"
    local strlen=${#string}
    local encoded=""

    for (( pos=0 ; pos<strlen ; pos++ )); do
        c=${string:$pos:1}
        case "$c" in
            [-_.~a-zA-Z0-9] ) o="${c}" ;;
            * )               printf -v o '%%%02x' "'$c"
        esac
            encoded+="${o}"
    done
    echo "${encoded}"
} 

# login only when it is required ( for example to download favourites or nsfw content... )
if [ $PURITY == 001 ] || [ $PURITY == 011 ] || [ $PURITY == 111 ] || [ $TYPE == 5 ] || [ $TYPE == 7 ] ; then
   login $USER $PASS
fi

if [ $WP_RANGE_STOP -gt 0 ]; then
    # walk the configured wallpaper ID range
    for (( count="$WP_RANGE_START"; count<="$WP_RANGE_STOP"; count++ ));
    do
        # -x matches the whole line, so ID 123 does not also match 1234
        if grep -qx "$count" /home/wallbase/download_list
            then
                echo "File already downloaded!"
            else
                echo $count >> /home/wallbase/download_list
                # fetch the wallpaper page (saved as a file named after the ID)...
                nice -n -20 wget --no-dns-cache -4 --tries=2 --keep-session-cookies --load-cookies=cookies.txt --referer=wallbase.cc http://wallbase.cc/wallpaper/$count
                # ...then extract the full-size image URL from it and download the image
                egrep -o "http://wallpapers.*(png|jpg|gif)" $count | nice -n -20 wget --no-dns-cache -4 --tries=2 --keep-session-cookies --load-cookies=cookies.txt --referer=http://wallbase.cc/wallpaper/$count -i -
        fi
    done
else
    echo "WP_RANGE_STOP is not set, please check the variable"
fi

rm -f cookies.txt login do_login

This script is the 10th (out of 60) in the forward series. There are another 60 scripts that count backwards, using this line in the for loop:

for (( count="$WP_RANGE_STOP"; count >= "$WP_RANGE_START"; count=count-1 ));

All the 120 scripts have blocks of 50k wallpapers to download.

Also, I took a little long to post this because I wanted to slim it down even more and to work on the shared downloaded-files list. As you can see, all 120 scripts use the same file to check for already downloaded files; I like this because I have two sets going against each other and I don't want to download duplicates.

To execute this I used GNU Parallel working in non-GNU mode (no --gnu flag):

nice -n -20 parallel -j 1000 < list
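
Here list is just a file of commands for parallel to run, one per line; roughly like this (the paths are only illustrative):

/home/wallbase/scripts/forward_01.sh
/home/wallbase/scripts/forward_02.sh
...
/home/wallbase/scripts/reverse_60.sh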

You may have noticed that I nice everything to -20; this is because I noticed that after a few hours things would slow down. Nicing helps, but does not solve the problem.

Another issue I faced was counting the amount of downloaded wallpapers. To get a rough estimate I created this script (it's suited to my needs, so it might need some changes to use elsewhere):

normal=$(find /home/wallbase/dl -type f | wc -l)
reverse=$(find /home/wallbase/dlR -type f | wc -l)
num=`expr $reverse + $normal`
printf "$num\n"

All scripts have permissions set to 755 (with chmod 755 *.sh) to allow execution by parallel.

I also had to change IPs (I'm on a dedicated server and it has some failover IPs) because I noticed timeouts, dropped connections, and other problems.

Every 24-48 hours I restart the VM because it maxes out its cache memory and slows down even more. Once I restart, the process takes a few minutes to check all the already downloaded wallpapers, and it maxes out the CPU :P


So far, I've queried around 400k wallpapers and downloaded some 150k (most of the rest return 403 or 404; the 403s are marked for deletion), which occupy around 70 GB.

If you have any questions, let me know. If you want to post the counting script and my slimmer version, go ahead.

dominique120 commented 10 years ago

Also, I'm not sure about this, but I believe it would be best to echo $count to the list after downloading the file, because sometimes the download fails but the number has already been added to the list.

We could do this with an if check on the exit code of wget. Here is an example (I use this with git, but maybe we can do something similar):

    git clone https://github.com/PrinterLUA/FORGOTTENSERVER-ORTS #--quiet
    success=$?

    if [[ $success -eq 0 ]]; then
        echo "clone succeeded"
    fi
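
Applied to the download loop it might look roughly like this; a sketch only, with the ID recorded after wget exits with status 0 rather than before the download:

    nice -n -20 wget --no-dns-cache -4 --tries=2 --keep-session-cookies --load-cookies=cookies.txt --referer=wallbase.cc http://wallbase.cc/wallpaper/$count
    if [[ $? -eq 0 ]]; then
        # only record the ID once the page download actually succeeded
        echo $count >> /home/wallbase/download_list
    else
        echo "download of $count failed, it will be retried on a later pass"
    fi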
dominique120 commented 10 years ago

I updated the counting script to do a few more things.

normal=$(find /home/wallbase/dl -type f | wc -l)
reverse=$(find /home/wallbase/dlR -type f | wc -l)
queries=$(wc -l /home/wallbase/download_list | awk '{print $1}')
space=$(du -sh /home/wallbase/ | awk '{print $1}')

num=`expr $reverse + $normal`

printf "Wallpapers so far(not exact): $num\n"
printf "Queries so far: $queries\n"
printf "Used space: $space\n"