mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0

How does Gallery-dl work? #5764

Closed: AFGreeneye closed this issue 1 week ago

AFGreeneye commented 1 week ago

Hi! I'm not sure if this is the right place to ask my question, but I really want to understand how Gallery-dl works. I'm currently studying frontend development, and I'm quite proficient with JavaScript/TypeScript (though I'm still a newbie!). Lately, scraping data from the internet has become my new hobby. Today, I came across the website 'nsfwalbum.com'. I tried to download all albums of a model, but Gallery-dl only accepts the URL of a single album.

I could create a list of albums I want to download and put them in a '.bat' file to run it, but that wouldn't be convenient and would take a lot of time. So, I thought about making my own picture downloader app just for fun. Since 'nsfwalbum.com' uses Dynamic Content Loading, I had to use Puppeteer (a Headless Browser) in my project, which works fine!

However, when I compare my project with Gallery-dl, mine could never compete in terms of speed and efficiency. Unfortunately, I'm new to Python; I looked at the 'nsfwalbum' extractor code but barely understand the syntax. I really want to know how Gallery-dl downloads pictures so quickly. Does it use a Headless Browser or something similar? I wish someone could explain it to me.

mikf commented 1 week ago

There is no Headless Browser, only reverse engineering (if one can call it that): working out how a site works by inspecting its traffic with a network monitor and, if necessary, its HTML and JS source. gallery-dl then replicates the needed HTTP requests and extracts data from the responses to collect and build download URLs.

For nsfwalbum search results, you'd need to request https://nsfwalbum.com/backend.php?queryString=search=QUERY&prev_items=4&p=PAGE and collect the returned album IDs (href="/album/12345"), it seems.
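
As a rough sketch of that approach in TypeScript: the helper names `buildSearchUrl` and `extractAlbumIds` are made up for illustration, and percent-encoding the query is an assumption. Only the URL shape and the `href="/album/12345"` pattern come from the comment above.

```typescript
// Hypothetical helpers, not part of gallery-dl.
const SEARCH_ENDPOINT = "https://nsfwalbum.com/backend.php";

// Build the search URL for one result page.
// Assumption: the query should be percent-encoded; prev_items=4 is copied
// verbatim from the example URL, its meaning is not known here.
function buildSearchUrl(query: string, page: number): string {
  const encodedQuery = encodeURIComponent(query);
  return `${SEARCH_ENDPOINT}?queryString=search=${encodedQuery}&prev_items=4&p=${page}`;
}

// Collect the numeric album IDs from href="/album/<id>" links,
// keeping first-seen order and dropping duplicates.
function extractAlbumIds(html: string): string[] {
  const pattern = /href="\/album\/(\d+)"/g;
  const ids: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    if (ids.indexOf(match[1]) === -1) {
      ids.push(match[1]);
    }
  }
  return ids;
}
```

Fetching each page with an HTTP client such as Axios and passing the response body to `extractAlbumIds` would then give the list of albums. How the site signals the last page is not covered by this sketch.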

AFGreeneye commented 1 week ago

My project works fine; the only problem I have is with something called 'spirit', which is needed to save the JPG file:

https://nsfwalbum.com/backend.php?&spirit=g6z27zb4zc6zb7z%605ze5z5dz&photo=85443691

AFGreeneye commented 1 week ago

The variable called 'spirit' in the JavaScript code of the page holds the spirit's value. The problem is that I tried everything to extract the value with 'Axios' and 'Cheerio', but it does not work! The funny thing is that you can easily see the spirit's value by using 'console.log(spirit)' in the browser.

mikf commented 1 week ago

var spirit = encodeURIComponent(giraffe.annihilate("1e|4c|4e|0e|2e|2e|6a|7a|", 6));

from https://nsfwalbum.com/iframe_image.php?id=12345

var giraffe={annihilate:function(r,a){var n="";r.toString();for(var t=0;t<r.length;t++){var e=r.charCodeAt(t)^a;n+=String.fromCharCode(e)}return n}}

from https://nsfwalbum.com/js/my.js

Translating this to Python gets you https://github.com/mikf/gallery-dl/blob/f58b0e6fc7972e1432fa7032afddfb108802a8a1/gallery_dl/extractor/nsfwalbum.py#L52-L54 and https://github.com/mikf/gallery-dl/blob/f58b0e6fc7972e1432fa7032afddfb108802a8a1/gallery_dl/extractor/nsfwalbum.py#L79-L83
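
For a TypeScript reader, the same decoding can be sketched as below. The `annihilate` function is a direct port of the `giraffe.annihilate` snippet quoted above. `buildImageRequestUrl` is a hypothetical helper that only mirrors the shape of the `backend.php?&spirit=...&photo=...` URL posted earlier in this thread; what that endpoint returns is not shown here.

```typescript
// XOR each character code with `key`, like giraffe.annihilate in my.js.
function annihilate(value: string, key: number): string {
  let result = "";
  for (let i = 0; i < value.length; i++) {
    result += String.fromCharCode(value.charCodeAt(i) ^ key);
  }
  return result;
}

// Hypothetical helper: decode the page's string, URL-encode it once,
// and place it in a URL shaped like the one quoted earlier in the thread.
function buildImageRequestUrl(encoded: string, key: number, photoId: string): string {
  const spirit = encodeURIComponent(annihilate(encoded, key));
  return `https://nsfwalbum.com/backend.php?&spirit=${spirit}&photo=${photoId}`;
}

// Decoding the example string from the page script above:
console.log(annihilate("1e|4c|4e|0e|2e|2e|6a|7a|", 6)); // 7cz2ez2cz6cz4cz4cz0gz1gz
```

Because XOR with the same key is its own inverse, applying `annihilate` twice returns the original string, which makes for a quick sanity check of a port.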

AFGreeneye commented 1 week ago

Thank you so much!

import axios from 'axios';

// XOR every character code with `base`; mirrors giraffe.annihilate from my.js.
// The page script passes 6 as the second argument, hence the default.
function _annihilate(value: string, base: number = 6): string {
    let result = '';
    for (let i = 0; i < value.length; i++) {
        result += String.fromCharCode(value.charCodeAt(i) ^ base);
    }
    return result;
}

async function fetchSpiritValue(): Promise<void> {
    const url = 'https://nsfwalbum.com/photo/85440023';
    const marker = 'giraffe.annihilate("';

    try {
        // Make GET request to the URL
        const response = await axios.get<string>(url);
        const html = response.data;

        // Locate the string between 'giraffe.annihilate("' and the closing '"'
        const markerIndex = html.indexOf(marker);
        if (markerIndex === -1) {
            throw new Error('giraffe.annihilate call not found in the page');
        }

        // Start after the opening quote: including it would decode to a
        // stray leading '$' (%24), since '"' XOR 6 is '$'
        const startIndex = markerIndex + marker.length;
        const endIndex = html.indexOf('"', startIndex);
        if (endIndex === -1) {
            throw new Error('closing quote of the giraffe.annihilate argument not found');
        }
        const extractedString = html.substring(startIndex, endIndex);

        // Decode, then URL-encode exactly once. encodeURIComponent already
        // turns a backtick into %60; replacing it by hand first would
        // double-encode it to %2560.
        const spirit = encodeURIComponent(_annihilate(extractedString));

        console.log('Spirit value:', spirit);
    } catch (error: unknown) {
        const message = error instanceof Error ? error.message : String(error);
        console.error('Error fetching spirit value:', message);
    }
}

fetchSpiritValue();