mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0

Option to preserve full HTML pages of submissions #1505

Open Athari opened 3 years ago

Athari commented 3 years ago

Is there an option to preserve full HTML pages?

While it's usually unnecessary, in some cases comments on submission pages can contain important information. In case of random gallery nukes, comments can help find the artist. Sometimes comments contain links to alternative versions/dumps, alt artist names and other important info. When dumping my own gallery, I'd like to have a full backup in case admins of a website decide to nuke my gallery.

I assume a lot (most?) of scrapers download HTML pages, but I don't see an option to save them.

rautamiekka commented 3 years ago

gallery-dl is API-first: even for DeviantArt's scraps folders, where cookies are required, everything is done through the official API.

So for the most part gallery-dl doesn't download web pages at all. Most APIs do return comments as part of their JSON (or whatever format the site uses), sometimes only when a config option asks for them, so you'll want the --dump-json option and maybe --write-pages. Remember to redirect gallery-dl's standard output (stdout) to a file, because you're bound to get a large, if not massive, amount of text.

For example:

gallery-dl --quiet --dump-json "link" > jsonfile.json
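
Once the output is in jsonfile.json, a short script can show whether the site's API returned anything comment-related. This is a minimal sketch, not part of gallery-dl: it assumes the dump is a single JSON array whose entries are lists ending in a metadata dict, and the helper names `load_metadata` and `find_keys` are made up for illustration.

```python
import json
from pathlib import Path


def load_metadata(dump_path):
    """Return every metadata dict found in a --dump-json output file."""
    entries = json.loads(Path(dump_path).read_text(encoding="utf-8"))
    records = []
    for entry in entries:
        # Assumption: each entry is a list shaped like [type, ..., metadata_dict].
        if isinstance(entry, list) and entry and isinstance(entry[-1], dict):
            records.append(entry[-1])
    return records


def find_keys(records, needle):
    """List the distinct top-level keys whose name contains `needle`."""
    return sorted({key for record in records for key in record if needle in key})
```

Something like `find_keys(load_metadata("jsonfile.json"), "comment")` would then list the matching field names, if the site provides any.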

TestPolygon commented 3 years ago

You can generate the HTML yourself for each file and fill it with data from the JSON metadata, using the postprocessors option (use -K to list the available keys).

A quickly written example for DeviantArt (DA):

"postprocessors": [{
    "name": "metadata",
    "mode": "custom",
    "format": "{description}\n\n{tags}\n\n"
}, {
    "directory": "htmls",
    "extension": "html",
    "name": "metadata",
    "mode": "custom",
    "format": "<h1 style='display: inline'><a href='{url}'>{title}</a></h1> by <a href='https://www.deviantart.com/{username}'>{author[username]}</a><div><br></div><div>{description}</div><br><div><hr>[\"{tags:J\", \"}\"]<hr></div><div>{date:%Y.%m.%d}</div><br>\n\n"
}]

For this: https://www.deviantart.com/roblfc1892/art/skogafoss-874326148

The result (formatted) is:

<h1 style='display: inline'>
    <a href='https://www.deviantart.com/roblfc1892/art/skogafoss-874326148'>skogafoss</a>
</h1> by <a href='https://www.deviantart.com/roblfc1892'>roblfc1892</a>
<div><br></div>
<div>
    <!-- HTML from JSON -->
    <span>skogafoss during the night....</span>
</div>
<br>
<div>
    <hr>
    ["aurora", "borealis", "iceland", "landscape", "night", "skogafoss"]
    <hr>
</div>
<div>2021.03.25</div>
<br>
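
To see how the fields in the format string map onto that output, the same template can be approximated with plain Python `str.format`, which the format syntax closely resembles. This is only a sketch: the metadata values are copied from the example above, and since the `J` join spec is not available in plain `str.format`, the tags are joined by hand through a made-up `tag_list` field.

```python
from datetime import datetime

# Values copied from the example output above.
meta = {
    "url": "https://www.deviantart.com/roblfc1892/art/skogafoss-874326148",
    "title": "skogafoss",
    "username": "roblfc1892",
    "author": {"username": "roblfc1892"},
    "description": "<span>skogafoss during the night....</span>",
    "tags": ["aurora", "borealis", "iceland", "landscape", "night", "skogafoss"],
    "date": datetime(2021, 3, 25),
}

TEMPLATE = (
    "<h1 style='display: inline'><a href='{url}'>{title}</a></h1> by "
    "<a href='https://www.deviantart.com/{username}'>{author[username]}</a>"
    "<div><br></div><div>{description}</div><br>"
    "<div><hr>[{tag_list}]<hr></div><div>{date:%Y.%m.%d}</div><br>\n\n"
)


def render(meta):
    """Fill TEMPLATE with one metadata dict and return the HTML snippet."""
    # Plain str.format has no join spec, so quote and join the tags manually
    # and pass the result in as the extra (hypothetical) tag_list field.
    tag_list = ", ".join('"{}"'.format(tag) for tag in meta["tags"])
    return TEMPLATE.format(tag_list=tag_list, **meta)


page = render(meta)
```

Note that `{description}` is inserted as raw HTML here, just as in the config above, so whatever markup the site returns ends up in the page unescaped.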

It looks like this: [image]


Try this config with https://www.deviantart.com/nimiszu/art/What-We-Remember-868048585.


Killer feature

You can also concatenate all these HTML files with one simple bash command: cat * > bundle.html

The result:

[image]

Note: I put different authors in one folder just to make a better example.
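
For anyone without a bash shell, the concatenation step could be sketched in Python along these lines. It is an illustration only: `bundle_html` is a made-up helper, and unlike `cat *` it only picks up `.html` files and skips the bundle itself, so it can be re-run in the same folder.

```python
from pathlib import Path


def bundle_html(directory, output_name="bundle.html"):
    """Concatenate every .html file in `directory` into one bundle file."""
    folder = Path(directory)
    output = folder / output_name
    # Sort for a stable order and skip the bundle itself on repeated runs.
    parts = sorted(p for p in folder.glob("*.html") if p.name != output_name)
    with output.open("w", encoding="utf-8") as bundle:
        for part in parts:
            bundle.write(part.read_text(encoding="utf-8"))
            bundle.write("\n")
    return output
```

Calling `bundle_html("htmls")` would write `htmls/bundle.html`, assuming the generated pages live in the `htmls` directory from the config above.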

bluerthanever commented 3 years ago

Hey. I recently noticed that you can customize the metadata output, and I think that's amazing. But I'm wondering: is there a way to create HTML pages with local file paths? Sometimes a post has both images and text, and I'd like to view the content in the order it's presented in the browser (roughly, not exactly, of course). Does that mean the extractor has to keep track of every downloaded file's local path, so that keywords could be used to replace the links in the HTML with local paths?

Athari commented 3 years ago

@bluerthanever You can use the same file name template in the generated HTML as the one you're using for naming files. So if you're using "filename": "{submission_id}_{filename}.{extension}", just put the same thing into the format of your HTML metadata postprocessor.

This may be problematic if a submission can contain any number of any files of any type, as just one <img> won't be enough. You may get away with conditional formatting in some cases, but in general, it may be much easier to just write a separate script which loads all JSONs and generates pretty HTMLs. JSON files being easy to work with in any programming language is the reason they're used, after all.
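
A separate script of that kind might look roughly like the sketch below. It rests on assumptions that should be checked against your own setup: that each downloaded file has a JSON metadata file next to it named after the file with `.json` appended (for example `image.jpg.json` for `image.jpg`), and that the metadata has a `title` key. The `build_page` helper and `IMAGE_SUFFIXES` set are made up for illustration.

```python
import html
import json
from pathlib import Path

# Hypothetical list of extensions to embed as images; everything else is linked.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


def build_page(directory):
    """Build one HTML page from the *.json metadata files in `directory`."""
    folder = Path(directory)
    sections = []
    for meta_path in sorted(folder.glob("*.json")):
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        # Assumption: "image.jpg.json" describes "image.jpg" in the same folder.
        media = meta_path.with_suffix("")
        title = html.escape(str(meta.get("title", media.name)))
        name = html.escape(media.name)
        if media.suffix.lower() in IMAGE_SUFFIXES:
            body = "<img src='{}' alt='{}'>".format(name, title)
        else:
            body = "<a href='{}'>{}</a>".format(name, name)
        sections.append("<h2>{}</h2>\n{}".format(title, body))
    joined = "\n<hr>\n".join(sections)
    return "<!DOCTYPE html>\n<html><body>\n{}\n</body></html>\n".format(joined)
```

Because the paths are relative file names, the resulting page has to be saved in the same folder as the downloads. File names with spaces or other special characters would additionally need URL-encoding, which this sketch leaves out.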

bluerthanever commented 3 years ago

@Athari

You can use the same file name template in the generated HTML as the one you're using for naming files. So if you're using "filename": "{submission_id}_{filename}.{extension}", just put the same thing into the format of your HTML metadata postprocessor.

Silly me, I should have thought of that. I'll look into this and decide whether I should write a new script. Thanks a lot!!! XD