mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0

[deviantart] [question] Questions regarding re-downloading images that have been changed #2846

Open a-washing-machine opened 2 years ago

a-washing-machine commented 2 years ago

ORIGINAL TITLE: [deviantArt] [question] Is it currently possible to "overwrite existing file ONLY IF image resolution of new file is higher"?


Right up front: This is NOT a feature request, just an inquiry if such a feature already exists.

If it does, cool, I'm gonna use it right away. If not, okay, cool too, then I'm not missing out on an opportunity to update my downloads. :)

The only reason I'm even asking is that I think I messed up at some point with my abort parameter, and now have to re-run my entire gallery download without abort parameters to make sure I didn't miss anything...

...and I vaguely recall reading something in here in passing about a feature like this existing (which would be sensible to use right now if it does), though I may have misread that at the time.

(It may have been related to something about "suddenly being able to download ordinarily inaccessible full resolution images due to some beneficial oversight in the new deviantArt API, which deviantArt unfortunately may 'fix' at some point.")

I've read about compare.action and compare.shallow (though I have not actually used them yet), for comparing if a newly downloaded file would have a different file size than the one already saved.

This seems like overkill for my needs, as I've seen deviantArt occasionally change their file compression for at least some images, which of course changes their file size regardless of any changes made by the artist or what the new/old image resolution is. (Heck, I've seen some PNGs become JPGs... or was that the other way around?) So just going by file size differences, this would download way too many files.


As for why I have to re-parse my entire gallery downloads...

Pro-tip: Do not download the sub-galleries before the main gallery, because images linked in their descriptions getting downloaded to the main gallery folder may preemptively trigger an abort when downloading the actual main gallery later. -_-

mikf commented 2 years ago

Well, no, this is not possible with gallery-dl.

There is no way to read or compare the resolution of already downloaded images. gallery-dl doesn't even have a concept of "image" in the first place; there are only byte streams that just so happen to be images most of the time.

The only way to currently compare files is the compare post processor, but that, as you already said, has many shortcomings.

(It may have been related to something about "suddenly being able to download ordinarily inaccessible full resolution images due to some beneficial oversight in the new deviantArt API, which deviantArt unfortunately may 'fix' at some point.")

See issue #293 and commit 02a247f4; this still hasn't been "fixed" on dA from what I can tell.

pxssy commented 2 years ago

I'm not sure if this is an alternative, but you could set the filename to include a modified date. Should they change the image, the modified date should change too, and you will not have an overlap.

Right now, I believe gallery-dl just checks whether a file with an identical name exists or is in the archive. If the name is different, it treats it as a completely different file, even if the hash might be identical.

You can then probably run jdupes to remove all the dupes in one go.
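For anyone who would rather not install another tool, the "remove files whose hash is identical" step could be sketched in a few lines of Python. This is not how jdupes works internally, just a minimal illustration of grouping files by content hash; the names `sha256_of` and `find_duplicates` are made up for this example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group all files below `root` by content hash.

    Returns a list of groups (lists of paths) that share identical content.
    """
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[sha256_of(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Each returned group lists files with byte-identical content, so everything except one entry per group could be deleted. Note that this only catches exact copies; a re-compressed image will have a different hash.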

rautamiekka commented 2 years ago

I'm not sure if this is an alternative, but you could set the filename to include a modified date. Should they change the image, the modified date should change too, and you will not have an overlap.

That should be done when a website supports modifying an existing upload, anyway.

GrennKren commented 2 years ago

You can probably use the exec post processor with an init or prepare event. For the command option, you could run some command that compares the image width from the metadata with that of the file already downloaded.

If the new one is larger, you remove the file that already exists.
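As a rough idea of what such an external command could do, here is a minimal Python sketch that reads the pixel dimensions straight from a PNG header and compares two files. It is an assumption-laden illustration, not part of gallery-dl: the function names are hypothetical, it only understands PNG (JPEG or GIF would need an imaging library), and it compares two files on disk rather than any metadata value.

```python
import struct

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(path):
    """Return (width, height) read from the IHDR chunk of a PNG file."""
    with open(path, "rb") as fh:
        header = fh.read(24)
    # Layout: 8-byte signature, 4-byte chunk length, b"IHDR",
    # then width and height as big-endian 32-bit integers.
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError(f"{path} does not look like a PNG file")
    width, height = struct.unpack(">II", header[16:24])
    return width, height


def is_higher_resolution(new_path, old_path):
    """True if the new PNG has more pixels than the old one."""
    new_w, new_h = png_dimensions(new_path)
    old_w, old_h = png_dimensions(old_path)
    return new_w * new_h > old_w * old_h
```

A wrapper script around `is_higher_resolution` could then decide which of the two files to keep.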

a-washing-machine commented 2 years ago

I'm not sure if this is an alternative, but you could set the filename to include a modified date. Should they change the image, the modified date should change too, and you will not have an overlap.

Thanks for the suggestion, but that would require me not just re-parsing everything - but actually re-downloading everything from scratch. I... don't have enough space to do that for what by now would be the third time. XD

I'm starting to run a bit low on hard drive space actually. Not sure how much space downloading even only the "file size different" images will take up, that part's a bit concerning. O_O

Well, no, this is not possible with gallery-dl.

Okay! Now I know. :)

Come to think of it, there might be another way I could still do this in post with tools I've already got... I need to do some tests with compare.action / compare.equal first, though.

I tried this, but it didn't produce any re-downloads when I deliberately edited the files to trigger a "file size difference" check:

gallery-dl_1.22.4.exe --config enumerate_filesize_difference_test.conf https://www.deviantart.com/ARTIST_NAME/gallery/?catpath=scraps

My config:

    "deviantart":
    {
        "cookies": "cookies.txt",
        "refresh-token": "-------REMOVED-------",
        "client-id": "-------REMOVED-------",
        "client-secret": "-------REMOVED-------",
        "extra": true,
        "metadata": true,
        "blacklist": "foobar",
        "auto-watch": "true",
        "auto-unwatch": "true",

        "postprocessors": [{
            "name": "metadata",
            "mode": "custom",
            "format": "{description}\n",

            "compare.action": "enumerate",
            "compare.shallow": "true"

        }]
    }

I'm assuming there is some obvious error I've made. Neither docs/gallery-dl-example.conf nor docs/gallery-dl.conf contain an example of this. Wasn't there a "full config example" somewhere?

Also, if I understood the documentation right, it'd produce something like this if a file size difference is detected:

deviantart_123456789_Artwork Title.png deviantart_123456789_Artwork Title.1.png

Would it be possible to specify in the config to do this instead?:

deviantart_123456789_Artwork Title.png deviantart_123456789_Artwork Title.MY_CUSTOM_TEXT.1.png

...but only for the "file size is different" case? A consistent string in the file name would be useful for extracting the "slightly different" files for my analysis tool. Unfortunately, ".1." also occurs in artwork titles, so it isn't ideal. :(

Not impossible to work around, just sub-optimal, so if that doesn't work that's okay.

a-washing-machine commented 2 years ago

@GrennKren

You can probably use the exec post processor with an init or prepare event. For the command option, you could run some command that compares the image width from the metadata with that of the file already downloaded.

If the new one is larger, you remove the file that already exists.

Oh, I didn't see this comment, my browser must've loaded a cached version of the page.

"exec - Execute external commands"

Ah. I'm assuming those would need to be in python. Which I unfortunately lack any knowledge in. ^_^# Java and C#, yes. Python? Nope. No time to learn that "at the moment", either. :(

compare metadata of image width with that already downloaded

That thought actually occurred to me too, but then I realized that the image downloaded may not always have the dimensions specified there. The actual behavior would need to be thoroughly tested of course if I went down this path, though I think I've already got an easier-to-execute idea using tools I had already built earlier. ^_^#

a-washing-machine commented 2 years ago

@rautamiekka

That should be done when a website supports modifying an existing upload, anyway.

Well, if deviantart had some way to keep track when a file was last updated by the artist (i.e.: deliberately changed, not some change in file compression), and if that date is different from the upload date (or from the date it was downloaded to the system), then yeah, this might be a nice-to-have. (I myself have once or twice updated flash artwork slideshows to include more content, and I've seen a guy repeatedly upload in-progress art and update it from sketch to full-color image, though that's a bit of an outlier.)

Otherwise, if it was just "file size is different, Imma download new copy"...

well, I'm gonna tell you afterwards how that went. ^_^#

a-washing-machine commented 2 years ago

Okay, still struggling to get "compare.action" to download any "file size is different" duplicates.

Now I've tried this in my config instead, still doesn't work:

    "deviantart":
    {
        "cookies": "cookies.txt",
        "refresh-token": "---------REMOVED--------------",
        "client-id": "---------REMOVED--------------",
        "client-secret": "---------REMOVED--------------",
        "extra": true,
        "metadata": true,
        "blacklist": "foobar",
        "auto-watch": "true",
        "auto-unwatch": "true",

        "postprocessors": [
        {
            "name": "metadata",
            "mode": "custom",
            "format": "{description}\n"
        },
        {
            "name": "compare",
            "action": "enumerate",
            "shallow": "true"
        }]
    }

gallery-dl_1.22.4.exe --config enumerate_filesize_difference_test.conf https://www.deviantart.com/ARTIST_NAME/gallery/?catpath=scraps

I don't see any examples using multiple post-processors in the provided example-config files.

I'm clearly making some blatantly obvious mistake here. What am I doing wrong?

a-washing-machine commented 2 years ago

@mikf Sorry to poke you, I'd falsely assumed thread-participants would get notified of new comments by default. Somehow never was an issue before. Huh. Ó_ò

In summary, I thought of a way to repurpose analysis tools I had already built myself to help me sort out the higher resolution images using the "download file if file size is different"-option gallery-dl provides, then just delete the unwanted files myself afterwards. :)

You can skip the earlier comments, this is the important bit:

...So, I eventually figured out that you need to add "skip": false for it to check for divergent file sizes. It was in the documentation, just not where I was looking! Ah. ^_^#

    "deviantart":
    {
        "cookies": "cookies.txt",
        "refresh-token": "-------REMOVED---------",
        "client-id": "-------REMOVED---------",
        "client-secret": "-------REMOVED---------",
        "extra": true,
        "metadata": true,
        "blacklist": "foobar",
        "auto-watch": "true",
        "auto-unwatch": "true",

        "skip": false,

        "postprocessors": [
        {
            "name": "metadata",
            "mode": "custom",
            "format": "{description}\n"
        },
        {
            "name": "compare",
            "action": "enumerate",
            "shallow": "true"
        }]
    }

Uhm. Ah. So it has to download the whole file first just to compare if its size matches what's on-disk, before discarding identical files?

That seems... a tad inefficient. I mean... it's gonna take several months to update my whole download directory this way. ^_^##

Is there a better way to do this? :-/


As a side-query, I understand that naming convention for "duplicate downloads" is this:

deviantart_123456789_Artwork Title.png deviantart_123456789_Artwork Title.1.png

Would it be possible to specify in the config to do this instead?:

deviantart_123456789_Artwork Title.png deviantart_123456789_Artwork Title.MY_CUSTOM_TEXT.1.png

...but only for the "file size is different" case? A consistent string in the file name would be useful for extracting the "slightly different" files for my analysis tools. Unfortunately, ".1." also occurs in artwork titles, so it isn't ideal.

Not a big deal, just a minor inconvenience to work around, just asking if there's an option in the config for that already. If not, that's okay.
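One possible workaround for the ".1." false positives, sketched in Python under the assumption that enumerated copies are named exactly as in the example above (`NAME.N.EXT` next to `NAME.EXT`): only treat a file as an enumerated copy if the un-numbered original sits in the same directory. The function name `find_enumerated_copies` is made up for this illustration.

```python
import re
from pathlib import Path

# Matches "anything.<number>.<extension>", e.g. "Artwork Title.1.png".
ENUMERATED = re.compile(r"^(?P<stem>.+)\.(?P<num>\d+)\.(?P<ext>[^.]+)$")


def find_enumerated_copies(directory):
    """Return (original, copy) pairs for files that look like enumerated duplicates.

    A file counts as a copy only if the same name without the ".N" part
    also exists, which filters out titles that merely contain ".1.".
    """
    pairs = []
    for path in sorted(Path(directory).iterdir()):
        match = ENUMERATED.match(path.name)
        if not match:
            continue
        original = path.with_name(f"{match['stem']}.{match['ext']}")
        if original.is_file():
            pairs.append((original, path))
    return pairs
```

A title such as "Version 1.1.png" is skipped here unless a file named "Version 1.png" happens to exist alongside it, so a few false positives are still conceivable.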

mikf commented 2 years ago

Sorry for the late reply. I do get notifications for everything in this repo, but I've been ... let's call it "busy".

Uhm. Ah. So it has to download the whole file first just to compare if its size matches what's on-disk, before discarding identical files?

Well yes, unfortunately that is necessary with compare. It would be better to just use the Content-Length header, but there is currently no interaction possible between downloader and post processor.
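To illustrate what a Content-Length based check would look like outside of gallery-dl, here is a minimal standalone sketch using only the Python standard library. It assumes the server answers HEAD requests and reports an accurate Content-Length for the URL in question, which may not hold for deviantart download links; `remote_size` and `needs_download` are hypothetical helper names.

```python
import os
import urllib.request


def remote_size(url, timeout=30):
    """Return the Content-Length reported for `url`, or None if absent.

    Sends a HEAD request, so no file content is transferred.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        value = response.headers.get("Content-Length")
    return int(value) if value is not None else None


def needs_download(local_path, remote_length):
    """True if the local file is missing or its size differs from the remote one.

    An unknown remote size (None) is treated as "download to be safe".
    """
    if remote_length is None or not os.path.exists(local_path):
        return True
    return os.path.getsize(local_path) != remote_length
```

Something like `needs_download(path, remote_size(url))` would then decide per file whether a full download is worth it.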

Would it be possible to specify in the config to do this instead?:

Not at the moment. This is hardcoded as a number, but it shouldn't be too hard to allow for a custom value: https://github.com/mikf/gallery-dl/blob/946643c23c8b094f6475461e96a6fbc887a8c366/gallery_dl/postprocessor/compare.py#L54


Ah. I'm assuming those would need to be in python.

No, those can be any executable program or your own shell script, although I don't know if Windows allows for anything other than .exe files. Maybe .bat and .ps1, but I'm not sure.

Well, if deviantart had some way to keep track when a file was last updated

Other sites usually have values for created_time as well as updated_time, but not deviantart. deviantart only has a published_time timestamp, and Last-Modified HTTP header values are also not reliable.

a-washing-machine commented 2 years ago

Sorry for the late reply. I do get notifications for everything in this repo, but I've been ... let's call it "busy".

Ah! Sorry. Completely understandable, of course. Just nobody else was responding to my replies to their comments either (the pokes in those comments were edited in afterwards), figured something was wrong with the notification system or my understanding thereof.

Only goes to show your prior tendency for quick replies when the lack of such seemed odd! :)


Ah. I'm assuming those would need to be in python.

No, those can be any executable program or your own shell script

Interesting to know. :)

Out of curiosity, I'm guessing a config entry for that would look something like this?

"postprocessors": [
    {
        "name": "metadata",
        "mode": "custom",
        "format": "{description}\n"
    },
    {
        "name": "compare",
        "action": "enumerate",
        "shallow": "true"
    },
    {
        "name": "exec",
        "async"  : false,
        "command": ["C:/bla/bla/program.exe", "{parameter1}", "{parameter2}", "{parameter3}", ...],
        "event"  : "after"
    }
]

Would it be possible to specify in the config to do this instead?:

Not at the moment. This is hardcoded as a number, but it shouldn't be too hard to allow for a custom value

Yeah, I figured it was something like that. Like I said, not a big deal, I can work around that, it's just a minor nuisance to ignore the false-positives.


but there is currently no interaction possible between downloader and post processor.

That's unfortunate. :(

Just brainstorming a dumb idea here... It seems like it should be possible to modify the config so that gallery-dl, instead of saving the artwork description, saves a file for each artwork containing certain data (content-length, artwork-page url, artwork filename, artwork download location [scraps, specific subfolder, main gallery folder]). Afterwards, a program could be built (by me) to parse those output files and check whether the content-length matches the corresponding artwork file already on disk. It would then call a gallery-dl.exe instance with the parameters from the output file to download only those artworks where the size doesn't match, into the desired directory.......

but this seems super-error prone and I don't know how much work this actually would end up being, just to do it this one time.

so I guess I'll run the "skip": false for the next 5 months or so instead then. ^_^#

...I do hope this doesn't end up eating up too much hard drive space though. ^_^##