mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0

[deviantart][bug] PDF files no longer being downloaded, just their preview images #3781

Open a-washing-machine opened 1 year ago

a-washing-machine commented 1 year ago

DeviantArt must've changed something a few months ago, because gallery-dl no longer downloads PDF files, only their preview images. The last PDF download I have is from October 13th, 2022.

Example:

gallery-dl_1.25.0.exe --config myConfig.conf https://www.deviantart.com/timsplosion/art/R-R2-March-of-the-Penguins-Storyboard-Part-Two-324286646

Tested with version 1.25.0 (as well as older versions 1.23.4, 1.24.4, and 1.24.5 a few months back).

Don't know if the same applies to other non-artwork file formats, though Flash-files and *.zip were fine last I checked on v1.24.5.

My config:

{
    "extractor":
    {
        "deviantart":
        {
            "cookies": "cookies.txt",
            "refresh-token": "[REDACTED]",
            "client-id": "[REDACTED]",
            "client-secret": "[REDACTED]",
            "extra": true,
            "metadata": true,
            "blacklist": "foobar",
            "auto-watch": "true",
            "auto-unwatch": "true",

            "postprocessors": [{
                "name": "metadata",
                "mode": "custom",
                "format": "{description}\n"
            }]
        },

    ... ... ...
    }
}

mikf commented 1 year ago

This was already reported in https://github.com/mikf/gallery-dl/issues/3561. I still don't think this is fixable while still using the OAuth API.

a-washing-machine commented 1 year ago

Hmm. Trouble is, that isn't just one or two artists deliberately disabling the "download" option. It's the default behavior for PDF files.

Since gallery-dl currently can't download PDFs, is there something I could do to record all the PDF files it can't download now, so I can download them later once that becomes possible again?

I.e., "If artwork is PDF: write artwork URL to a file" (or at least do so with "artwork ID"). Then I could, in the future when this may be fixed, assemble a list of commands à la "gallery-dl_SomeFutureVersion.exe --config myConfig.conf https://www.deviantart.com/view/{PDF_file's_artwork_ID_here}" and run that. That way I don't miss anything now and don't have to fully re-parse every single gallery in my download queue later.

Any suggestions for how to accomplish that? Obviously it should only do this for artworks that are PDF files - or at least not for those that are obviously normal images.
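For the second half of that idea (turning a saved list of artwork IDs back into something gallery-dl can process), a minimal sketch could look like the following. It assumes the IDs were already collected some other way, which is the hard part, and it reuses the /view/{ID} URL pattern from the comment above. The function name and output filename are hypothetical.

```python
from pathlib import Path

# URL pattern taken from the comment above; it is an assumption that
# DeviantArt keeps resolving /view/<ID> links in the future.
VIEW_URL = "https://www.deviantart.com/view/{}"


def write_retry_list(deviation_ids, out_path):
    """Write one deviation URL per line so the file can be re-run later."""
    urls = [VIEW_URL.format(deviation_id) for deviation_id in deviation_ids]
    Path(out_path).write_text("\n".join(urls) + "\n", encoding="utf-8")
    return urls


if __name__ == "__main__":
    # 324286646 is the artwork ID from the example URL in the first post.
    write_retry_list([324286646], "retry_pdfs.txt")
```

A file like this can then be passed to gallery-dl in one go with its --input-file option instead of assembling one command per artwork.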

ClosedPort22 commented 1 year ago

> is there something I could do to take note of all PDF files it can't download now so I may perhaps download them in the future when that should become possible again?

I don't think so.

> I.e., "If artwork is PDF: write artwork URL to a file"

The problem is precisely that it's not possible to determine if a deviation is PDF without making a request to the webpage (as opposed to the API), which is very inefficient.

ClosedPort22 commented 1 year ago

> I still don't think this is fixable while still using the OAuth API.

@mikf This might be fixable by detecting the presence of /i/ in content['url'] (or the absence of token?). This way the number of requests could be greatly reduced. As far as I know, dA uses /i/ URLs for zip/rar/7z, pdf, swf and psd files.

This could also be used to improve original=image: if content['url'] contains /i/, assume it's not an image; if it doesn't, fall back to the current method.
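A rough sketch of that heuristic, as two separate checks on the URL string. This is only an illustration of the idea, not gallery-dl's actual code; the function names are made up, and whether the two signals should be combined (and how) is left open.

```python
from urllib.parse import parse_qs, urlsplit


def is_i_url(content_url):
    """True if the URL path starts with /i/, the form the comment above
    associates with zip/rar/7z, pdf, swf and psd deviations."""
    return urlsplit(content_url).path.startswith("/i/")


def has_token(content_url):
    """True if the URL carries a 'token' query parameter."""
    return "token" in parse_qs(urlsplit(content_url).query)


def needs_download_lookup(content_url):
    """Hypothetical decision: only spend an extra request on deviations
    whose content URL looks like a preview of a non-image file."""
    return is_i_url(content_url)
```

With a check like this, the extra request per deviation would only be made for /i/ URLs, and everything else could keep using the current method.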

Related: #3322

a-washing-machine commented 1 year ago

As of sometime between April 17th and May 2nd, deviantArt seems to have fixed this themselves, so pdf files get downloaded again. Tested in both 1.25.2 and 1.25.3.

I don't know if anything other than PDFs was affected. Should the issue be closed?

ecophage commented 1 year ago

I'm not sure if this is still an issue with your pdfs, but I'm trying to download some and I'm only getting the cover image. Maybe it's because the pdf is text-based? Example (warning: NSFW): https://www.deviantart.com/tyvadi/art/Mazes-Medusae-Team-Green-Side-Story-Part-2-684017850

Running gallery-dl on it only gives me the cover image.

Testing on your provided PDF downloaded the entire pdf so I'm not sure why gallery-dl's being selective.

a-washing-machine commented 1 year ago

Hmm. I tried your link and it worked for me using the new 1.26.0 release. Maybe it was fixed in that release, or it was a temporary hiccup on deviantArt's end. They did just make some API updates around ... well, okay, probably closer to September 19th - 21st or so, not September 11th when you commented, but still.

a-washing-machine commented 1 year ago

Maybe it's because it's a "mature" pdf?

Hmm, judging by your github join date I'm guessing you're new to gallery-dl?

Do you have your own config file for gallery-dl, and did you set it up to connect to your deviantArt account? (or alternatively a new deviantArt account you made specifically for gallery-dl downloading)

I think for mature content (and to greatly reduce the number of "429 Too Many Requests" errors) you need to connect your account. The setup is explained here:

https://github.com/mikf/gallery-dl/blob/master/docs/configuration.rst#api-tokens--ids

I know "register an application" sounds a bit daunting, but you don't need to upload or submit anything. Just filling out a short form and hitting "save" is enough.

Then run "gallery-dl.exe --config MY_CONFIG.conf oauth:deviantart" and it'll pop up a browser window with some text you should copy (though gallery-dl apparently also saves this automatically).

As for how to set up your config file... example config files: https://github.com/mikf/gallery-dl/blob/master/docs/gallery-dl.conf https://github.com/mikf/gallery-dl/blob/master/docs/gallery-dl-example.conf

Also my own config looks something like this, feel free to include / exclude the things you (don't) need:

{
    "extractor":
    {
        "deviantart":
        {
            "cookies": "cookies_TEXT_FILE_THAT_I_EXPORTED_WITH_A_CHROME_PLUGIN.txt",
            "refresh-token": "-----------------[REDACTED]----------------------",
            "client-id": "[REDACTED]",
            "client-secret": "-----------------[REDACTED]----------------------",           
            "extra": true,
            "metadata": true,
            "blacklist": "foobar",
            "auto-watch": "true",
            "auto-unwatch":"true",

            "postprocessors": [{
                "name": "metadata",
                "mode": "custom",
                "format": "{description}\n"
            }]
        },

        "twitter":
        {
            "username": "-----------------[REDACTED]----------------------",
            "password": "-----------------[REDACTED]----------------------!",
            "cookies": "twitter.com_cookies_TEXT_FILE_THAT_I_EXPORTED_WITH_A_CHROME_PLUGIN.txt",
            "cookies-update": true,
            "retweets": true,
            "quoted": true,
            "replies": true,
            "text-tweets": true,

            "directory": {
                "retweet_id"              : ["{category}", "{user[name]}", "Retweets", "{author[name]}"],
                "locals().get('quote_by')": ["{category}", "{user[name]}", "Quoted"  , "{author[name]}"],
                ""                        : ["{category}", "{user[name]}"]
            },

            "postprocessors": [
                 {
                    "name": "metadata",
                    "event": "post",
                    "filename": "{tweet_id}.txt",
                    "mode": "custom",
                    "content-format": "{content}",
                    "directory": "TEXT",
                    "archive": "./gallery-dl/twitterMetadataDownloadsArchive.db",
                    "skip":"true"
                 }
            ]

        },
        "tumblr":
        {
            "external": true
        }

    },

    "downloader":
    {
        "http":
        {
            "retries": 20
        }
    }
}

This also enables description downloads for deviantArt ( "postprocessors": [{ ... ) and downloading of embedded deviantArt images from descriptions and journals ( "extra": true ). For twitter it enables text downloads ( "text-tweets": true,... ) and separates retweets ( "directory": { ... / "postprocessors": [ ... ) from the account's own tweets. It also raises the retry count to 20, since my internet is spotty sometimes. :P

Ask away if you have any questions. :)

ecophage commented 1 year ago

No dice. I already have a DeviantArt config set up (yoinked it off the examples page) that works fine for most NSFW. It's only this specific format that it seems to be outputting just the covers for.

config:

    "deviantart": {
        "#": "download 'gallery' and 'scraps' images for user profile URLs",
        "include": "gallery,scraps,journal",

        "#": "use custom API credentials to avoid 429 errors",
        "cookies": "<folder path>/www.deviantart.com_cookies.txt",
        "client-id": "<the id>",
        "client-secret": "<very secret token>",
        "refresh-token": "<very secret refresh token>",

        "#": "put description texts into a separate directory",
        "metadata": true,
        "extra": true,
        "journals": "html",
        "mature": true,
        "postprocessors": [
            {
                "name": "metadata",
                "mode": "custom",
                "directory"       : "Descriptions",
                "content-format"  : "{description}\n",
                "extension-format": "descr.txt"
            }
        ]
    },

I added the cookies portion by using this extension on Chrome, opening up DA, and just grabbing the cookies in Netscape format. IDK if that's the right way to get them. Adding that in and trying to redownload has not helped; it still just gives me the jpg for the cover image.
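One way to sanity-check an exported cookies file, independent of gallery-dl, is to load it with Python's standard library and see whether it parses as Netscape format and contains any deviantart.com cookies. This is only a sketch: it does not reflect how gallery-dl itself reads the file, and the function name is made up.

```python
from http.cookiejar import LoadError, MozillaCookieJar


def deviantart_cookie_names(cookies_path):
    """Return the names of deviantart.com cookies in a cookies.txt file.

    Returns None when the file cannot be parsed as Netscape format.
    """
    jar = MozillaCookieJar()
    try:
        # Keep session and expired cookies so nothing in the file is hidden.
        jar.load(cookies_path, ignore_discard=True, ignore_expires=True)
    except LoadError:
        return None
    return sorted(c.name for c in jar if c.domain.endswith("deviantart.com"))


if __name__ == "__main__":
    # The filename here is a placeholder for wherever the export was saved.
    print(deviantart_cookie_names("www.deviantart.com_cookies.txt"))
```

An empty list or None would point at the export step, while a non-empty list suggests the file itself is fine and the problem is elsewhere.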

a-washing-machine commented 1 year ago

"Get cookies.txt locally" is the same extension I'm using too.

Hmm. Well I guess it's not as easy as I thought. Still works for me though, even though deviantArt has had another site change just earlier today apparently.

Okay, well... couple shots in the dark, maybe we'll hit something:

1) Try putting the cookies file directly into the folder where the exe file is stored, and change your config to just the cookies file's name without the file path.

2) When I add the --verbose flag, the output I get is this:

gallery-dl_1.26.0.exe -A=8 --config MY_CONFIG.conf https://www.deviantart.com/tyvadi/art/Mazes-Medusae-Team-Green-Side-Story-Part-2-684017850 --verbose
[gallery-dl][debug] Version 1.26.0 - Executable
[gallery-dl][debug] Python 3.8.10 - Windows-10-10.0.18363
[gallery-dl][debug] requests 2.31.0 - urllib3 1.26.17
[gallery-dl][debug] Configuration Files ['MY_CONFIG.conf']
[gallery-dl][debug] Starting DownloadJob for 'https://www.deviantart.com/tyvadi/art/Mazes-Medusae-Team-Green-Side-Story-Part-2-684017850'
[deviantart][debug] Using DeviantartDeviationExtractor for 'https://www.deviantart.com/tyvadi/art/Mazes-Medusae-Team-Green-Side-Story-Part-2-684017850'
[deviantart][debug] Using custom API credentials (client-id [######REDACTED########])
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): www.deviantart.com:443
[urllib3.connectionpool][debug] https://www.deviantart.com:443 "GET /api/v1/oauth2/user/profile/tyvadi HTTP/1.1" 200 1758
[urllib3.connectionpool][debug] https://www.deviantart.com:443 "GET /tyvadi/art/684017850 HTTP/1.1" 200 None
[urllib3.connectionpool][debug] https://www.deviantart.com:443 "GET /api/v1/oauth2/deviation/ED3D4D12-341C-761D-1476-67F225628BAB HTTP/1.1" 200 728
[urllib3.connectionpool][debug] https://www.deviantart.com:443 "GET /api/v1/oauth2/deviation/ED3D4D12-341C-761D-1476-67F225628BAB HTTP/1.1" 200 745
[urllib3.connectionpool][debug] https://www.deviantart.com:443 "GET /api/v1/oauth2/deviation/metadata?deviationids%5B0%5D=ED3D4D12-341C-761D-1476-67F225628BAB&mature_content=true HTTP/1.1" 200 952
[deviantart][debug] Active postprocessor modules: [MetadataPP]
[urllib3.connectionpool][debug] https://www.deviantart.com:443 "GET /api/v1/oauth2/deviation/download/ED3D4D12-341C-761D-1476-67F225628BAB?mature_content=true HTTP/1.1" 200 615
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): wixmp-ed30a86b8c4ca887773594c2.wixmp.com:443
[urllib3.connectionpool][debug] https://wixmp-ed30a86b8c4ca887773594c2.wixmp.com:443 "GET /f/d85a0c52-74b2-4221-a04e-2a31a71fc554/dbb8vju-b59950ca-3783-4f09-b578-e6e39d9a108b.pdf?token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJ1cm46YXBwOjdlMGQxODg5ODIyNjQzNzNhNWYwZDQxNWVhMGQyNmUwIiwiaXNzIjoidXJuOmFwcDo3ZTBkMTg4OTgyMjY0MzczYTVmMGQ0MTVlYTBkMjZlMCIsImV4cCI6MTY5NzA1OTQzMCwiaWF0IjoxNjk3MDU4ODIwLCJqdGkiOiI2NTI3MTAwZTFiYzJhIiwib2JqIjpbW3sicGF0aCI6IlwvZlwvZDg1YTBjNTItNzRiMi00MjIxLWEwNGUtMmEzMWE3MWZjNTU0XC9kYmI4dmp1LWI1OTk1MGNhLTM3ODMtNGYwOS1iNTc4LWU2ZTM5ZDlhMTA4Yi5wZGYifV1dLCJhdWQiOlsidXJuOnNlcnZpY2U6ZmlsZS5kb3dubG9hZCJdfQ._yyQAi4qVxHxKFIsgL8DE5Yr_IqY8F5NhaiqtRsV5cI HTTP/1.1" 200 403453
* .\gallery-dl\deviantart\tyvadi\deviantart_684017850_Mazes Medusae- Team Green Side Story, Part 2.pdf

...Not that I can pretend to understand the full intricacies of gallery-dl's inner workings (I don't even speak Python), but perhaps something can be gleaned from comparing our verbose outputs?

a-washing-machine commented 1 year ago

woops, did not mean to close this

Hrxn commented 1 year ago

I don't think this has anything to do with gallery-dl's inner workings. It's simply the dA API behavior, apparently caused by the changes they are making right at the moment.

ecophage commented 1 year ago

I mean, we're running this at the exact same time, I'd assume.

Could it be a mac vs pc issue? I don't think that'd be the case but I don't want to rule anything out (why would mac vs pc affect a website?)

Here's my --verbose log:

MacBook-Air:gallery-dl <User name is here it's just private info>$ gallery-dl --no-skip --verbose https://www.deviantart.com/tyvadi/art/Mazes-Medusae-Part-12-Transcript-671495689
[gallery-dl][debug] Version 1.25.8
[gallery-dl][debug] Python 3.10.4 - macOS-13.4-arm64-arm-64bit
[gallery-dl][debug] requests 2.31.0 - urllib3 1.26.9
[gallery-dl][debug] Configuration Files ['/etc/gallery-dl.conf']
[gallery-dl][debug] Starting DownloadJob for 'https://www.deviantart.com/tyvadi/art/Mazes-Medusae-Part-12-Transcript-671495689'
[deviantart][debug] Using DeviantartDeviationExtractor for 'https://www.deviantart.com/tyvadi/art/Mazes-Medusae-Part-12-Transcript-671495689'
[deviantart][debug] Using custom API credentials (client-id <numbers are here>)
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): www.deviantart.com:443
[urllib3.connectionpool][debug] https://www.deviantart.com:443 "GET /api/v1/oauth2/user/profile/tyvadi HTTP/1.1" 200 1759
[urllib3.connectionpool][debug] https://www.deviantart.com:443 "GET /tyvadi/art/671495689 HTTP/1.1" 200 None
[urllib3.connectionpool][debug] https://www.deviantart.com:443 "GET /api/v1/oauth2/deviation/50AE8955-2F89-52D5-EF87-F63184332D56 HTTP/1.1" 200 726
[urllib3.connectionpool][debug] https://www.deviantart.com:443 "GET /api/v1/oauth2/deviation/metadata?deviationids%5B0%5D=50AE8955-2F89-52D5-EF87-F63184332D56&mature_content=true HTTP/1.1" 200 871
[deviantart][debug] Using download archive '/Users/Ariel/Documents/archived Images/gallery-dl/archive.sqlite3'
[deviantart][debug] Active postprocessor modules: [MetadataPP]
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): images-wixmp-ed30a86b8c4ca887773594c2.wixmp.com:443
[urllib3.connectionpool][debug] https://images-wixmp-ed30a86b8c4ca887773594c2.wixmp.com:443 "GET /i/d85a0c52-74b2-4221-a04e-2a31a71fc554/db3she1-74c1e20e-b99c-44d3-99f3-f646589db780.jpg HTTP/1.1" 200 1689484
/Users/<username>/Documents/archived Images/gallery-dl/deviantart/tyvadi/deviantart_671495689_Mazes Medusae Part 12 Transcript.jpg

I tried moving the cookies, didn't change anything.
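For comparing two --verbose logs like the ones in this thread, a small sketch that pulls out the requested paths and checks whether the OAuth deviation/download endpoint was ever called. In the 1.26.0 log above that request is present, while it does not appear in the 1.25.8 log. The regular expression assumes the urllib3 debug line format shown in those logs; the function names are made up.

```python
import re

# Matches the request path inside urllib3 debug lines such as
#   ... "GET /api/v1/oauth2/deviation/download/<uuid>?... HTTP/1.1" 200 615
REQUEST_RE = re.compile(r'"GET (\S+) HTTP/1\.1"')

DOWNLOAD_PREFIX = "/api/v1/oauth2/deviation/download/"


def requested_paths(verbose_log):
    """Return every GET path found in a --verbose log, in order."""
    return REQUEST_RE.findall(verbose_log)


def called_download_endpoint(verbose_log):
    """True if the log shows a request to the deviation/download endpoint."""
    return any(path.startswith(DOWNLOAD_PREFIX)
               for path in requested_paths(verbose_log))
```

Because the pattern is applied with findall, it works the same whether the log was pasted one entry per line or as a single run of text.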

a-washing-machine commented 1 year ago

"[gallery-dl][debug] Version 1.25.8"

That's the wrong version. Current version is 1.26.0, it was released on October 3rd.

I just tested 1.25.8, that one indeed only gets the preview image. 1.26.0 gets the pdf file.

Although yes, as Hrxn pointed out, right now, as of literally earlier today (or yesterday depending on your time zone), deviantArt did change something again, making this not quite the best time to do a mass download until that problem is fixed. -_-

( It would've literally only taken one more day to finish my 6 day download marathon... grumble grumble )
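The version check above can also be done mechanically on a pasted log. A minimal sketch, assuming the "[gallery-dl][debug] Version X.Y.Z" line format shown in the logs in this thread, and using 1.26.0 as the minimum only because that is the release reported here as fetching the PDF:

```python
import re

VERSION_RE = re.compile(r"\[gallery-dl\]\[debug\] Version (\d+(?:\.\d+)*)")


def logged_version(verbose_log):
    """Return the gallery-dl version from a --verbose log as a tuple of ints,
    or None if no version line is found."""
    match = VERSION_RE.search(verbose_log)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def is_older_than(verbose_log, minimum=(1, 26, 0)):
    """True if the log comes from a version below `minimum`."""
    version = logged_version(verbose_log)
    return version is not None and version < minimum
```

Comparing tuples of integers avoids the string-comparison trap where "1.9.0" would sort after "1.26.0".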