mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0

Deviantart | Gallery-dl only downloading Journals and not Polls, Status Updates, from User Posts #3539

Open Aidanjosiah02 opened 1 year ago

Aidanjosiah02 commented 1 year ago

I am attempting to download all posts made by some artists on Deviantart. However in the "Posts" page it only grabs the "Journals" and excludes "Polls" and "Status Updates". Attempting to use a direct link such as "https://www.deviantart.com/<user>/posts/polls" returns [gallery-dl][error] Unsupported URL '<URL>' even though those posts exist. Using the option --list-keywords also does not show any sign of these other posts.

I use Windows 10, gallery-dl pip version 1.24.2, and the related settings in my config are:

        {
            "client-id": "<id>",
            "client-secret": "<secret>",
            "extra": true,
            "folders": true,
            "group": true,
            "include": ["all", "journal", "scraps"],
            "refresh-token": "<token>"
        }

Removing "journal" from "include" also does not work.

I have attached the verbose of one of my runs. verbose-t1na-posts.txt

ClosedPort22 commented 1 year ago

Can confirm gallery-dl doesn't support these types yet.

I'm working on this, but the situation seems to be rather complicated:

ClosedPort22 commented 1 year ago

I've added support for some status posts (#3541). The executables can be found here: https://github.com/ClosedPort22/gallery-dl/actions/workflows/executables.yml

You can try this out by using gallery-dl https://www.deviantart.com/<user>/posts/statuses. Or you can use "include": ["status"] or "include": "all" (not ["all"]) in your config file to enable this for all user URLs.
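
The difference between "all" and ["all"] can be illustrated with a small sketch. This is a simplified model, not gallery-dl's actual code: the helper name resolve_include and the AVAILABLE list are hypothetical, and the assumption is that only the bare string "all" expands to every sub-extractor, while names in a list are matched literally.

```python
def resolve_include(include, available):
    """Return the sub-extractor names selected by an "include" value.

    Hypothetical helper; `available` is the ordered list of known names.
    """
    if include == "all":
        # Only the bare string "all" expands to everything.
        return list(available)
    if isinstance(include, str):
        # Comma-separated string form, e.g. "gallery,scraps".
        include = include.replace(" ", "").split(",")
    # Names that are not known (including a literal "all" inside a list)
    # are dropped in this model.
    return [name for name in include if name in available]


# Assumed set of names, for illustration only.
AVAILABLE = ["gallery", "scraps", "journal", "favorite", "status"]

print(resolve_include("all", AVAILABLE))
print(resolve_include(["all", "journal", "scraps"], AVAILABLE))
print(resolve_include(["status"], AVAILABLE))
```

Under these assumptions, ["all", "journal", "scraps"] selects only journal and scraps, which would match the behaviour described in the first comment.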

Please let me know if you find any bugs or have suggestions on how this could be improved (apart from the currently missing features).

Aidanjosiah02 commented 1 year ago

Thank you so much; it worked amazingly! That was also pretty quick with the update.

succesful_stauses-verbose.txt

ClosedPort22 commented 1 year ago

No problem. I think you closed the issue too soon, though. I'm not the maintainer of the repo and this hasn't been merged into master yet. I'm still working on it.

Aidanjosiah02 commented 1 year ago

When encountering posts whose status type contains neither "deviation" nor "status", it throws a "KeyError" when looking up either key. In the section starting at line 787 in deviantart.py, the problem occurs when "gallery" is in item instead of one of the only two defined keys. Instead of key = "deviation" if "deviation" in item else "status" on line 791, it should include some sort of elif "gallery" in item: branch to deal with "gallery" as well. I am not very familiar with the code of this program, so for now I can't really think of a proper solution.

ClosedPort22 commented 1 year ago

When encountering posts whose status type contains neither "deviation" nor "status", it throws a "KeyError" when looking up either key. In the section starting at line 787 in deviantart.py, the problem occurs when "gallery" is in item instead of one of the only two defined keys. Instead of key = "deviation" if "deviation" in item else "status" on line 791, it should include some sort of elif "gallery" in item: branch to deal with "gallery" as well. I am not very familiar with the code of this program, so for now I can't really think of a proper solution.

Hm, the official documentation made no mention of the gallery field. Can you provide a link to the post that triggered the error?

Aidanjosiah02 commented 1 year ago

This is one of the accounts that did it, but no longer does? https://www.deviantart.com/maxeralfa017/posts/statuses I made a quick workaround earlier, but can't seem to get it to fail anymore.

Instead now it looks like this: other-error-verbose.txt

Aidanjosiah02 commented 1 year ago

Found one: https://www.deviantart.com/dsana/posts/statuses error-verbose.txt

Here it can't find "status". But it will find "gallery" if you tell it to search for that. My quick workaround was to say

# Inside the loop over items: yield only the two known keys.
if "deviation" in item:
    yield item["deviation"]
elif "status" in item:
    yield item["status"]
else:
    # Neither key is present (an unexpected item type): skip it.
    continue

But I suspect it should be handled in a better way. I'm not sure how many posts it misses using this that it shouldn't.

ClosedPort22 commented 1 year ago

Got KeyError at the same position in the API response, but there was no gallery field:

            "items": [
                {
                    "type": "thumb_background_deviation"
                }
            ]

The error should be fixed in https://github.com/mikf/gallery-dl/pull/3541/commits/c4aeca7a5a31c465fecd56992f03577ca5f358e4, and unexpected fields will simply be ignored for now (you can enable the metadata postprocessor if you don't want to lose this information).
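
A minimal sketch of the "ignore unexpected items" behaviour described above. It is not the code from the linked commit; the function name iter_status_items and the sample IDs are made up, and the item shapes are modelled on the API excerpt in this thread.

```python
def iter_status_items(items):
    """Yield the payload of each item that carries a known key."""
    for item in items:
        for key in ("deviation", "status"):
            if key in item:
                yield item[key]
                break
        # Items with neither key, such as one that only has a "type",
        # fall through and are skipped without raising KeyError.


# Sample data; the IDs are placeholders.
items = [
    {"type": "deviation", "deviation": {"deviationid": "A"}},
    {"type": "thumb_background_deviation"},
    {"type": "status", "status": {"statusid": "B"}},
]

print(list(iter_status_items(items)))
```

Compared with item["status"] as the fallback, this never indexes a key that has not been checked first, so an item type missing from the documentation cannot abort the whole run.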

Aidanjosiah02 commented 1 year ago

Thanks for the update! Now when I get the version from yesterday to print, it says {'type': 'thumb_background_deviation'} [deviantart][error] An unexpected error occurred: KeyError - 'status'. I don't know how I thought it was "gallery". Anyway, this new version works much better now; none of the statuses or the images therein appear to be skipped, though the images are placed outside the "Status" subfolder. This isn't really a problem for me since I can rename their paths in the batch script I made, but just a heads up anyway.

ClosedPort22 commented 1 year ago

This isn't really a problem for me since I can rename their paths in the batch script I made, but just a heads up anyway.

You can specify directory and archive-fmt for status like this (see also: https://github.com/mikf/gallery-dl/blob/master/docs/configuration.rst#extractordirectory):

"deviantart": {
    "status": {
        ...
    }
}

I assume people tend to share their own deviations in status updates, and that's why status is disabled by default. I haven't put this version into production yet, but I would probably override the default archive-fmts and use the same archive format for all DeviantArt extractors.

Aidanjosiah02 commented 1 year ago

Not sure if this is possible with that. I see I didn't explain the problem correctly. The images that come from the user's "stash" are placed in the user's root while the statuses are kept in "Status". At the time of describing the problem I didn't know the images came from the user's "stash", but after seeing that is the case, I'm not so sure the problem I raised is actually a problem, e.g. if someone wants to keep all the stash files separate from all other items.

I did notice something else that may be a minor problem if someone is concerned with preventing duplicates. Images from artists who are not the poster of the status update are placed into the poster's "Status" subdirectory rather than the original artist's directory. The metadata of the status updates appears to be enough to create softlinks across different artists, so having some switch to always "respect des fonds" would still allow consistent access.

These images I'm referring to all seem to have this in their metadata:

"items": [
    {
        "deviation": {
            "url": "https://www.deviantart.com/<artist>/art/<image>"
        },
        "type": "thumb_background_deviation"
    }
]

Still, this is a very minor problem as it seems very rare for this to occur, and we're talking like under 20MB of duplicates per artist.

ClosedPort22 commented 1 year ago

The images that come from the user's "stash" are placed in the user's root

Yeah, it's been like that since the beginning. After using gallery-dl for a while I decided to change the directory format to {username}/Stash for clarity. The stash extractor can be configured in the same way as described above.

You might see some stashed deviations in the "Status" folder as well, and that's because shared deviations are directly extracted by the status extractor rather than delegated to the stash extractor. It's even possible to merge the output into one directory by using conditional directory naming:

"deviantart": {
    "stash": {
        "directory" : ["{username}", "Stash"]
    },
    "status": {
        "directory": {
            "'sta.sh' in url": ["{username}", "Stash"],
            "": ["{username}", "Status"]
        }
    }
}
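
How such a conditional "directory" mapping could be resolved is sketched below. This is a simplified model rather than gallery-dl's implementation: the function pick_directory and the sample metadata are hypothetical, and the assumption is that each key is a Python expression evaluated against the file's metadata, with the empty string acting as the fallback.

```python
def pick_directory(rules, metadata):
    """Return the path segments of the first rule whose condition holds."""
    for condition, segments in rules.items():
        # An empty condition is treated as the default entry.
        if condition == "" or eval(condition, {"__builtins__": {}}, dict(metadata)):
            return [segment.format(**metadata) for segment in segments]
    return []


rules = {
    "'sta.sh' in url": ["{username}", "Stash"],
    "": ["{username}", "Status"],
}

# Made-up metadata for two files.
stash_item = {"username": "alice", "url": "https://sta.sh/0abc"}
status_item = {"username": "alice",
               "url": "https://www.deviantart.com/alice/status-update/1"}

print(pick_directory(rules, stash_item))   # ['alice', 'Stash']
print(pick_directory(rules, status_item))  # ['alice', 'Status']
```

Rules are checked in the order they are written, so the fallback entry belongs last.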

Images from artists who are not the poster of the status update are placed into the poster's "Status" subdirectory rather than the original artist's directory.

This can also be achieved through configuration, thanks to gallery-dl's flexibility in this regard.

"deviantart": {
    "status": {
        "directory": ["{author[username]}", "Status"]
    }
}

Or even:

"deviantart": {
    "status": {
        "directory": {
            "author['username'] != username": ["{author[username]}", "shared"],
            "": ["{username}", "Status"]
        }
    }
}

Aidanjosiah02 commented 1 year ago

Holy smokes you are a genius! I'll get on applying this after I get some sleep

ClosedPort22 commented 1 year ago

By the way, if you would like to minimize the chance of getting duplicate files, you can check out the archive function. I personally recommend including filesize in the archive format because it helps to detect modifications, re-uploads, etc. I'm currently using {_username}_{index}_{download_filesize|content[filesize]}.{extension}. If you download from artists from the same fandom or topic, you can even use a common archive database for them (I do this for Tumblr and Twitter). This way the shared content between them (e.g. retweets, reblogs, shared deviations) can be collectively managed and will only ever be downloaded once.
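
The idea of a shared archive keyed on file size can be sketched with an in-memory SQLite database. This is an illustration only: the table layout, the helper names, and the simplified key (username, index, filesize, extension) are assumptions, not gallery-dl's actual archive code.

```python
import sqlite3


def open_archive(path=":memory:"):
    """Open (or create) a one-column archive database."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS archive (entry TEXT PRIMARY KEY)")
    return conn


def archive_entry(meta):
    """Build the archive key in the shape username_index_filesize.ext."""
    return "{username}_{index}_{filesize}.{extension}".format(**meta)


def is_new(conn, meta):
    """Record the entry and report whether it was seen for the first time."""
    try:
        with conn:
            conn.execute("INSERT INTO archive (entry) VALUES (?)",
                         (archive_entry(meta),))
    except sqlite3.IntegrityError:
        # The key already exists, so the file would be skipped.
        return False
    return True


archive = open_archive()
original = {"username": "alice", "index": 123, "filesize": 2048, "extension": "png"}
reupload = dict(original, filesize=4096)  # same post, modified file

print(is_new(archive, original))  # True
print(is_new(archive, original))  # False
print(is_new(archive, reupload))  # True
```

Because the file size is part of the key, a re-upload with different content gets a new entry, while an identical file shared by another account that writes to the same database is only recorded once.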

Aidanjosiah02 commented 1 year ago

For the earlier post, this seems to achieve what I need:

"deviantart": {
    "stash": {
        "directory": ["deviantart", "{author[username]}-[{author[userid]}]", "Stash"]
    },
    "status": {
        "directory": {
            "'stash' in subcategory": ["deviantart", "{author[username]}-[{author[userid]}]", "Stash"],
            "'/art/' in url": ["deviantart", "{author[username]}-[{author[userid]}]", "All"],
            "": ["deviantart", "{author[username]}-[{author[userid]}]", "Status"]
        }
    },
    "journal": {
        "directory": {
            "'stash' in subcategory": ["deviantart", "{author[username]}-[{author[userid]}]", "Stash"],
            "'/art/' in url": ["deviantart", "{author[username]}-[{author[userid]}]", "All"],
            "": ["deviantart", "{author[username]}-[{author[userid]}]", "Journal"]
        }
    }
}

I can still create links to the correct files from the status update despite the target image being somewhere else using:

"items": [
    {
        "deviation": {
            "author": {
                "userid": "D3DBBBAF-E006-8D38-8687-0F15E669E9E8"
            },
            "deviationid": "902FA586-DA0F-7E01-F65C-1C163EADEF00"
        }
    }
]

included in the metadata, respecting the fonds and preventing duplicates.
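
As a rough sketch of how those two IDs could be pulled out of a status update's metadata to locate the original file, assuming the metadata has been loaded into a dictionary shaped like the excerpt above (the function name referenced_deviations is hypothetical, and how the pairs are turned into links depends on the local directory layout):

```python
def referenced_deviations(status_metadata):
    """Return (author userid, deviationid) pairs for shared deviations."""
    pairs = []
    for item in status_metadata.get("items", []):
        deviation = item.get("deviation")
        if deviation is None:
            continue  # an item without a "deviation" entry has nothing to link
        pairs.append((deviation["author"]["userid"], deviation["deviationid"]))
    return pairs


status_metadata = {
    "items": [
        {
            "deviation": {
                "author": {"userid": "D3DBBBAF-E006-8D38-8687-0F15E669E9E8"},
                "deviationid": "902FA586-DA0F-7E01-F65C-1C163EADEF00",
            }
        },
        {"type": "thumb_background_deviation"},
    ]
}

print(referenced_deviations(status_metadata))
```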

For your last post, I currently do use the archive file, but that's cool you can tell it how to store the info. Also is there any downside you know of to using {author[username]} over {username} for the archive file? Again, thank you for the help you have given me so far!

ClosedPort22 commented 1 year ago

Also is there any downside you know of to using {author[username]} over {username} for the archive file?

There really shouldn't be any. One thing that I can think of is that for posts without author[username], the field will simply become None and that may, assuming {index} is not always unique, cause some posts to be skipped erroneously. But in reality I've never seen any posts without it.
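
The theoretical collision can be shown with a short sketch. The key format and the sample posts are invented for illustration; the point is only that a missing author combined with a non-unique index yields identical archive keys.

```python
def archive_key(meta):
    """Build a key from author username and index; a missing author gives None."""
    author = meta.get("author") or {}
    return "{}_{}.{}".format(author.get("username"), meta["index"], meta["extension"])


# Two different posts without author data and with the same index,
# plus one post that has an author.
posts = [
    {"index": 0, "extension": "jpg"},
    {"index": 0, "extension": "jpg"},
    {"author": {"username": "alice"}, "index": 0, "extension": "jpg"},
]

keys = [archive_key(post) for post in posts]
print(keys)            # ['None_0.jpg', 'None_0.jpg', 'alice_0.jpg']
print(len(set(keys)))  # 2
```

The first two posts share a key, so the second one would be treated as already downloaded.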

I'd also recommend getting a SQLite viewer so you can verify that your archive-fmt is working as intended.

Aidanjosiah02 commented 1 year ago

One thing that I can think of is that for posts without author[username], the field will simply become None and that may, assuming {index} is not always unique, cause some posts to be skipped erroneously.

I see what you're saying since there are some files with an index of 0. Perhaps using {deviationid}_{download_filesize|content[filesize]}.{extension} would work? I assume the {deviationid} will always be unique regardless of author. And yes I have an SQLite viewer.

Aidanjosiah02 commented 1 year ago

Nevermind {deviationid} doesn't work with non-images, and I can't seem to find a good replacement. Have you ever seen a post without {author[username]} available, or is this just theoretical? If {author[username]} can be unavailable I would assume {author[userid]}_{index} also wouldn't work.

Edit: {subcategory}_{deviationid|statusid}_{download_filesize|content[filesize]}.{extension} appears to be reliable. Stash, Journals, deviations, all have a {deviationid} keyword. It only seems to be Statuses that are different and use {statusid}.
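
The effect of a field with alternatives such as {deviationid|statusid} can be approximated as "use the first name that is present and non-empty". The sketch below is an assumption about that behaviour, not gallery-dl's formatter; the helper names and the sample metadata are made up, and the file size part of the key is left out for brevity.

```python
def first_available(metadata, *names, default=None):
    """Return the first value among `names` that is present and non-empty."""
    for name in names:
        value = metadata.get(name)
        if value:
            return value
    return default


def archive_entry(meta):
    """Build a key in the shape subcategory_id.extension."""
    post_id = first_available(meta, "deviationid", "statusid")
    return "{}_{}.{}".format(meta["subcategory"], post_id, meta["extension"])


# Placeholder IDs, for illustration only.
deviation = {"subcategory": "gallery", "deviationid": "AAAA", "extension": "png"}
status = {"subcategory": "status", "statusid": "BBBB", "extension": "htm"}

print(archive_entry(deviation))  # gallery_AAAA.png
print(archive_entry(status))     # status_BBBB.htm
```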

ClosedPort22 commented 1 year ago

Nevermind deviationid doesn't work with non-images

For now it's possible to use {deviationid|statusid} to get a UUID for every post (see https://github.com/mikf/gallery-dl/blob/master/docs/formatting.md), but this might change when the PR is reviewed by @mikf.

Have you ever seen a post without {author[username]} available, or is this just theoretical?

It's purely theoretical.

ClosedPort22 commented 1 year ago

there are some files with an index of 0

Have you actually seen that happening? I always thought it was to prevent the program from crashing in case the API returned something unexpected. It was an issue during development, but it should've been fixed by https://github.com/mikf/gallery-dl/pull/3541/commits/013733c9e9e5d8196fc280641b2cbe940cc8ac5a.

Aidanjosiah02 commented 1 year ago

Have you actually seen that happening? I always thought it was to prevent the program from crashing in case the API returned

Yes, actually. These posts in particular: https://www.deviantart.com/dsana/status-update/20937531 https://www.deviantart.com/dsana/status-update/21573532 I cannot replicate the problem testing with my account's stash, so I really don't know. The metadata literally says "index": 0 in both. One thing they have in common is they both use images from the artist's stash, but other images from stash don't do this. I keep using this artist for examples since it's the only one I stumbled on that is able to create these problems. Here's an image: example

ClosedPort22 commented 1 year ago

Maybe you forgot to update to the latest commit? I can't reproduce the issue on https://github.com/mikf/gallery-dl/commit/013733c9e9e5d8196fc280641b2cbe940cc8ac5a.

Aidanjosiah02 commented 1 year ago

Oh yeah that fixed it.