mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0
11.7k stars 953 forks

download reddit preview if external source is 404 #4322

Closed wankio closed 1 year ago

wankio commented 1 year ago

When using gallery-dl to download, if an external link (imgur, in this case) returns a 404, the file is skipped.

But the preview content is still accessible via preview['images'][N]['source']['url']
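
For reference, a minimal sketch of pulling those URLs out of a submission's JSON. The `preview_urls` helper and the sample dict are hypothetical and made up for illustration; only the `preview['images'][N]['source']['url']` path comes from the metadata itself, and this is not how gallery-dl does it internally.

```python
def preview_urls(submission):
    """Return the source URL of every preview image in a submission dict."""
    # Posts without a preview simply have no "preview" key (or a null one).
    preview = submission.get("preview") or {}
    urls = []
    for image in preview.get("images", []):
        url = (image.get("source") or {}).get("url")
        if url:
            urls.append(url)
    return urls


# Made-up sample shaped like the relevant part of a submission object.
submission = {
    "url": "https://i.imgur.com/abc.gifv",
    "preview": {
        "images": [
            {"source": {"url": "https://external-preview.redd.it/abc.jpg?auto=webp&s=0123"}},
        ]
    },
}

for url in preview_urls(submission):
    print(url)
```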

cheese529 commented 1 year ago

this feature would be a huge game changer due to all of the 404s on reddit

snwefly commented 1 year ago

any update on this?

Hrxn commented 1 year ago

OP is right, seems like the existing metadata already contains the correct URL to the preview pic

preview['images'][N]['source']['url']

mikf commented 1 year ago

Should be working with https://github.com/mikf/gallery-dl/commit/14af15bd18b0a8d937c20e3d7a8c063af344ebd0. When a non-reddit URL fails, it'll now download the reddit preview image.

ofifoto commented 1 year ago

With imgur, the file almost always seems to still exist in full quality on their https://i.imgur.com subdomain, at least for anything their mods decided wasn't "advertiser friendly" (not stuff people deleted themselves). It's just removed from the API and the main https://imgur.com domain. Just add the ID and an image extension like .jpg, or .mp4 for video. Made-up example: https://i.imgur.com/abc.jpg

There's also a bunch of images saved on web.archive.org, but there are issues with case sensitivity, i.e. they treat abC and abc as the same URL, and I'm not sure how to work around that.

ghbook commented 1 year ago

Should be implemented with 14af15b. When a non-reddit URL fails, it'll now download the reddit preview image.

The non-reddit URL can be anything, but reddit scrapes and stores only a jpg version of supported media. A better approach would be to restrict it to jpg, png, webp, etc.
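
Roughly what I have in mind, as a standalone sketch. `wants_preview_fallback` and `IMAGE_EXTENSIONS` are hypothetical names, not anything that exists in gallery-dl, and the extension list is just my guess at a sensible default:

```python
import posixpath
from urllib.parse import urlsplit

# Assumed whitelist: only fall back to the reddit preview for these.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def wants_preview_fallback(failed_url):
    """True if the failed external URL looks like a still image."""
    # Look only at the path so query strings don't confuse the check.
    path = urlsplit(failed_url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in IMAGE_EXTENSIONS


for url in ("https://i.imgur.com/abc.jpg", "https://i.imgur.com/B6r5qQa.gifv"):
    print(url, wants_preview_fallback(url))
```

With a check like this, a failed .gifv or .mp4 link would stay a failed download instead of silently turning into a jpg.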

If I use this config, there is no way to tell whether the actual URL was an image or a video file, since I now have an image instead of the actual video.

"reddit":
{
    "filename": "{category}_{subcategory}_{id}.{extension}"
}

here is an example

D:\>gallery-dl -v "https://www.reddit.com/r/PornStarHQ/comments/147xq00/angie_faith_new_blonde_big_tiddy_starlet_is/"
[gallery-dl][debug] Version 1.26.0-dev
[gallery-dl][debug] Python 3.11.4 - Windows-10-10.0.19045-SP0
[gallery-dl][debug] requests 2.29.0 - urllib3 1.26.16
[gallery-dl][debug] Configuration Files ['%APPDATA%\\gallery-dl\\config.json']
[gallery-dl][debug] Starting DownloadJob for 'https://www.reddit.com/r/PornStarHQ/comments/147xq00/angie_faith_new_blonde_big_tiddy_starlet_is/'
[reddit][debug] Using RedditSubmissionExtractor for 'https://www.reddit.com/r/PornStarHQ/comments/147xq00/angie_faith_new_blonde_big_tiddy_starlet_is/'
[reddit][debug] Using custom API credentials (client-id 6gRuf*****************)
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): oauth.reddit.com:443
[urllib3.connectionpool][debug] https://oauth.reddit.com:443 "GET /comments/147xq00/.json?limit=0&raw_json=1 HTTP/1.1" 200 9765
[reddit][debug] Active postprocessor modules: [MetadataPP]
[imgur][debug] Using ImgurImageExtractor for 'https://i.imgur.com/B6r5qQa.gifv'
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): api.imgur.com:443
[urllib3.connectionpool][debug] https://api.imgur.com:443 "GET /post/v1/media/B6r5qQa?include=media%2Ctags%2Caccount HTTP/1.1" 404 143
[imgur][error] HttpError: '404 Not Found' for 'https://api.imgur.com/post/v1/media/B6r5qQa'
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): external-preview.redd.it:443
[urllib3.connectionpool][debug] https://external-preview.redd.it:443 "GET /bah-O1iQeAI_XfppJ2_CFXQmjvBJTOFGzA7SvrICJ2w.jpg?auto=webp&s=6a35571696ea2e644b45f53cd42a0e6ee208f0c2 HTTP/1.1" 200 87376
* F:\\dled-gallery-dl\reddit\PornStarHQ\reddit_PornStarHQ_147xq00.jpg

ghbook commented 1 year ago

With imgur, the file almost always seems to still exist in full quality on their https://i.imgur.com subdomain, at least for anything their mods decided wasn't "advertiser friendly" (not stuff people deleted themselves). It's just removed from the API and the main https://imgur.com domain. Just add the ID and an image extension like .jpg, or .mp4 for video. Made-up example: https://i.imgur.com/abc.jpg

There's also a bunch of images saved on web.archive.org, but there are issues with case sensitivity, i.e. they treat abC and abc as the same URL, and I'm not sure how to work around that.

Imgur removed those completely from its servers and CDNs, not just from the API endpoint. Can you give some URLs?

ofifoto commented 1 year ago

Imgur removed those completely from its servers and CDNs, not just from the API endpoint. Can you give some URLs?

here's a quick one (NSFW): https://imgur.com/0iSOqVP (404), https://i.imgur.com/0iSOqVP.jpg (200). Not sure as to the percentage of working links. Do note that /a/ or 'album' IDs will not work with this method, you need the direct or 'image' ID.
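
In code, the rewrite would look something like this. It's only a sketch: `direct_imgur_url` is a made-up helper, the host list and ID pattern are my assumptions, and whether the resulting URL actually returns a 200 still has to be checked with a real request.

```python
import re
from urllib.parse import urlsplit

# Assumed set of hostnames that carry plain image IDs.
IMGUR_HOSTS = {"imgur.com", "www.imgur.com", "m.imgur.com", "i.imgur.com"}


def direct_imgur_url(url, extension="jpg"):
    """Build an i.imgur.com candidate URL, or None if no image ID is found."""
    parts = urlsplit(url)
    if parts.netloc.lower() not in IMGUR_HOSTS:
        return None
    # Accept "/<ID>" or "/<ID>.<ext>" only; "/a/<ID>" album paths do not match.
    match = re.fullmatch(r"/([A-Za-z0-9]+)(?:\.\w+)?", parts.path)
    if match is None:
        return None
    return "https://i.imgur.com/{}.{}".format(match.group(1), extension)


print(direct_imgur_url("https://imgur.com/0iSOqVP"))
print(direct_imgur_url("https://imgur.com/a/abc"))
```

The ID is kept exactly as written, since imgur IDs are case-sensitive.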

There's also a bunch of images saved on web.archive.org, but there are issues with case sensitivity, i.e. they treat abC and abc as the same URL, and I'm not sure how to work around that.

this would be a great source if it can be made to work somehow; there's over 500 TiB uploaded there

ofifoto commented 1 year ago

Imgur removed those completely from its servers and CDNs, not just from the API endpoint. Can you give some URLs?

here's a quick one (NSFW): https://imgur.com/0iSOqVP (404), https://i.imgur.com/0iSOqVP.jpg (200). Not sure as to the percentage of working links. Do note that /a/ or 'album' IDs will not work with this method, you need the direct or 'image' ID.

looks like a lot of these are slowly disappearing now, so archive.org would be the best bet, sadly

cheese529 commented 1 year ago

this would be a great source if it can be made to work somehow,

I wonder if it would be possible to get gallery-dl to check the wayback machine whenever it encounters a 404 on an imgur link. That seems like the only solution here since the archive team managed to download EVERY single imgur link posted on reddit until December 2022.

ofifoto commented 1 year ago

this would be a great source if it can be made to work somehow,

I wonder if it would be possible to get gallery-dl to check the wayback machine whenever it encounters a 404 on an imgur link. That seems like the only solution here since the archive team managed to download EVERY single imgur link posted on reddit until December 2022.

I agree that it is very worthwhile! Full quality images just steps away...

The only issue at the moment is that the wayback machine is case-insensitive, so /cat.jpg, /cAt.jpg, /CAT.jpg, /cAT.jpg, etc. are all treated as 'one' URL, so sadly it's not as simple as just grabbing an image from one URL with archive.org.

I think there's a way to grab all 'snapshots' of a URL and then, maybe via HTTP headers, check each one to determine which has the right capitalization for gallery-dl to use? Something to keep in mind, though: there are capitalization collisions, and you wouldn't want the wrong thing saved.
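
A rough, untested-against-the-live-service sketch of one way to do the filtering. It uses the capture list instead of HTTP headers, and it assumes the Wayback CDX API returns JSON rows with a header row containing `timestamp` and `original`, that `original` keeps the capitalization of the captured URL, and that the `id_` playback form serves the raw file. All of that should be checked against the Wayback documentation first. The function names and sample rows are made up.

```python
from urllib.parse import urlencode, urlsplit

# Assumed CDX endpoint and parameters; verify before relying on them.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(url):
    """Build a capture-list query for one URL (JSON output, 200s only)."""
    query = {"url": url, "output": "json", "filter": "statuscode:200"}
    return CDX_ENDPOINT + "?" + urlencode(query)


def exact_case_snapshots(rows, wanted_url):
    """Keep only captures whose original path matches wanted_url exactly."""
    if not rows:
        return []
    header, *captures = rows
    ts_index = header.index("timestamp")
    orig_index = header.index("original")
    wanted_path = urlsplit(wanted_url).path
    snapshots = []
    for row in captures:
        # Case-sensitive comparison is what resolves abC vs. abc collisions.
        if urlsplit(row[orig_index]).path == wanted_path:
            snapshots.append(
                "https://web.archive.org/web/{}id_/{}".format(
                    row[ts_index], row[orig_index]))
    return snapshots


# Made-up rows in the assumed response shape (first row is the header).
rows = [
    ["urlkey", "timestamp", "original", "mimetype", "statuscode"],
    ["com,imgur,i)/abc.jpg", "20200101000000", "https://i.imgur.com/abC.jpg", "image/jpeg", "200"],
    ["com,imgur,i)/abc.jpg", "20210101000000", "https://i.imgur.com/abc.jpg", "image/jpeg", "200"],
]
print(exact_case_snapshots(rows, "https://i.imgur.com/abC.jpg"))
```

If no capture matches the exact capitalization, the function returns an empty list, so nothing would get saved under the wrong ID.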