Closed wankio closed 1 year ago
this feature would be a huge game changer due to all of the 404s on reddit
any update on this?
OP is right, seems like the existing metadata already contains the correct URL to the preview pic
preview['images'][N]['source']['url']
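A minimal sketch of reading that field, assuming the submission has already been loaded as a plain Python dict. The helper name `preview_urls` and the sample data are made up for illustration; this is not gallery-dl's code. `html.unescape` is applied in case the JSON was fetched without `raw_json=1`, where reddit escapes `&` in URLs.

```python
import html
from typing import Any


def preview_urls(submission: dict[str, Any]) -> list[str]:
    """Collect preview['images'][N]['source']['url'] for every N."""
    images = submission.get("preview", {}).get("images", [])
    urls = []
    for image in images:
        url = image.get("source", {}).get("url")
        if url:
            # Undo HTML escaping (&amp; -> &) if present; no-op otherwise.
            urls.append(html.unescape(url))
    return urls


# Hypothetical sample data, shaped like the metadata discussed above.
sample = {
    "preview": {
        "images": [
            {"source": {"url": "https://external-preview.redd.it/x.jpg?auto=webp&amp;s=1"}}
        ]
    }
}
print(preview_urls(sample))
```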
Should be working with https://github.com/mikf/gallery-dl/commit/14af15bd18b0a8d937c20e3d7a8c063af344ebd0. When a non-reddit URL fails, it'll now download the reddit preview image.
With imgur, the file almost always seems to still exist in full quality on their https://i.imgur.com subdomain when it's something their mods decided wasn't "advertiser friendly" (not stuff people deleted); it's just removed from the API and the main https://imgur.com domain. Just add the ID and an image extension like .jpg, or .mp4 for video. Made-up example: https://i.imgur.com/abc.jpg
There's also a bunch of images saved on web.archive.org, but there are issues with case sensitivity, i.e. they treat abC and abc as the same URL. I'm not sure how to work around that.
Should be implemented with 14af15b. When a non-reddit URL fails, it'll now download the reddit preview image.
A non-reddit URL can be anything, and reddit scrapes and stores only a jpg version of supported media. A better approach would be to restrict the fallback to jpg, png, webp, etc.
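The restriction suggested here could look roughly like the following. This is only a sketch of the idea, not how gallery-dl decides anything; `should_use_preview` and `IMAGE_EXTENSIONS` are hypothetical names, and treating extensionless URLs as "not an image" is an assumption.

```python
from os.path import splitext
from urllib.parse import urlparse

# Extensions for which a jpg preview is a reasonable stand-in (assumed list).
IMAGE_EXTENSIONS = {"jpg", "jpeg", "png", "webp"}


def should_use_preview(failed_url: str) -> bool:
    """Return True only if the failed external URL looks like a still image."""
    path = urlparse(failed_url).path
    extension = splitext(path)[1].lstrip(".").lower()
    return extension in IMAGE_EXTENSIONS


print(should_use_preview("https://i.imgur.com/abc.jpg"))       # still image
print(should_use_preview("https://i.imgur.com/B6r5qQa.gifv"))  # video-like
```

With a check like this, a failed .gifv or .mp4 link would be reported as failed instead of being silently replaced by a still image.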
If I use this config, there is no way to tell whether the actual URL was an image or a video file, since I now have an image instead of the actual video.
"reddit":
{
    "filename": "{category}_{subcategory}_{id}.{extension}"
}
here is an example
D:\>gallery-dl -v "https://www.reddit.com/r/PornStarHQ/comments/147xq00/angie_faith_new_blonde_big_tiddy_starlet_is/"
[gallery-dl][debug] Version 1.26.0-dev
[gallery-dl][debug] Python 3.11.4 - Windows-10-10.0.19045-SP0
[gallery-dl][debug] requests 2.29.0 - urllib3 1.26.16
[gallery-dl][debug] Configuration Files ['%APPDATA%\\gallery-dl\\config.json']
[gallery-dl][debug] Starting DownloadJob for 'https://www.reddit.com/r/PornStarHQ/comments/147xq00/angie_faith_new_blonde_big_tiddy_starlet_is/'
[reddit][debug] Using RedditSubmissionExtractor for 'https://www.reddit.com/r/PornStarHQ/comments/147xq00/angie_faith_new_blonde_big_tiddy_starlet_is/'
[reddit][debug] Using custom API credentials (client-id 6gRuf*****************)
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): oauth.reddit.com:443
[urllib3.connectionpool][debug] https://oauth.reddit.com:443 "GET /comments/147xq00/.json?limit=0&raw_json=1 HTTP/1.1" 200 9765
[reddit][debug] Active postprocessor modules: [MetadataPP]
[imgur][debug] Using ImgurImageExtractor for 'https://i.imgur.com/B6r5qQa.gifv'
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): api.imgur.com:443
[urllib3.connectionpool][debug] https://api.imgur.com:443 "GET /post/v1/media/B6r5qQa?include=media%2Ctags%2Caccount HTTP/1.1" 404 143
[imgur][error] HttpError: '404 Not Found' for 'https://api.imgur.com/post/v1/media/B6r5qQa'
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): external-preview.redd.it:443
[urllib3.connectionpool][debug] https://external-preview.redd.it:443 "GET /bah-O1iQeAI_XfppJ2_CFXQmjvBJTOFGzA7SvrICJ2w.jpg?auto=webp&s=6a35571696ea2e644b45f53cd42a0e6ee208f0c2 HTTP/1.1" 200 87376
* F:\\dled-gallery-dl\reddit\PornStarHQ\reddit_PornStarHQ_147xq00.jpg
Imgur completely removed them from its servers and CDNs, not just from the API endpoint. Give some URLs.
Here's a quick one (NSFW): https://imgur.com/0iSOqVP (404), https://i.imgur.com/0iSOqVP.jpg (200). Not sure what percentage of links still work. Do note that /a/ or 'album' IDs will not work with this method; you need the direct or 'image' ID.
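As a rough sketch of this trick: take the image ID from an imgur link and build direct i.imgur.com candidates, skipping /a/ album links. The function name `direct_candidates` and the choice of trying .jpg and .mp4 are assumptions for illustration, and nothing here guarantees the resulting URL still exists.

```python
from urllib.parse import urlparse


def direct_candidates(url: str, extensions=("jpg", "mp4")) -> list[str]:
    """Build direct i.imgur.com URLs to try for a single-image imgur link."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Album links (/a/<id>) carry an album ID, which does not work here.
    if not parts or parts[0] == "a":
        return []
    # Drop any existing extension, e.g. "B6r5qQa.gifv" -> "B6r5qQa".
    image_id = parts[-1].split(".", 1)[0]
    return [f"https://i.imgur.com/{image_id}.{ext}" for ext in extensions]


print(direct_candidates("https://imgur.com/0iSOqVP"))
```

A caller would then request each candidate in turn and keep the first one that answers with a 200.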
web.archive.org would be a great source if it can be made to work somehow; there's over 500 TiB uploaded there
Looks like a lot of these i.imgur.com links are slowly disappearing now, so archive.org would be the best bet, sadly
I wonder if it would be possible to get gallery-dl to check the wayback machine whenever it encounters a 404 on an imgur link. That seems like the only solution here since the archive team managed to download EVERY single imgur link posted on reddit until December 2022.
I agree that it is very worthwhile! Full quality images just steps away...
The only issue at the moment is that the Wayback Machine is case-insensitive: /cat.jpg, /cAt.jpg, /CAT.jpg, /cAT.jpg, etc. are all treated as 'one' URL, so sadly it's not as simple as grabbing an image from one archive.org URL.
I think there's a way to grab all 'snapshots' of a URL and then, maybe via HTTP headers, check each one to determine which has the right capitalization for gallery-dl to use? Something to keep in mind, though, as there are capitalization collisions and you wouldn't want the wrong thing saved.
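One way the snapshot idea might be sketched is with the Wayback Machine's CDX search endpoint, which lists captures of a URL. This assumes the `original` field in each CDX row keeps the capitalization of the captured URL, so rows can be filtered by an exact-case path match; that assumption, and whether the resulting playback URL actually serves the right file, would need checking against real data. All function names are hypothetical.

```python
import json
from urllib.parse import urlencode, urlparse
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(url: str) -> str:
    """Build a CDX query that lists captures of `url` as JSON rows."""
    params = {"url": url, "output": "json", "fl": "timestamp,original,statuscode"}
    return CDX_ENDPOINT + "?" + urlencode(params)


def exact_case_captures(rows: list[list[str]], wanted_url: str) -> list[str]:
    """Keep successful captures whose original path matches case-sensitively."""
    if not rows:
        return []
    header, *captures = rows  # the first JSON row names the columns
    ts, orig, status = (header.index(k) for k in ("timestamp", "original", "statuscode"))
    wanted_path = urlparse(wanted_url).path
    return [
        # "id_" asks for the capture as originally archived.
        f"https://web.archive.org/web/{row[ts]}id_/{row[orig]}"
        for row in captures
        if row[status] == "200" and urlparse(row[orig]).path == wanted_path
    ]


def fetch_rows(url: str) -> list[list[str]]:
    """Network call, not exercised in the offline example below."""
    with urlopen(cdx_query_url(url), timeout=30) as response:
        body = response.read().decode("utf-8")
    return json.loads(body) if body.strip() else []


# Made-up rows showing a capitalization collision between abC and abc.
rows = [
    ["timestamp", "original", "statuscode"],
    ["20200101000000", "https://i.imgur.com/abC.jpg", "200"],
    ["20210101000000", "https://i.imgur.com/abc.jpg", "200"],
    ["20220101000000", "https://i.imgur.com/abc.jpg", "404"],
]
print(exact_case_captures(rows, "https://i.imgur.com/abc.jpg"))
```

Comparing only the path keeps the match independent of scheme or port differences in the archived URL.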
When downloading with gallery-dl, if an external link (in this case, imgur) returns a 404, it gets skipped.
But the preview content is still accessible via preview['images'][N]['source']['url']