mikf / gallery-dl

Command-line program to download image galleries and collections from several image hosting sites
GNU General Public License v2.0

several 4chan archives appear to have blocked gallery-dl #5399

Open stubkan opened 5 months ago

stubkan commented 5 months ago

A few weeks ago I noticed that some archival sites, such as thebarchive and archived.moe, could no longer be scraped by gallery-dl and were blocked. I decided to wait a while to see if the issue went away, but it appears to still be present. Also, since it is occurring with more than one archival site, I think it may be a new security update of some kind that blocks robots?

Accessing the thread normally in a browser works, but attempting to use gallery-dl to collect images fails with the following error message:

   Scraping thread 916074222... 1/1
   [archivedmoe][warning] Cloudflare challenge
   [archivedmoe][error] HttpError: '403 Forbidden' for 'https://archived.moe/_/api/chan/thread/'

I tested multiple 4chan archival sites to see which are working and which throw the cloudflare challenge and block;

boards.4chan.org - WORKS
archive.4plebs.org - WORKS
archived.moe - BLOCKED
thebarchive.com - BLOCKED
desuarchive.org - WORKS
archive.palanq.win - WORKS

arch.b4k.co - BLOCKED. However, the Cloudflare notification is absent for arch.b4k.co, so I am not sure whether it is the same block:

   Scraping thread 671665397... 1/1
   [b4k][error] HttpError: '403 Forbidden' for 'https://arch.b4k.co/_/api/chan/thread/'
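
For reference, the two failure patterns above (a 403 with a "Cloudflare challenge" warning, and a bare 403 without one) can be told apart mechanically. The following is a minimal Python sketch that only pattern-matches the log lines quoted in this thread. The function name `classify` and its return labels are hypothetical and not part of gallery-dl.

```python
import re

# Matches lines such as "[archivedmoe][warning] Cloudflare challenge".
LOG_LINE = re.compile(r"^\[(?P<source>[^\]]+)\]\[(?P<level>[^\]]+)\]\s*(?P<message>.*)$")


def classify(log_text):
    """Label one gallery-dl run as 'cloudflare', 'forbidden', or 'no-block'."""
    challenge = False
    forbidden = False
    for raw in log_text.splitlines():
        match = LOG_LINE.match(raw.strip())
        if not match:
            continue  # progress lines, downloaded file paths, etc.
        level = match.group("level")
        message = match.group("message")
        if level == "warning" and "Cloudflare challenge" in message:
            challenge = True
        if level == "error" and "403 Forbidden" in message:
            forbidden = True
    if challenge:
        return "cloudflare"
    if forbidden:
        return "forbidden"
    return "no-block"


if __name__ == "__main__":
    b4k = "[b4k][error] HttpError: '403 Forbidden' for 'https://arch.b4k.co/_/api/chan/thread/'"
    print(classify(b4k))
```

A "forbidden" result without the warning, as seen for arch.b4k.co, may or may not be the same kind of block; the sketch only reports what the log says.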

arisboch commented 5 months ago

I take it a spoofed user agent didn't help?

stubkan commented 5 months ago

I am not sure what that is; I don't see any mention of it in the documentation. I did have to set the referer to blank to get one of the sites working prior to this block. I had been using gallery-dl without issue for half a year before this.

I tried adding --user-agent "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0" to the gallery-dl command, but it throws the same Cloudflare block.

mikf commented 5 months ago

You need to provide cookies and user agent of the browser that can access these blocked sites: https://github.com/mikf/gallery-dl/issues/4844#issuecomment-1872529821

stubkan commented 5 months ago

Can you generate a success on your end? I do not seem to be able to.

   gallery-dl --user-agent browser --cookies-from-browser chromium https://thebarchive.com/b/thread/916060069
   [cookies][error] Failed to read from GNOME keyring
   [cookies][info] Extracted 3091 cookies from Chromium
   [thebarchive][warning] Cloudflare challenge
   [thebarchive][error] HttpError: '403 Forbidden' for 'https://thebarchive.com/_/api/chan/thread/'

I tried firefox instead;

   gallery-dl --user-agent browser --cookies-from-browser firefox https://thebarchive.com/b/thread/916060069
   [cookies][info] Extracted 375 cookies from Firefox
   [thebarchive][warning] Cloudflare challenge
   [thebarchive][error] HttpError: '403 Forbidden' for 'https://thebarchive.com/_/api/chan/thread/'

I thought maybe I had to refresh the cookies myself by visiting the site. I did so and retried; it imported 377 cookies instead of 375, but it still failed:

   gallery-dl --user-agent browser --cookies-from-browser firefox https://thebarchive.com/b/thread/916060069
   [cookies][info] Extracted 377 cookies from Firefox
   [thebarchive][warning] Cloudflare challenge
   [thebarchive][error] HttpError: '403 Forbidden' for 'https://thebarchive.com/_/api/chan/thread/'

mikf commented 5 months ago

"When there's no Cloudflare challenge for your browser and/or there's no cf_clearance cookie, you are out of luck." ... and there isn't one for thebarchive.com

It doesn't seem to work for archived.moe either, even though there is a cf_clearance cookie present.

This does work for sites with a "Verifying you are human" check like nhentai, but apparently not here.

stubkan commented 5 months ago

I got a cloudflare human check on thebarchive, and then retried - and it still did not pass.

It seems likely to me that these changes will eventually propagate to all the 4chan archives if they prove successful.

Here's a post by the maintainer of archived.moe;

[image: post by the archived.moe maintainer]

stubkan commented 5 months ago

"When there's no Cloudflare challenge for your browser and/or there's no cf_clearance cookie, you are out of luck." ... and there isn't one for thebarchive.com

I checked my cookies, and there is a cf_clearance cookie for thebarchive - as well as the site requesting I pass a human check, which I clicked on.

But, gallery-dl is still blocked, unfortunately.

cheese529 commented 5 months ago

Do you think there might be a possible workaround for this, or are we out of luck here, @mikf? I know yt-dlp uses user agent + cookies to bypass Cloudflare issues, so maybe we could look at how they are able to scrape sites to see if we can get any info on how to deal with this.

Hrxn commented 5 months ago

Not sure. I tried the thebarchive example link from above; here's my log content:

[2024-04-01T09:38:51][info] Extracted 3140 cookies from Chrome
[2024-04-01T09:38:52][warning] Cloudflare challenge [Source URL: https://thebarchive.com/b/thread/916060069]
[2024-04-01T09:38:52][error] HttpError: '403 Forbidden' for 'https://thebarchive.com/_/api/chan/thread/' [Source URL: https://thebarchive.com/b/thread/916060069]

But looking in Chrome DevTools with this thread, I have these cookies (among some others);

__cf_bm
cf_clearance
csrftoken
foolframe_KmD_csrf_token

And I think cookie extraction from the browser should work... at least that's what my log says?

mikf commented 5 months ago

I know yt-dlp uses user agent + cookies to bypass Cloudflare issues

That's what I recommended doing https://github.com/mikf/gallery-dl/issues/5399#issuecomment-2027253897, but it doesn't work in this situation https://github.com/mikf/gallery-dl/issues/5399#issuecomment-2027314873.

Hrxn commented 5 months ago

Maybe the API is completely blocked, i.e. always returning a 403 no matter what?

mikf commented 5 months ago

It is still accessible with a browser. It does, however, show an actual "Verify you are human" check. Solving it and using cookies and user agent afterwards does allow gallery-dl to access it, at least it did for me.

So go to https://thebarchive.com/_/api/chan/thread/, let your browser solve the challenge, and then do the cookie/user-agent thingy. This should allow using the API and downloading from thebarchive and archivedmoe.
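
The two steps described above (open the API URL in a browser, then rerun gallery-dl with that browser's cookies and user agent) can be sketched in Python as follows. The helper names `challenge_url` and `gallery_dl_command` are hypothetical. The API path and the command-line flags are the ones quoted earlier in this thread, and the sketch only builds the command rather than running it.

```python
import shlex
from urllib.parse import urlsplit

# Path taken from the error messages quoted in this thread.
API_PATH = "/_/api/chan/thread/"


def challenge_url(thread_url):
    """URL to open in the browser so that the human check is shown."""
    parts = urlsplit(thread_url)
    return f"{parts.scheme}://{parts.netloc}{API_PATH}"


def gallery_dl_command(thread_url, browser="firefox"):
    """Argument list that reuses the browser's cookies and user agent."""
    return [
        "gallery-dl",
        "--user-agent", "browser",
        "--cookies-from-browser", browser,
        thread_url,
    ]


if __name__ == "__main__":
    url = "https://thebarchive.com/b/thread/916060069"
    print("1. Open in the browser and solve the check:", challenge_url(url))
    print("2. Then run:", shlex.join(gallery_dl_command(url)))
```

The browser named in step 2 has to be the one that solved the check in step 1; otherwise the extracted cookies will presumably not contain a usable cf_clearance value.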

Hrxn commented 5 months ago

Right, maybe I should've "used" the site more basically, because the "verify human" check did not pop for me in the browser.

I still saw the cf_clearance cookie in my browser devtools, but maybe it was simply a default or randomized nonsensical value. Should have considered the possibility of it being invalid, true.

Opening https://thebarchive.com/_/api/chan/thread/ in the browser immediately opens the "verify human" check, I toggled the mark, closed the browser and immediately tried the thread URL from earlier again (https://thebarchive.com/b/thread/916060069), and it was actually working this time.

It downloaded the grand total of one image, but I also only see one image in this thread when viewing in my browser, so this is probably correct, don't know what you guys see here..

So, @stubkan , given that you actually have the correct cookie in your browser, extraction (at least from thebarchive.com) seems to work - to answer the issue in this thread here.

mikf commented 5 months ago

maybe I should've "used" the site more basically, because the "verify human" check did not pop for me in the browser.

The problem here is that the "verify human" check does not appear when just regularly browsing the site in question, but only when explicitly visiting an API URL.

stubkan commented 5 months ago

Unfortunately, I have been doing these steps. I have passed the Cloudflare human check in the browser (both Firefox and Chromium), checking with both archivedmoe and thebarchive. I also tried a few different threads, and it gives me the Cloudflare challenge block every time.

Can you please outline the exact commands that lead to success for you? I am doing it the way shown in my previous comment.

mikf commented 5 months ago

   $ gallery-dl --cookies-from firefox:/tmp/.firefox --user-agent browser https://archived.moe/gd/thread/309639/
   archivedmoe/gd/309639 Which Adobe progr… Adobe_Systems_logo_and_wordmark.svg.png
   archivedmoe/gd/309639 Which Adobe progr…hic design?/1495922648056 Sans titre.png

Hrxn commented 5 months ago

Yep, simply extracting the cookies from the browser definitely works. So, either --cookies-from-browser or via config, like this, for example.

            "cookies": ["chrome", "Profile 4"]

You only have to make sure that the cookies from the browser match the "user-agent" setting used by gallery-dl, like this for example:

            "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"

Or, like mikf suggested, by using "browser" as the string value for "user-agent", which will try to automatically use the UA information from your system's default browser. Also fine, if that browser is the one which has the correct cookies.
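
As an illustration, the two settings quoted above can be generated together so that they always refer to the same browser. This is a minimal Python sketch; the function name `build_config` is hypothetical, and placing both keys under a top-level "extractor" object is an assumption about the config layout that is not shown in this thread.

```python
import json


def build_config(browser, user_agent="browser", profile=None):
    """Build a config fragment pairing a browser's cookies with its user agent."""
    cookies = [browser] if profile is None else [browser, profile]
    # Assumption: both options live under the top-level "extractor" object.
    return {"extractor": {"cookies": cookies, "user-agent": user_agent}}


if __name__ == "__main__":
    print(json.dumps(build_config("chrome", profile="Profile 4"), indent=4))
```

Passing an explicit user-agent string instead of "browser" is the safer choice when the browser holding the cookies is not the system default.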

The problem here is that the "verify human" check does not appear when just regularly browsing the site in question, but only when explicitly visiting an API URL.

I see, but honestly, I don't think this is fundamentally different to how cookie-based auth works for any other site, yet. Might be possible that the problem here is quick expiration of the cookies, but I fail to understand how visiting this special API URL in your browser, making sure it succeeds so that you've got your correct cookies is more of a problem than the usual visiting the site in your browser, and sign-in with your credentials to get your correct cookies steps.

stubkan commented 5 months ago

export cookies in netscape.txt and add with gallery-dl -C

I tried this method and got the same Cloudflare block. Perhaps the cookies are different for us, or something is not matching up somewhere?

You only have to make sure that the cookies from the browser match the "user-agent" setting used by gallery-dl, like this for example:

My browser cookies for the API site, in Netscape format, appear to consist of three cookies. There is no user-agent equivalent?

# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This file was generated by Cookie-Editor
thebarchive.com FALSE   /   FALSE   1714316255  foolframe_KmD_search_latest_5   <something containing - board talk text scrape start 2024-1-1 board talk>
thebarchive.com FALSE   /   FALSE   1712495038  foolframe_KmD_csrf_token    <short string of letters/numbs>
#HttpOnly_.thebarchive.com  TRUE    /   TRUE    1743587157  cf_clearance    <long string of letters and numbers>
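
One detail in the file above is that the cf_clearance line starts with `#HttpOnly_`, so a parser that treats every line beginning with `#` as a comment would silently skip it. The following minimal Python sketch reads such a file and reports whether a cf_clearance cookie is present for a host. It assumes tab-separated fields as in the usual Netscape layout, the function names are hypothetical, and it is not gallery-dl's own loader.

```python
HTTPONLY_PREFIX = "#HttpOnly_"


def parse_netscape_cookies(text):
    """Parse cookies.txt content into a list of dicts, keeping HttpOnly lines."""
    cookies = []
    for line in text.splitlines():
        http_only = line.startswith(HTTPONLY_PREFIX)
        if http_only:
            line = line[len(HTTPONLY_PREFIX):]
        elif not line.strip() or line.startswith("#"):
            continue  # blank line or ordinary comment
        fields = line.split("\t")
        if len(fields) != 7:
            continue  # not a cookie line in the expected layout
        domain, _subdomains, path, secure, expires, name, value = fields
        cookies.append({
            "domain": domain,
            "path": path,
            "secure": secure == "TRUE",
            "expires": int(expires) if expires.isdigit() else 0,
            "name": name,
            "value": value,
            "http_only": http_only,
        })
    return cookies


def has_cf_clearance(cookies, host):
    """True if a cf_clearance cookie applies to the given host name."""
    for cookie in cookies:
        domain = cookie["domain"].lstrip(".")
        if cookie["name"] == "cf_clearance" and (
            host == domain or host.endswith("." + domain)
        ):
            return True
    return False
```

Even when the cookie is found, it is only expected to help if the user agent sent by gallery-dl matches the browser that obtained it.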

I have done some brief testing, and it appears I can use other scraping methods, such as ripme, wget and python requests etc to scrape thebarchive without requiring any cookie nonsense. This may be unique to gallery-dl's method of using API?

Hrxn commented 5 months ago

No, the "user-agent" setting of gallery-dl must match the user-agent info of the browser with the cookies.

If you haven't set "user-agent" in your gallery-dl config, make sure to do so, because otherwise you'd be using the built-in default of gallery-dl, which is a recent version of Firefox ESR, but with a linux style. Unless you're using the exact same browser, the exact same version, on the exact same platform, you've got to change the "user-agent" setting.

Edit

https://github.com/mikf/gallery-dl/blob/ef0c90414c1077e42ae17ccec96eb4925d924c55/gallery_dl/extractor/common.py#L328-L331

stubkan commented 5 months ago

Thank you for your patience with me. I have successfully got gallery-dl to download a thread now...

gallery-dl -C cookies-thebarchive-com.txt -o "user-agent=Mozilla/5.0 (X11; Linux x86_64; rv:123.0) Firefox/123.0" firefox https://thebarchive.com/b/thread/739772332/
[1/2] firefox
[gallery-dl][error] Unsupported URL 'firefox'
[2/2] https://thebarchive.com/b/thread/739772332/
./gallery-dl/thebarchive/b/739772332 Won't you loo…t her smuggy face.../1500782886885 Smug face 0.png
./gallery-dl/thebarchive/b/739772332 Won't you loo….Look t her smuggy face.../1500783064989 image.jpg
./gallery-dl/thebarchive/b/739772332 Won't you loo…t her smuggy face.../1500783269175 Smug face 2.png
./gallery-dl/thebarchive/b/739772332 Won't you loo…t her smuggy face.../1500783400678 Smug face 3.png
./gallery-dl/thebarchive/b/739772332 Won't you loo…t her smuggy face.../1500783588112 Smug face 4.png
./gallery-dl/thebarchive/b/739772332 Won't you loo….Look t her smuggy face.../1500783596174 image.jpg
./gallery-dl/thebarchive/b/739772332 Won't you loo…t her smuggy face.../1500783824410 Smug face 6.png
./gallery-dl/thebarchive/b/739772332 Won't you loo….Look t her smuggy face.../1500784125831 image.jpg

I did accidentally leave in 'firefox' but still...

stubkan commented 5 months ago

Since one must now watch for cookie expiration and manually create cookie files as well as clicking the human verification button... It is kind of a troublesome solution... Should I leave this issue open?

I think it is possible to come up with a solution that does not require cookies, since alternative downloaders do work without requiring them?
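
On the expiration concern: the fifth field of each line in an exported cookie file is a Unix timestamp, so checking it before a run is straightforward. This is a minimal Python sketch with a hypothetical function name. The timestamp in a cookie file is only an upper bound, and whether a site stops honoring the cookie sooner is not visible from the file.

```python
from datetime import datetime, timezone


def expiry_status(expires_epoch, now=None):
    """Return (expiry as a UTC datetime, time remaining as a timedelta)."""
    if now is None:
        now = datetime.now(timezone.utc)
    expires = datetime.fromtimestamp(expires_epoch, tz=timezone.utc)
    return expires, expires - now


if __name__ == "__main__":
    # Expiry value of the cf_clearance cookie quoted earlier in this thread.
    expires, remaining = expiry_status(1743587157)
    state = "expired" if remaining.total_seconds() <= 0 else "still valid"
    print(expires.isoformat(), state)
```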

stubkan commented 2 months ago

@mikf @Hrxn - Hope it's OK to @ you; I thought that since this issue is old, a comment might be missed.

It seems that the situation has changed again. 4plebs used to not require any authentication, but difficulties began last month, with some requests randomly getting blocked. It now appears worse this month: some form of Cloudflare protection is blocking gallery-dl while still allowing browsers to pass without a Cloudflare check.

I can access other FoolFuuka archive sites by visiting _/api/chan/thread/ in Firefox, letting the browser create the Cloudflare cookie, extracting that cookie, and then using it on the command line.

However, trying this method for 4plebs in the browser does not seem to create any cloudflare cookies or validation requests, all browser visits pass successfully without invoking cloudflare. It creates a foolframe_5SU_csrf_token cookie and if you click on 'accept' the cookie conditions on the non api site - it will create a second cookie called foolframe_5SU_cookie_hasConsent. These cookies do not appear to allow gallery-dl access to scraping unfortunately.

Deleting or resetting all cookies to get 4plebs to re-create fresh Cloudflare cookies doesn't seem to do much, since the site does not seem to require Cloudflare authentication at all when using a browser.

I tried combining cookies from another FoolFuuka site (cross-site cookies do seem to work for other archive sites, e.g. using a thebarchive cookie to download from archived.moe), but that doesn't work for 4plebs.

At this point I am stumped. It seems like there is a hard block on gallery-dl's API access, even though the response is a 403 Cloudflare error and the API is free to use via a browser.

   [4plebs][warning] Cloudflare challenge
   [4plebs][error] HttpError: '403 Forbidden' for 'https://archive.4plebs.org/_/api/chan/thread/'

zoekobii commented 1 week ago

Did anyone find a fix for this?