p0ody / ff2ebook

WIP.
http://www.ff2ebook.com

Cloudflare Fix (Bypass) to scrape Fanfiction.net #32

Closed bastien8060 closed 3 years ago

bastien8060 commented 3 years ago

The proxies aren't very fast, because they have to be cycled (i.e. we can't just keep using one that proved to be fast) to keep them from being banned by Cloudflare. However, a lot of work has gone into optimization and caching to speed up requests.

The curl command has been replaced by a self-made curl function in Python that keeps the same cookies and user agent for a given proxy for up to 5 requests, then cycles to the next proxy. It has a timeout of 5 s. If a request exceeds that, or if the IP is banned by Cloudflare, it moves on to the next proxy in the list.
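The rotation described above can be sketched roughly as follows. This is not the code from the pull request: `ProxyState`, `ProxyRotator` and the `fetch` callable are hypothetical names, and only the limits (5 requests per proxy, 5 s timeout) come from the description. The actual network call is left to the caller, so the sketch only shows the bookkeeping.

```python
import itertools
from dataclasses import dataclass, field
from typing import Callable, Dict, List

MAX_REQUESTS_PER_PROXY = 5  # from the description above
TIMEOUT_SECONDS = 5         # from the description above


@dataclass
class ProxyState:
    """Cookies and user agent that stay tied to one proxy (hypothetical type)."""
    proxy: str
    user_agent: str
    cookies: Dict[str, str] = field(default_factory=dict)
    uses: int = 0


class ProxyRotator:
    """Reuse one proxy for a few requests, then move on to the next one."""

    def __init__(self, proxies: List[str], user_agents: List[str],
                 fetch: Callable[[str, ProxyState, float], str]):
        # fetch(url, state, timeout) is supplied by the caller (assumption):
        # it returns the body, or raises on a timeout or a Cloudflare ban.
        self._states = [ProxyState(p, ua)
                        for p, ua in zip(proxies, itertools.cycle(user_agents))]
        self._index = 0
        self._fetch = fetch

    def _advance(self) -> None:
        self._states[self._index].uses = 0
        self._index = (self._index + 1) % len(self._states)

    def get(self, url: str) -> str:
        # Try each proxy at most once per call.
        for _ in range(len(self._states)):
            state = self._states[self._index]
            try:
                body = self._fetch(url, state, TIMEOUT_SECONDS)
            except Exception:
                self._advance()  # timed out or banned: try the next proxy
                continue
            state.uses += 1
            if state.uses >= MAX_REQUESTS_PER_PROXY:
                self._advance()  # quota reached: rotate before the next call
            return body
        raise RuntimeError("every proxy failed for " + url)
```

Keeping the cookies and user agent on the per-proxy state object is what lets a proxy look like the same client across its requests.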

It has headless JavaScript on board (using CloudScraper), which can solve most basic and common Cloudflare puzzles and captchas (e.g. the 5-second wait page).

Requirements

Todo in the future:

Resolved Issues

#31 #30 #27

p0ody commented 3 years ago

So I've uploaded it to http://ff2ebook.com/pr32/ to try it out, and I am getting a 504 Gateway Timeout on ajax.GetFicInfos.php.

Any idea where this might be coming from?

bastien8060 commented 3 years ago

Oh, turns out Cloudflare just recently (after my merge request) updated their "mechanism", if you will. This will now get even harder. I'm so puzzled by how it works now.

Even copying the request headers from a browser that just passed the Cloudflare challenge over to curl, for example, gets blocked. I copy the user agent, cookies... everything. Still blocked. Even replaying the HAR request exported from Chrome's DevTools fails.

My guess is that every request gets sent back the same JavaScript challenge, a very fast one, and that if the browser runs JavaScript and passes that challenge, the page is quickly swapped for the requested content, still via JavaScript. I'll work tomorrow on a possible fix, just for the sake of people wanting to clone the repo.

My idea is to use Chrome's headless JavaScript engine, and/or Chrome's automation frontend, which lets a script control the browser. This will pass the Cloudflare challenge like a normal browser and pass the content back. It even allows user interaction for people running it locally, which means they can even solve captchas. Also, it's compatible with Windows, Linux and macOS.
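A rough sketch of that idea, assuming Selenium with Chrome is the automation layer (the comment does not name a tool). `wait_until_passed` and `fetch_with_chrome` are hypothetical helpers, and detecting the challenge by a page title containing "Just a moment" is an assumption about the interstitial page, not something confirmed in this thread.

```python
import time
from typing import Callable

# Assumption: the challenge page is recognizable by its title.
CHALLENGE_TITLE = "Just a moment"


def wait_until_passed(get_title: Callable[[], str], timeout: float = 15.0,
                      poll: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep,
                      clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll the page title until it no longer looks like the challenge page."""
    deadline = clock() + timeout
    while clock() < deadline:
        if CHALLENGE_TITLE not in get_title():
            return True
        sleep(poll)
    return False


def fetch_with_chrome(url: str, headless: bool = True) -> str:
    """Load url in a real Chrome instance and return the final HTML."""
    # Third-party dependency, imported lazily so the helper above works without it.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    if headless:
        options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        if not wait_until_passed(lambda: driver.title):
            raise TimeoutError("challenge not passed for " + url)
        return driver.page_source
    finally:
        driver.quit()
```

Calling `fetch_with_chrome(url, headless=False)` opens a visible window, which is what would let someone running it locally solve a captcha by hand.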

The downside is that it's going to be very slow... about 12 seconds per chapter: 10 for the Cloudflare bypass and 2 to load the page.

Another solution would be to make use of Chrome/Firefox extensions, but that's tricky because an extension can't use proxies, so we could get users' own IPs banned. It's also tricky because we would technically have access to users' cookies and login details, so that's not the way I would go.

bastien8060 commented 3 years ago

@p0ody Also, in the last commit from my fork that you pushed to https://ff2ebook.com/pr32, I added FicWad and WattPad support.

However, you will need to update the table schema, because the source name (wattpad) is longer than VARCHAR(6). Also, the Wattpad book IDs are too long to be integers (INT), so they need to be VARCHAR too.

Otherwise those values get truncated when saved and don't reflect the real values, resulting in an error on download, since the fanfic is not found.
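To illustrate why the lookup fails, here is a small check of which values fit the old column types. The helper names and the sample ID are made up for the example, and the exact strings stored for each source are an assumption; the only hard fact used is that a signed 32-bit INT tops out at 2147483647.

```python
INT_MAX = 2_147_483_647  # largest value a signed 32-bit INT column can hold


def fits_varchar(value: str, length: int) -> bool:
    """True if value can be stored in a VARCHAR(length) column unchanged."""
    return len(value) <= length


def fits_int(value: str) -> bool:
    """True if value is a non-negative number that fits a signed 32-bit INT."""
    return value.isdigit() and int(value) <= INT_MAX


source = "wattpad"        # 7 characters, one more than VARCHAR(6) allows
book_id = "12345678901"   # hypothetical ID, larger than INT_MAX

print(fits_varchar(source, 6))  # False
print(source[:6])               # "wattpa": what a truncating insert would keep
print(fits_int(book_id))        # False
```

A later query for `wattpad` with the real ID then finds no matching row, because the stored values no longer equal the ones being searched for.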

You can check my fork, where the SQL files have been updated to reflect how the tables need to be.

bastien8060 commented 3 years ago

Also, that would then close issues #24 and #13.