p0ody / ff2ebook

WIP.
http://www.ff2ebook.com

Proxy connections might not properly time out #46

Closed. StarWolf3000 closed this issue 2 years ago

StarWolf3000 commented 2 years ago

For the last 45 minutes I've been trying to download a larger fic from FFN (76 chapters), but the highest chapter count I could get was 53 of 76, before I finally gave up on trying.

Screenshot_20211002-222002

Screenshot_20211002-222332

Even a shorter one (10 chapters) took me 5 or 6 attempts before all 10 were retrieved.

p0ody commented 2 years ago

It's partly because of Selenium slowness and the proxies, but the big problem was that the error handling and the retry mechanism were not working properly.

I've pushed the revision to the live website, so test it again; it should be better. (I haven't committed the changes yet.)

(The fic you used as an example will already be cached because I tested with it.)
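For anyone following along, here is a minimal sketch of what a bounded retry around a flaky proxied fetch can look like. It is not the code running on the site; `fetch_with_retry`, its parameters, and the choice of which exceptions to retry on are all hypothetical, and the actual fetch (with its own per-request timeout) is passed in as a callable.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() up to `attempts` times, backing off between failures."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The point of the hard cap is that a dead proxy ends in a visible error after a known number of tries instead of a request that hangs or silently stops partway through a fic.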

StarWolf3000 commented 2 years ago

Thanks, I will test it soon.

Also every time I made a request, it warned about not finding the "fic type", and when I did not check "Force update", the warning appeared twice.

As for a more convenient way of bypassing Cloudflare's protection in the future, FlareSolverr was often mentioned in the other issues, but I don't think it's a viable solution when running on a public web server, because of the way it works (launching a Chromium instance every time a request is made).

Edit: After further reading, doesn't Selenium do basically the same as FlareSolverr, opening browser instances?

p0ody commented 2 years ago

Yes, I'm running remote Selenium on an old Linux box sitting in the corner of my room. This is a bad way of doing it, I know.

I've done some tests with cloudscraper (Python). It worked well on my local machine but got blocked when I tested on my web host, so I need to do some more testing soon.

bastien8060 commented 2 years ago

@p0ody Trust me, cloudscraper isn't a good solution. I learnt that the hard way. One of my pull requests used cloudscraper; however, it gets blocked too quickly by Cloudflare under heavy load, as with ff2ebook (which has many visitors). You will need good proxies for it.

After you pass many captchas and still get flagged, Cloudflare will save that IP and upgrade the challenge to an image-based captcha where you have to click the right images (trains, bicycles, buses, etc.). Cloudscraper/scrapy doesn't support those. The library will throw an error saying image challenges are not supported in the free version. Of course they don't have a paid version, so it's another way of saying it isn't supported.

Hope this helps !

p0ody commented 2 years ago

The weird thing is that I'm using cloudscraper with the same proxies on both my local machine and my web host. It works about 95% of the time locally but never on my web host. I'm getting the error that you are referring to.

I might need to play around with using headers from my local machine or something.

Anyway, thanks for the heads up.

bastien8060 commented 2 years ago

Note/Edit: I know you mentioned using proxies, but I would check whether your web host's IP leaks through DNS requests.

Maybe somebody already used cloudscraper on your web host, or managed to get your IP banned, before you did. (Maybe your server shares its IP with multiple other servers, or the IP got banned before you bought the server.) AFAIR, Cloudflare shares banned IPs among the domain names it secures, but I could be totally, completely wrong about that. Also, Cloudflare checks things like which organization the IP belongs to, and can detect if you are on a web host.

Otherwise, the cause may be on your system, in which case you can try using something like Docker.

p0ody commented 2 years ago

I think the problem comes from Cloudflare somehow figuring out that it is being run from a server (maybe no display or something).
I've tried running cloudscraper on my home Ubuntu server (no screen) and I'm getting the same error as when it runs on my web host, but it works from my Windows PC (same external IP address as my Ubuntu server).

I'm not experienced in this kind of thing (HTTP headers and all). Do you know if they can get this information?

bastien8060 commented 2 years ago

@p0ody You might want to run a proxy in between to check what information is being sent. Also, if you were using a JS-based solution, I would suspect the viewport size check (which Cloudflare does), but you don't run JS, so never mind that.

I would also try a Python virtual env (if you use Python, otherwise the equivalent for your language), where you use the same version of Python 3 and install all the same modules, so you are guaranteed to run the same version of every module/lib.
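A rough way to do that check without setting up a full proxy is to point the scraper at a tiny local server that echoes back whatever headers it receives, then compare the output from the Windows PC and the Ubuntu server. This is only a sketch using the standard library; `HeaderEchoHandler` and `start_echo_server` are made-up names.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HeaderEchoHandler(BaseHTTPRequestHandler):
    """Reply to any GET with the request headers as JSON."""

    def do_GET(self):
        body = json.dumps(dict(self.headers.items()), indent=2).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet


def start_echo_server(host="127.0.0.1", port=0):
    """Start the echo server in a background thread; port 0 picks a free port."""
    server = HTTPServer((host, port), HeaderEchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

After `server = start_echo_server()`, the chosen port is `server.server_address[1]`; request `http://127.0.0.1:<port>/` with the scraper on each machine and diff the two JSON outputs. Note that this only shows HTTP headers, so it will not reveal differences below the HTTP level, such as in the TLS handshake.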

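pip install virtualenv                  # install the virtualenv tool
virtualenv mypython                     # create the environment in ./mypython
source mypython/bin/activate            # activate it in the current shell
pip install -r $PathToRequirement.txt   # install the pinned modules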

This will make sure you have the same workspace as on your Windows machine. That also reminds me: I had the same issue, and the problem for me was the Python version (run python -V to check). You should check that. On GNU/Linux, other installed versions of Python can be run by changing the command name. Run ls /usr/bin/pytho* to find all the versions of Python you can call.

Edit: IIRC, cloudscraper worked for me with Python/Anaconda 3.8 on CentOS 6. This might help :)
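To compare the two machines quickly, something like the following could be run on both and the output diffed. It is just a sketch: `environment_snapshot` is a made-up helper, the package names at the bottom are examples rather than the project's actual requirements, and `importlib.metadata` needs Python 3.8 or newer.

```python
import platform
import sys
from importlib import metadata


def environment_snapshot(packages):
    """Return the interpreter version, platform, and versions of the named packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": sys.platform,
        "packages": versions,
    }


if __name__ == "__main__":
    import json

    # Example package names only; list whatever your requirements file pins.
    names = ["cloudscraper", "requests", "urllib3"]
    print(json.dumps(environment_snapshot(names), indent=2))
```

Any field that differs between the working and the failing machine is a candidate to pin in the virtual env before looking at anything Cloudflare-specific.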

bastien8060 commented 2 years ago

@StarWolf3000 FlareSolverr is a Docker image running Selenium. It is just a wrapper library, to my understanding :) ! Edit: plus the reasonably lightweight Docker setup they put together.

LanzCorporalAssWipe commented 2 years ago

Is the site down? All I'm getting is connection timeouts.