justcoding121 / titanium-web-proxy

A cross-platform asynchronous HTTP(S) proxy server in C#.
MIT License

Rate Limited Suggestions #836

Open STRATZ-Ken opened 3 years ago

STRATZ-Ken commented 3 years ago

Hello all, first off great product. Well documented, well laid out. Just works.

I am trying to figure out an ongoing issue which I cannot track down. This may not be a Titanium Web Proxy issue at all, but maybe I can be pointed in the right direction, as I am running out of ideas.

My project is based on web scraping with Selenium in C#. To avoid detection, I have a lot of IPv4 addresses which I rotate at the firewall level. (If you're interested in outbound NAT, check out pfSense.) Basically, every outgoing connection socket gets put on a random IP address.

I wrote a basic script in C# that calls https://api.ipify.org?format=json to verify this information. I can see that my IP is changing every request. Cool!
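A check along these lines can be sketched as follows. This is not the original C# script, which is not shown in the thread; it is a minimal Python illustration, and the function names `fetch_public_ip`, `sample_ips`, and `rotation_summary` are made up for this example. It assumes ipify's JSON response has the form `{"ip": "..."}`.

```python
import json
from urllib.request import urlopen

IPIFY_URL = "https://api.ipify.org?format=json"


def fetch_public_ip(url=IPIFY_URL, timeout=10):
    """Return the public IP that ipify reports for this request.

    urlopen opens a new connection on every call, which matters here
    because the firewall assigns the random IP per outgoing connection.
    """
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["ip"]


def sample_ips(fetch=fetch_public_ip, attempts=5):
    """Call fetch() `attempts` times and return the reported IPs in order."""
    return [fetch() for _ in range(attempts)]


def rotation_summary(ips):
    """Summarize how many distinct IPs were seen across the samples."""
    distinct = len(set(ips))
    return {"requests": len(ips), "distinct": distinct, "rotating": distinct > 1}
```

Calling `rotation_summary(sample_ips())` once with and once without the proxy in the path gives two summaries to compare. As far as I know, `urllib` picks up `HTTP_PROXY`/`HTTPS_PROXY` from the environment, so that is one way to route the check through 127.0.0.1:8000.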

Now, when I start scraping, there are obviously basic rate limits imposed per IP. My program starts scraping with Chrome, with the proxy set to 127.0.0.1:8000, which points directly at Titanium Web Proxy. I then tested my IP again with ipify, and it worked. However, after 4 hours I got rate limited by the site.

So I attempted to open up Chrome manually and go to the site, it worked. Turned back on the program, still rate limited. Interesting?

My next attempt was to proxy the traffic directly at the Windows 10 level and apply it on my machine. Visited the web site, worked. Went back to my application with Titanium Web Proxy, no go.

It is almost like something is happening with Titanium Web Proxy that the other side can "see" and is blocking me. If I use my same app but proxy through my OS, I have no issue. The second I bring in Titanium Web Proxy and funnel through that, I am getting rate limited.

I then verified this theory last night, and kept Titanium off and ran my application for 14 hours with no issues.

I think it's also important that I explain how my network is designed:

Program > Titanium Web Proxy > Squid Proxy > Firewall (Random IP) > Internet ---- This does not work

Program > Squid Proxy > Firewall (Random IP) > Internet ---- This does work

I had to set it up this way because I use Titanium's random upstream HTTP proxy function. Some of the users of my application use 3rd party proxies, and they get 100 logins for 100 IP addresses. Your application is very easily configurable for this feature.

Any thoughts on what the issue might be with Titanium in this scenario?

justcoding121 commented 3 years ago

Maybe you can compare the request headers sent to Squid with Titanium and without Titanium, to check whether there are any headers that would indicate the original IP.
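One way to do that comparison is to capture the headers of the same request on both paths and diff them. The sketch below is only an illustration in Python, not part of Titanium; `diff_headers` and the `PROXY_REVEALING` list are hypothetical names, and the list is a common-sense selection of headers that proxies are known to add (for example `Via` and `X-Forwarded-For`), not an exhaustive one.

```python
# Header names that commonly reveal a proxy hop or the original client address.
# Assumption: this list is illustrative, not exhaustive.
PROXY_REVEALING = {
    "via",
    "forwarded",
    "x-forwarded-for",
    "x-forwarded-host",
    "x-forwarded-proto",
    "x-real-ip",
    "client-ip",
    "proxy-connection",
}


def diff_headers(direct, proxied):
    """Compare two header dicts captured for the same request.

    `direct` is the capture without Titanium in the path, `proxied` the
    capture with it. Header names are compared case-insensitively.
    """
    d = {name.lower(): value for name, value in direct.items()}
    p = {name.lower(): value for name, value in proxied.items()}

    added = {name: p[name] for name in p.keys() - d.keys()}
    removed = {name: d[name] for name in d.keys() - p.keys()}
    changed = {
        name: (d[name], p[name])
        for name in d.keys() & p.keys()
        if d[name] != p[name]
    }
    # Added headers that are known to expose a proxy or the client IP.
    suspicious = sorted(name for name in added if name in PROXY_REVEALING)

    return {
        "added": added,
        "removed": removed,
        "changed": changed,
        "suspicious": suspicious,
    }
```

If `suspicious` comes back empty and nothing else differs, the headers are probably not what the site is keying on.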

STRATZ-Ken commented 3 years ago

I did not see any headers being added when checking manually, also didn't see any headers when testing with a client, or even in the code.

STRATZ-Ken commented 3 years ago

So I dug a bit more. It seems like web hosts will limit the client once they see Selenium Chrome with --proxy-server=127.0.0.1:8000. Really has little to do with Titanium.

So I guess I'll revise my question: is it possible to route traffic to Titanium without using this feature of Chrome?

justcoding121 commented 3 years ago

So if it's a request header added by Selenium Chrome, you can remove that header inside the request handler on Titanium proxy.

STRATZ-Ken commented 3 years ago

I don't see it being a header. Systems like Cloudflare are able to detect specific settings by running scripts on your machine. That is why they say "Wait 5 Seconds for us to detect". My guess is that some hosts do this as well and start limiting a specific proxy setting after so many requests.

Is there any way to capture the HTTP traffic in my web app and force it through Titanium?

justcoding121 commented 3 years ago

I am a little unclear on what you meant by "they see Selenium Chrome with --proxy-server=127.0.0.1:8000".

How do they see that? I assumed that information was added as a header to all requests by the Selenium driver.

justcoding121 commented 3 years ago

Instead of telling Selenium Chrome to use Titanium proxy, maybe you can set Titanium as the system proxy. Then all requests from Chrome will go through the proxy automatically.