pat310 / google-trends-api

An API layer on top of google trends
https://www.npmjs.com/package/google-trends-api
MIT License

HACK - Login Implementation #55

Closed Dayjo closed 7 years ago

Dayjo commented 7 years ago

This is my hacked implementation of the login. It seems to fetch the cookies OK and then sends them with the subsequent requests; however, it doesn't avoid the rate limit. I still hit the CAPTCHA after maybe 10 or 20 attempts at running the example file.

This is not intended for merge; it's more a starting point for discussing how we should approach this.

I've rewritten it to use the request module again so that it can follow up to 300 HTTP redirects, which was necessary for the login at least. I've also modified examples.js to attempt a login before running through an array of requests; this is a simple way of getting it to hit the rate limit (just give it a long array of terms to search for).
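A minimal sketch of the session handling described above, assuming the `request` module's option names (`jar`, `maxRedirects`, `followAllRedirects`); the login URL and User-Agent below are placeholders, not the exact values from the patch:

```javascript
// Build request options around a shared cookie jar and a high redirect
// ceiling, since the login flow bounces through many redirects.
function buildSessionOptions(url, jar) {
  return {
    url: url,
    jar: jar,                 // shared cookie jar: login cookies ride along on later requests
    followAllRedirects: true, // follow redirects on non-GET requests too
    maxRedirects: 300,        // the login flow needs far more than request's default of 10
    headers: {
      'User-Agent': 'Mozilla/5.0' // present ourselves as a browser (placeholder value)
    }
  };
}

// With the real `request` module this would be used roughly as:
//   const request = require('request');
//   const jar = request.jar();
//   request(buildSessionOptions('https://accounts.google.com/ServiceLogin', jar), callback);
```

The important design point is that the same `jar` object is passed to every subsequent call, so cookies obtained during login are replayed automatically.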

Not sure if @dreyco676 came across a similar issue with the python version? I realise that it's the BeautifulSoup library that handles the request sending using the same session cookies etc, so perhaps there's something else that I'm not sending here.

Dayjo commented 7 years ago

After looking into pytrends, I think I'm hitting the same issue there: the rate limit for my IP address.

I assume that requests where the referrer is Google won't be rate limited, which is why the trends website still works; but if I run the XHR request in a new tab (no referrer) I get the CAPTCHA page. I can fill in the CAPTCHA and everything runs OK again, but how long that lasts I have no idea. I'm wondering if it's even possible to bypass it at all.
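The referrer hypothesis above could be tested with a small options decorator; this is purely illustrative, and the `Referer` value is an assumption, not a verified fix:

```javascript
// Add a Referer header pointing at the trends site, the way the browser's
// own XHR would send it. The URL here is an assumed placeholder.
function withTrendsReferer(options) {
  const headers = Object.assign({}, options.headers, {
    'Referer': 'https://trends.google.com/trends/explore'
  });
  return Object.assign({}, options, { headers: headers });
}
```

Running the same query with and without this header would at least show whether the referrer is what separates the working website traffic from the blocked script traffic.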

I need to wait till I hit the limit again, and then run my original manual fix (https://github.com/pat310/google-trends-api/issues/36) and see if that gets past it (I don't think it will).

Some more investigation is probably needed, but I suppose anything that doesn't have a real API is not going to be reliable.

pat310 commented 7 years ago

@Dayjo At 3-second intervals with a random word each time, I was unable to hit a rate limit. Maybe logging in each time causes an issue, or maybe sending multiple requests at the same time does? Either way I'm not sure, but I'm definitely able to get a lot more requests by logging in. I should have a PR up for this soon. Also, are you running this in the browser somehow? I thought there were CORS issues...
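The pacing described here can be sketched as a strictly sequential loop with a fixed gap between queries, instead of firing requests concurrently. `doQuery` below is a hypothetical stand-in for whatever actually hits the trends endpoint:

```javascript
// Run one query at a time, waiting gapMs between each, and collect results
// in order. Returns a promise for the array of results.
function runThrottled(terms, gapMs, doQuery) {
  return terms.reduce(function (chain, term) {
    return chain.then(function (results) {
      return doQuery(term).then(function (result) {
        results.push(result);
        // pause before the next term to stay under the rate limit
        return new Promise(function (resolve) {
          setTimeout(function () { resolve(results); }, gapMs);
        });
      });
    });
  }, Promise.resolve([]));
}

// Usage sketch: runThrottled(['pizza', 'kale'], 3000, queryTrends)
```

Chaining on the previous promise guarantees at most one in-flight request, which is the property pat310's 3-second test relied on.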

Dayjo commented 7 years ago

@pat310 No just in Terminal on Mac.

Yes, I think the frequency at which I'm running the requests may well be the problem. Unfortunately it's entirely possible I'll need to run 3 or 4 requests at once.

Yeah, I did consider that it should perhaps maintain the login session for any subsequent requests; probably a reasonable idea.

I shall await your PR :)

I am currently running some tests on the rate limit. I ran a request every half second; twice in a row I got up to around 100 requests (50 seconds) before it started giving me the "you have done too many requests" page. I'm now going to try to work out if/when this rate limit times out (obviously I can't make the server fill out a CAPTCHA).
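The measurement loop described above amounts to counting how many requests get through before the first blocked response. A rough sketch, where `doRequest` and `isBlocked` are hypothetical stand-ins (a real probe would also wait the half-second interval between attempts):

```javascript
// Fire requests one after another and resolve with the number that
// succeeded before the first "too many requests" style response.
function probeRateLimit(maxAttempts, doRequest, isBlocked) {
  function attempt(n) {
    if (n >= maxAttempts) return Promise.resolve(n); // never got blocked
    return doRequest(n).then(function (response) {
      if (isBlocked(response)) return n; // blocked: n requests got through
      return attempt(n + 1);
    });
  }
  return attempt(0);
}
```

Repeating the probe after increasing cool-down periods would answer the "if/when does the limit time out" question empirically.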

Dayjo commented 7 years ago

I'm going to close this for now. I still think it's something to keep in mind, but I was unable to get it to make a difference. I'm currently approaching the rate limit in a much more... physical... hacky way (load balancing across multiple IPs).

pat310 commented 7 years ago

@Dayjo Thanks! When I thought I was close to a solution before, it didn't actually seem to make a difference either, and I decided it wasn't worth the security risk of asking people to provide their Google credentials. I saw in thstarshine's fork that he was using a proxy, probably similar to your current approach.

Dayjo commented 7 years ago

@pat310 Oh that's nice; being able to configure it to make requests through a proxy is certainly a good option. You might even want a load-balancing method so that each request can be distributed through a pool of different proxies.
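The pool idea could look like a simple round-robin rotator; the `proxy` field matches what the `request` module accepts, and the proxy URLs are placeholders:

```javascript
// Return a function that decorates request options with the next proxy
// from the pool, cycling round-robin so no single IP absorbs all the traffic.
function makeProxyRotator(proxies) {
  let next = 0;
  return function (options) {
    const proxy = proxies[next % proxies.length];
    next += 1;
    return Object.assign({}, options, { proxy: proxy });
  };
}

// Usage with the real `request` module might look like:
//   const rotate = makeProxyRotator(['http://proxy-a:8080', 'http://proxy-b:8080']);
//   request(rotate({ url: someTrendsUrl }), callback);
```

Round-robin is the simplest distribution; a smarter pool could drop proxies that start returning the CAPTCHA page.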

I'm currently just calling the script from different servers :P