tanaponpiti / google-search


Simpler way to perform the search request on Google and fetch the result #20

Open longnd opened 7 months ago

longnd commented 7 months ago

Issue

The application relies on Puppeteer instances to perform the searches and fetch the HTML response. Multiple instances are spun up to speed up the process and to work around Google's rate limiting, as each instance has a different IP.

But there is room for improvement in the chosen solution:

tanaponpiti commented 7 months ago

Puppeteer is slow and resource-intensive for the search request, as it requires running an actual browser in the background. Isn't using an HTTP client to send the request enough? A simple cURL command can demonstrate the idea.

Sadly, that simple cURL command is not enough. It was the first approach I tried when scraping Google Search. However, curl only returns the server-side response. The server-side data only includes the number of links for the given keyword; AdWords results are missing and the "About 4,060,000,000 results (0.28 seconds)" line does not appear. I believe Google Search is isomorphic, and it seems that this information is rendered later on the client side. So I was left with no choice but to scrape it from a headless browser like Puppeteer.

The idea of overcoming Google's rate limiting by rotating the IP of the request is good, but the implementation using multiple Google Cloud Run instances increases the complexity and requires more time to implement. A free proxy solution like https://proxyscrape.com/ would work.

I have had very bad experiences with free proxy IPs, as most of the time they are already blacklisted by many websites. They may work for a while but can fail at any time in production. For me, the only way to go when it comes to scraping is to have multiple IPs that can be rotated on demand. I have had quite a good experience using https://brightdata.com for that purpose. However, I did not want to spend any money during the testing process, so I decided to go with Google Cloud Run and its free tier. The benefit of Google Cloud Run is, as you noted, that it automatically rotates the IP of our scraper for each instance. It also mitigates the problem of Puppeteer consuming a lot of resources, as an instance only holds resources while it has a task to compute, so you are charged a very small amount for the duration of the scraping only.

If rotating the IP is too complex, a simpler workaround is to rotate the user agents used in the requests, simulating that each request is sent from a different client.

Rotating the user agents has failed me on many websites, as we both know it can be spoofed very easily. Let's say we are on the Google Search side and have to protect our website from being scraped: I believe we would definitely not blacklist the user agent but would go directly for the IP instead. During this testing process I did not have much time to reverse engineer Google Search to fully understand how it blocks searches, but I am sure that IP blocking is definitely one of the mechanisms.

longnd commented 7 months ago

However, curl only returns the server-side response. The server-side data only includes the number of links for the given keyword; AdWords results are missing and the "About 4,060,000,000 results (0.28 seconds)" line does not appear

I believe this is inaccurate; please try this cURL command to see the result:

curl -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:122.0) Gecko/20100101 Firefox/122.0" "https://www.google.com/search?q=galaxy+s24" -o result.html
Here is how it looks (screenshot attached).

The same result can be acquired using an HTTP client; https://pkg.go.dev/net/http seems to be one of the options for Golang :) It would be much faster and more lightweight than spinning up a browser instance for each request with Puppeteer.
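To illustrate the point, here is a minimal sketch of that idea in TypeScript (not the repo's actual code; it assumes Node 18+ so the built-in fetch is available, and fetchSearchHtml is a placeholder name), equivalent to the cURL command above:

```typescript
// Minimal sketch: fetch a Google results page with a plain HTTP request
// instead of a Puppeteer-controlled browser (assumes Node 18+ built-in fetch).
import { writeFile } from "node:fs/promises";

const USER_AGENT =
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:122.0) Gecko/20100101 Firefox/122.0";

async function fetchSearchHtml(keyword: string): Promise<string> {
  const url = `https://www.google.com/search?q=${encodeURIComponent(keyword)}`;
  const response = await fetch(url, {
    headers: { "User-Agent": USER_AGENT },
  });
  if (!response.ok) {
    throw new Error(`Google returned HTTP ${response.status}`);
  }
  return response.text();
}

// Usage: save the raw HTML for later parsing, mirroring the cURL example above.
fetchSearchHtml("galaxy s24")
  .then((html) => writeFile("result.html", html))
  .catch(console.error);
```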

I have had very bad experiences with free proxy IPs, as most of the time they are already blacklisted by many websites. For me, the only way to go when it comes to scraping is to have multiple IPs that can be rotated on demand

I agree that rotating the IP is usually a costly solution as we can't rely on free services.

For me, the only way to go when it comes to scraping is to have multiple IPs that can be rotated on demand

I don't think it is the only solution. There are other workarounds we can think of, like rotating the user agent (as seen in the cURL command above) or adding a delay between scraping jobs for each keyword.
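For illustration only (the user-agent list and the 5-second delay below are arbitrary assumptions, not values from this repo), both workarounds could be sketched like this:

```typescript
// Sketch of the two workarounds mentioned above: rotating the User-Agent
// per request and adding a delay between scraping jobs.
const USER_AGENTS = [
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:122.0) Gecko/20100101 Firefox/122.0",
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
];

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function scrapeKeywords(keywords: string[], delayMs = 5_000): Promise<void> {
  for (const [i, keyword] of keywords.entries()) {
    const userAgent = USER_AGENTS[i % USER_AGENTS.length]; // round-robin rotation
    const url = `https://www.google.com/search?q=${encodeURIComponent(keyword)}`;
    const response = await fetch(url, { headers: { "User-Agent": userAgent } });
    console.log(keyword, response.status);
    await sleep(delayMs); // space out requests to reduce the chance of rate limiting
  }
}
```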

I believe we would definitely not blacklist the user agent but would go directly for the IP instead

If I understand correctly, you're also sharing the idea that user agents are unlikely to be blocked, compared to IPs. So it makes more sense to use them instead, no?

tanaponpiti commented 7 months ago

I believe this is inaccurate; please try this cURL command to see the result

Thank you for showing me this. You were right about this cURL command. It seems that Google Search acts differently based on the given user agent. If I had known this before, I would not have used Puppeteer to scrape the data. As you suggested, an HTTP client would be sufficient. I will try to address this issue by implementing an additional feature in html-retriever to support using a Node HTTP library like Axios, so that it requires much lower resource consumption while still having access to the IP rotation feature.
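As a rough sketch of that direction (the HtmlRetriever interface and class name below are hypothetical; the real html-retriever API in this repo may differ), an Axios-based retriever could look like:

```typescript
// Hypothetical sketch only: the real html-retriever interface in this repo is
// not shown in the thread, so the names below are illustrative.
import axios from "axios";

interface HtmlRetriever {
  retrieve(keyword: string): Promise<string>;
}

// Lightweight retriever backed by a plain HTTP client instead of Puppeteer.
class AxiosHtmlRetriever implements HtmlRetriever {
  constructor(private readonly userAgent: string) {}

  async retrieve(keyword: string): Promise<string> {
    const url = `https://www.google.com/search?q=${encodeURIComponent(keyword)}`;
    const response = await axios.get<string>(url, {
      headers: { "User-Agent": this.userAgent },
      responseType: "text",
    });
    return response.data;
  }
}
```

A Puppeteer-based retriever could then sit behind the same interface as a fallback for cases where client-side rendering is actually needed.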

If I understand correctly, you're also sharing the idea that user agents are unlikely to be blocked, compared to IPs. So it makes more sense to use them instead, no?

Oh, what I actually mean is that if the IP got blacklisted (given that this kind of protection exists), Google Search should always redirect to a captcha regardless of the given user agent. So basically, even if we change the user agent to something entirely different, if it is still the same blocked IP, our scraper will still have to solve the captcha to unblock that IP. If I'm not mistaken, Google's error message even says "There is some suspicious activity from {{IP}}...". So I tried to solve the problem by changing the IP instead of the user agent. However, I have not tried changing the user agent yet. If we can bypass the captcha by changing the user agent, then it would definitely be a much preferable solution compared to IP rotation.

adding a delay between scraping jobs for each keyword.

I believe this is also a solution, but for a different problem. Switching to a new IP solves "how to bypass the captcha once you are blocked". Adding a delay between scraping jobs solves "how to not get blocked in the first place".

Changing to a different user agent might work as well (I have not tested it yet).
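To make that distinction concrete, here is a hedged sketch of how a retriever could detect the blocked state at runtime (the /sorry/ URL and the message strings are assumptions about Google's captcha page, and the function names are placeholders):

```typescript
// Sketch: tell the two problems apart at runtime. Being blocked usually shows
// up as a redirect to Google's /sorry/ captcha page; the exact markers below
// are assumptions and should be adjusted to what a real blocked response contains.
function looksBlocked(finalUrl: string, html: string): boolean {
  return (
    finalUrl.includes("/sorry/") ||
    html.includes("unusual traffic") ||
    html.includes("suspicious activity")
  );
}

async function fetchWithBlockCheck(keyword: string, userAgent: string): Promise<string> {
  const url = `https://www.google.com/search?q=${encodeURIComponent(keyword)}`;
  const response = await fetch(url, { headers: { "User-Agent": userAgent } });
  const html = await response.text();
  if (looksBlocked(response.url, html)) {
    // Blocked: delaying further requests will not help at this point; rotate to
    // a new IP (or solve the captcha). Delays only help avoid reaching this state.
    throw new Error("IP appears to be blocked by Google");
  }
  return html;
}
```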