thiennc1107 / search-analyzer


[Question] What is the plan to workaround Google rate limiting #4

Open longnd opened 7 months ago

longnd commented 7 months ago

I'm aware that scraping the search results from Google has not been implemented yet. However, if you had more time, how would you implement it, and how would you work around Google rate limiting?

I'm also aware that you previously raised a concern via email:

I realize the Google search result has changed a lot since last time I crawled, which I believe makes the challenge much more difficult than it used to be :-(. I don't know if the one who gave the challenge is aware of this change yet.

Since you didn't provide more details, it is hard to know which issue you faced, but we didn't notice significant changes that prevented a possible solution to extract the search result.

thiennc1107 commented 7 months ago

Hi, about the update on the Google search result page: it seems Google is currently using some form of server side rendering JS framework instead of returning plain HTML like before. You can inspect the returned page in the developer tools: [image] The script section is very dense and the page cannot be opened in preview mode, which hints that the page is rendered dynamically using JavaScript. The other problem is that, since the page is rendered dynamically, the class names and IDs also seem to be generated with random values, which makes it hard to find a pattern or structure: [image] The page data loaded when scrolling down is also not sent through a JSON API but in some plain-text structure: [image] I think the above problems make implementing the crawler much more difficult than it used to be.

As for working around Google rate limiting, I think I'd start with a queueing mechanism backed by Postgres or Redis, crawl in batches, and add a random (jittered) delay between batches. I'd also implement progress tracking for each keyword on the dashboard.
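
The batching-with-jitter idea could be sketched roughly as below. This is only an illustration under assumptions: the queue is an in-memory `deque` standing in for Postgres or Redis, and `crawl_in_batches`, `jittered_delay`, the `fetch` callback, and the default delays are hypothetical names and values, not part of the repository.

```python
import random
import time
from collections import deque
from typing import Callable, Dict, Iterable, Optional


def jittered_delay(base: float, jitter: float, rng: random.Random) -> float:
    """Return `base` seconds plus a random extra in [0, jitter]."""
    return base + rng.uniform(0.0, jitter)


def crawl_in_batches(
    keywords: Iterable[str],
    fetch: Callable[[str], object],
    batch_size: int = 5,
    base_delay: float = 10.0,
    jitter: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> Dict[str, str]:
    """Process keywords batch by batch, pausing a jittered delay between batches.

    Returns a per-keyword status ("done" or "failed") that a dashboard
    could display as progress.
    """
    rng = rng or random.Random()
    queue = deque(keywords)  # assumption: in-memory stand-in for a Postgres/Redis queue
    status: Dict[str, str] = {}
    while queue:
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        for keyword in batch:
            try:
                fetch(keyword)
                status[keyword] = "done"
            except Exception:
                # A failed keyword is only recorded here; retry policy is left out.
                status[keyword] = "failed"
        if queue:  # no need to wait after the last batch
            sleep(jittered_delay(base_delay, jitter, rng))
    return status
```

Passing `sleep` and `rng` as parameters keeps the delay logic testable without actually waiting; a persistent queue would additionally need to store the status so an interrupted crawl can resume.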

longnd commented 7 months ago

Hi, about the update on the Google search result page: it seems Google is currently using some form of server side rendering JS framework instead of returning plain HTML like before. You can inspect the returned page in the developer tools: [image]

Have you tried saving the response for this page to an HTML file and then opening it in a browser to verify that assumption? :) In fact, the result renders as expected.

A simple way to get the search results is to send an HTTP request with a proper request path and user agent, then extract the results from the response. A simple curl command also works:

curl -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3" "https://www.google.com/search?q=golang"
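
For reference, the same request could be made from code. The following is a minimal sketch using only the Python standard library; `build_search_request` and `fetch_search_page` are hypothetical helper names, and the user agent is copied from the curl command above. Whether Google answers with a usable page still depends on the client's IP and request rate.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Same user agent string as in the curl command.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
)


def build_search_request(keyword: str) -> Request:
    """Build a GET request for the search page of `keyword`."""
    query = urlencode({"q": keyword})  # URL-encodes spaces and special characters
    return Request(
        f"https://www.google.com/search?{query}",
        headers={"User-Agent": USER_AGENT},
    )


def fetch_search_page(keyword: str, timeout: float = 10.0) -> str:
    """Send the request and return the response body as text."""
    with urlopen(build_search_request(keyword), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_search_page("golang")` and writing the returned string to a `.html` file is one way to check how the response renders in a browser.
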

thiennc1107 commented 7 months ago

Thanks for your insight. It seems my network's IP range was what caused the problem. Before changing to a mobile hotspot: [image] After changing to a mobile hotspot: [image] Also, I think the class name problem still persists. Thanks for the advice.
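
One possible way around unstable class names is to match on tag structure instead. The sketch below assumes, without confirmation from this thread, that each result title is an `<h3>` nested inside an `<a href=...>`; the `ResultLinkParser` class and `extract_results` function are hypothetical names, and the sample HTML is made up for illustration.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class ResultLinkParser(HTMLParser):
    """Collect (title, href) for every <a href=...> that contains an <h3>."""

    def __init__(self) -> None:
        super().__init__()
        self.results: List[Tuple[str, str]] = []
        self._href: Optional[str] = None  # href of the <a> currently open, if any
        self._in_h3 = False
        self._title: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
        elif tag == "h3" and self._href:
            # Only headings inside a link are treated as result titles.
            self._in_h3 = True
            self._title = []

    def handle_data(self, data):
        if self._in_h3:
            self._title.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._in_h3:
            self._in_h3 = False
            title = "".join(self._title).strip()
            if title:
                self.results.append((title, self._href))
        elif tag == "a":
            self._href = None


def extract_results(html: str) -> List[Tuple[str, str]]:
    """Return (title, href) pairs found in `html`, ignoring class names and IDs."""
    parser = ResultLinkParser()
    parser.feed(html)
    parser.close()
    return parser.results
```

Because the parser never reads `class` or `id` attributes, randomized names do not affect it; it would break only if the assumed `<a>`/`<h3>` nesting changes.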