macropusgiganteus / scrappy-web

Project for Technical Test

[Question] Bypass the Google restrictions #13

Open malparty opened 3 weeks ago

malparty commented 3 weeks ago

Issue

I went through the commit that implemented the User-Agent rotation. The idea is interesting and is often the first one tried (with more or less success).

If you had more time for this challenge, what other techniques would you explore and try?

macropusgiganteus commented 3 weeks ago

Thank you for your feedback.

Aside from the User-Agent rotation and random delay, I also tried scraping data from the Google cache by prepending the URL with "http://webcache.googleusercontent.com/search?q=cache:", but it doesn't seem to work.
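
For reference, this is roughly what that attempt looked like. It's only a minimal sketch, not the project's actual code: the target URL and User-Agent value are placeholders, and the raw URL may need escaping before it is appended to the cache prefix.

```ruby
require 'net/http'
require 'uri'

# Cache prefix from the comment above; the target URL and User-Agent are placeholders.
GOOGLE_CACHE_PREFIX = 'http://webcache.googleusercontent.com/search?q=cache:'.freeze

def fetch_from_google_cache(target_url)
  uri = URI.parse("#{GOOGLE_CACHE_PREFIX}#{target_url}")

  # Random delay between requests, as mentioned above.
  sleep(rand(1.0..3.0))

  request = Net::HTTP::Get.new(uri.request_uri, 'User-Agent' => 'Mozilla/5.0 (X11; Linux x86_64)')
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
end

response = fetch_from_google_cache('https://example.com/some-page')
puts response.code
```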

Other techniques I would try might be from this article:

  1. IP-Rotation: Using a proxy can help rotate my IP, but the problem is that it might cost me money. I found these Stack Overflow questions (Rotating IP with selenium and Tor and Change IP address in ruby), so I would try setting up a Tor proxy to rotate my IP; a rough sketch is included after this list. Another problem with using Tor is that it can be very slow.
  2. Headless scraping: I haven't looked into the details of this technique yet, but the relevant library is selenium-webdriver and I might implement it following this article.
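
Here is a rough sketch of the Tor idea from point 1. It assumes Tor is running locally with its default SOCKS (9050) and control (9051) ports, uses the socksify gem for the SOCKS connection, and the control-port password is only a placeholder:

```ruby
require 'socket'
require 'net/http'
require 'uri'
require 'socksify/http' # socksify gem (assumption: Tor is installed and running locally)

TOR_SOCKS_PORT   = 9050 # Tor's default SOCKS port
TOR_CONTROL_PORT = 9051 # Tor's control port (must be enabled in torrc)

# Ask Tor for a new circuit, i.e. a new exit IP. The password is a placeholder and
# has to match HashedControlPassword in torrc. Control-port responses are not
# checked in this sketch, and Tor may take a moment to build the new circuit.
def rotate_ip(password)
  socket = TCPSocket.new('127.0.0.1', TOR_CONTROL_PORT)
  socket.puts(%(AUTHENTICATE "#{password}"))
  socket.puts('SIGNAL NEWNYM')
  socket.puts('QUIT')
  socket.close
end

# Fetch a page through the Tor SOCKS proxy so the request uses the current exit IP.
def fetch_via_tor(url)
  uri = URI.parse(url)
  Net::HTTP.SOCKSProxy('127.0.0.1', TOR_SOCKS_PORT)
           .start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.get(uri.request_uri)
  end
end

rotate_ip('my_control_password')
puts fetch_via_tor('https://check.torproject.org/').code
```
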
malparty commented 3 weeks ago

Thanks for your reply.

In your opinion, what are the pros and cons of 1 and 2? (They each have different advantages and drawbacks, so knowing them would help choose which approach to use.)

macropusgiganteus commented 3 weeks ago

Thank you for the question.

  1. IP-Rotation
    • Pros:
      • Reliability: it can switch to a new IP when the current one is blocked, so scraping can keep running.
      • Scalability: it can distribute workload across multiple IP addresses.
    • Cons:
      • Cost: using multiple IP addresses or proxy services can incur additional costs.
      • Complexity: it requires additional setup and management for the proxy server which can increase the complexity of the application.
      • Performance: over-rotation of IP addresses can lead to slower scraping speeds and increased resource usage.
  2. Headless scraping
    • Pros:
      • Fast: it consumes fewer resources per page and loads quickly, resulting in a faster overall process.
      • Scraping dynamic content: it can extract data from dynamic pages or Single Page Applications (SPAs).
    • Cons:
      • Hard to debug: we'll have to review and debug the HTML manually whenever the website's structure changes.
      • Maintenance: it requires updates and maintenance to keep up with web technology changes.

If I had to choose one of them, I would try implementing headless browser scraping first, since it doesn't require setting up services outside of the application and doesn't incur any additional costs.
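
As a minimal sketch of what that could look like with headless Chrome and the selenium-webdriver gem (the search URL, User-Agent string, and h3 selector are placeholders, not the project's real parsing logic):

```ruby
require 'selenium-webdriver'

# Headless Chrome via the selenium-webdriver gem; URL, User-Agent, and selector are placeholders.
options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless')
options.add_argument('--user-agent=Mozilla/5.0 (X11; Linux x86_64)') # reuse the UA rotation idea

driver = Selenium::WebDriver.for(:chrome, options: options)
begin
  driver.navigate.to 'https://www.google.com/search?q=ruby+web+scraping'

  # JavaScript has already executed here, so dynamically rendered results are available.
  driver.find_elements(css: 'h3').each { |heading| puts heading.text }
ensure
  driver.quit
end
```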

References:

  1. Level Up Your Web Scraping: Choose the Best Headless Browser for Your Needs (2023)
  2. What is Headless Browser and Headless Browser Testing?
  3. Scraping using Headless Browsers
  4. What Is a Headless Browser and Best Ones for Web Scraping