tanaponpiti / google-search


How does it work

components.png

The whole system needs six components to function properly.

Three of them (web, api, and html-retriever) are implemented in this repository.

The problem

Understandably, Google Search doesn't want its data to be scraped. If too many requests are sent to Google Search, it blocks our IP and demands that a CAPTCHA be solved first.

I tried using the Tor network to switch to another IP. However, Google seems to have blacklisted almost all Tor node IPs and still requires a CAPTCHA to be solved.

So I had to find a way for my HTML Retriever service to obtain a new "clean" IP and get around the Google block.

The Solution

Normally, if I had a budget, I would use an IP proxy pool service and request new "clean" IPs. However, I didn't want to spend more than the Google Cloud free trial credit. Since I have to deploy this system on Google Cloud, my solution is to run the HTML Retriever service on Google Cloud Run (serverless).

Google Cloud Run is a serverless service from Google that runs Docker containers on demand. It starts up when a request for the service arrives and scales out to multiple instances based on usage. I happened to find out that each instance has a different set of egress IPs. Therefore, I implemented the HTML Retriever service to terminate itself every time its IP is blocked by Google Search. When Google Cloud Run restarts it, it automatically gets a different IP and can continue scraping.
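The "terminate when blocked" idea can be sketched roughly as follows. This is a minimal Python illustration, not the repository's actual code: the function names `is_blocked` and `terminate_if_blocked` are hypothetical. The block signals are assumptions based on how Google's block page commonly appears: an HTTP 429 status, a redirect to a `/sorry/` URL, or an "unusual traffic" message in the body.

```python
import sys
from typing import Callable

# Assumed block signals (not taken from the repository).
BLOCK_STATUS_CODE = 429              # "Too Many Requests"
BLOCK_URL_MARKER = "/sorry/"         # path of the CAPTCHA interstitial
BLOCK_BODY_MARKER = "unusual traffic"  # phrase shown on the block page


def is_blocked(status_code: int, final_url: str, body: str) -> bool:
    """Return True if a response looks like a Google block/CAPTCHA page."""
    if status_code == BLOCK_STATUS_CODE:
        return True
    if BLOCK_URL_MARKER in final_url:
        return True
    return BLOCK_BODY_MARKER in body.lower()


def terminate_if_blocked(
    status_code: int,
    final_url: str,
    body: str,
    exit_fn: Callable[[int], object] = sys.exit,
) -> bool:
    """Exit the process with a non-zero code when the response is a block.

    Once the container process has exited, the serverless platform is
    expected to start a fresh instance for later requests. exit_fn is
    injectable so the logic can be tested without killing the interpreter.
    """
    if is_blocked(status_code, final_url, body):
        exit_fn(1)
        return True
    return False
```

In a real service this check would run on every response fetched from Google Search, before the HTML is returned to the caller.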

Finally, our system diagram will look like this.

deployment.png

Demo Instance

Until my Google Cloud free trial credit runs out, you can access the online demo here.

Deployment Guide

Prerequisites

Before you begin, ensure you have the following installed:

Configuration

I have already built all of the required services as Docker images.

There are two deployment strategies available: using an external HTML retriever service hosted on Google Cloud Run for flexible IP rotation (docker-compose.yml) and a standalone version for local deployment (standalone-docker-compose.yml).

Environment Variables Explained

Web Service

API Service

Postgres and Redis Services

Differences between Compose Files

Deployment

To deploy the application, navigate to the directory containing your desired Docker Compose file and run:

docker-compose -f docker-compose.yml up -d

Replace docker-compose.yml with standalone-docker-compose.yml for local deployment.
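If you prefer to script the choice between the two compose files, a small wrapper along these lines could work. It is only a sketch: `build_compose_command` and `deploy` are hypothetical helper names, and it assumes `docker-compose` is on your PATH and that you run it from the directory containing the compose files. The `-f` flag selects the compose file explicitly.

```python
import subprocess
from typing import List

# Compose file names as given in this README.
CLOUD_RUN_COMPOSE_FILE = "docker-compose.yml"
STANDALONE_COMPOSE_FILE = "standalone-docker-compose.yml"


def build_compose_command(standalone: bool = False) -> List[str]:
    """Build the docker-compose command for the chosen deployment strategy."""
    compose_file = STANDALONE_COMPOSE_FILE if standalone else CLOUD_RUN_COMPOSE_FILE
    return ["docker-compose", "-f", compose_file, "up", "-d"]


def deploy(standalone: bool = False) -> None:
    """Start the services in detached mode; raises if docker-compose fails."""
    subprocess.run(build_compose_command(standalone), check=True)
```

Calling `deploy(standalone=True)` would start the local standalone stack, while `deploy()` would start the variant that relies on the external HTML retriever on Google Cloud Run.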

TODO in the future

API