tanaponpiti / google-search


The scraping process can be improved #19

Open longnd opened 4 months ago

longnd commented 4 months ago

Issue

A good decision was made to scrape the keywords concurrently using goroutines, as it speeds up the process. https://github.com/tanaponpiti/google-search/blob/07f6fabc905b388f7d3859a950c95d73a4edc70f/api/boothstrap/scraper.go#L73-L84

However, the entire flow still processes the keyword list synchronously.

In that flow, after the user uploads the CSV file, the request must wait for all scraping processes (handled by multiple goroutines) to complete before returning the response to the user, i.e. it blocks the application from quickly responding to the end user with the uploaded keyword list, rather than leaving the scraping processes to continue executing in the background.
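To make the concern concrete, here is a minimal Go sketch of the two patterns; the names (`handleUploadBlocking`, `scrapeKeyword`, etc.) are hypothetical and not taken from the repository:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// scrapeKeyword stands in for a single Google-search scrape; it is a
// placeholder, not the repository's real implementation.
func scrapeKeyword(keyword string) {
	time.Sleep(200 * time.Millisecond) // simulate network latency
	fmt.Println("scraped:", keyword)
}

// handleUploadBlocking illustrates the concern: the scrapes run
// concurrently, but the handler still waits for all of them before
// it can respond to the user.
func handleUploadBlocking(keywords []string) {
	var wg sync.WaitGroup
	for _, kw := range keywords {
		wg.Add(1)
		go func(kw string) {
			defer wg.Done()
			scrapeKeyword(kw)
		}(kw)
	}
	wg.Wait() // the response is delayed until every scrape finishes
	fmt.Println("response sent after all scraping completed")
}

// handleUploadAsync illustrates the suggested alternative: launch the
// scraping in the background and return the keyword list right away.
func handleUploadAsync(keywords []string) {
	go func() {
		for _, kw := range keywords {
			scrapeKeyword(kw)
		}
	}()
	fmt.Println("response sent immediately; scraping continues in background")
}

func main() {
	keywords := []string{"golang", "goroutine", "scraper"}
	handleUploadBlocking(keywords)
	handleUploadAsync(keywords)
	time.Sleep(time.Second) // give the background work time to finish in this demo
}
```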

Besides that, goroutines in this case have other limitations:

One of the possible improvements is to handle the scraping process asynchronously by:

Expected

tanaponpiti commented 4 months ago

I'm not quite sure if I understand your concern correctly. But what you suggest in "One of the possible improvements is to handle the scraping process asynchronously by:" is exactly how I believe my application works right now.

So when a user uploads their CSV to the server, it synchronously checks whether each keyword already exists as a job in Pending status. It then spawns a goroutine to queue this list of search jobs in the background of the application and returns those pending keywords to the user immediately. As you can see while using the website, if you upload 100 keywords, they will show as PENDING in the table and will later change to COMPLETED so the user can see their results.

These keywords, even when uploaded concurrently by multiple users, will continue to be processed in https://github.com/tanaponpiti/google-search/blob/07f6fabc905b388f7d3859a950c95d73a4edc70f/api/boothstrap/scraper.go#L73 It also limits the number of concurrent scrapes.
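For reference, a minimal sketch of the flow described above, assuming an in-memory job store and a buffered-channel semaphore for the concurrency limit; all names and the limit of 5 are illustrative, not the repository's actual code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Job statuses as described above.
const (
	StatusPending   = "PENDING"
	StatusCompleted = "COMPLETED"
)

// jobStore is a hypothetical in-memory stand-in for the keyword job table.
var (
	mu       sync.Mutex
	jobStore = map[string]string{} // keyword -> status
)

// semaphore limits how many scrapes run at once (the limited-concurrency
// feature mentioned above); the size 5 is an arbitrary example.
var semaphore = make(chan struct{}, 5)

// UploadKeywords mirrors the described flow: mark new keywords as PENDING,
// kick off background scraping, and return the pending list immediately.
func UploadKeywords(keywords []string) []string {
	mu.Lock()
	var pending []string
	for _, kw := range keywords {
		if _, exists := jobStore[kw]; !exists {
			jobStore[kw] = StatusPending
			pending = append(pending, kw)
		}
	}
	mu.Unlock()

	go func(kws []string) {
		for _, kw := range kws {
			semaphore <- struct{}{} // acquire a concurrency slot
			go func(kw string) {
				defer func() { <-semaphore }()     // release the slot
				time.Sleep(100 * time.Millisecond) // placeholder for the real scrape
				mu.Lock()
				jobStore[kw] = StatusCompleted
				mu.Unlock()
			}(kw)
		}
	}(pending)

	return pending // the caller gets the PENDING keywords right away
}

func main() {
	fmt.Println("queued:", UploadKeywords([]string{"golang", "redis", "rabbitmq"}))
	time.Sleep(time.Second)
	mu.Lock()
	fmt.Println("statuses:", jobStore)
	mu.Unlock()
}
```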

longnd commented 4 months ago

I rechecked the implementation: when the controller calls ScrapeFromGoogleSearch() in the Keyword Service, it does not pass an external wait group, and ScrapeFromGoogleSearch() starts the scraping process in a separate goroutine and returns immediately after the goroutine has been launched, so there is no blocking.

I'm sorry for the confusion raised in the comment, as I am not familiar with Golang or goroutines (I have never used them before).

However, I also want to point out the pros & cons of using goroutines vs background jobs in this case:

tanaponpiti commented 4 months ago

Reliability: If the application crashes or restarts, any in-progress goroutines (either in this Keyword Service or boothstrap/scraper.go; the directory should be named bootstrap, not boothstrap) are lost (correct me if I'm wrong). Background jobs offer persistence, meaning if the worker process crashes, the job can be retried or picked up by another worker without losing progress.

You are right about using background jobs. My implementation is only an in-memory job queue, so if the service shuts down, all pending jobs will be lost. Normally, I would use a queue like RabbitMQ to handle the jobs and have an additional consumer subscribe to that queue and perform the scraping process from the jobs only. If I understand correctly, there is a Golang package named gocraft/work that allows us to store job details in Redis and process them as a queue as well (I have no experience with it yet). However, due to time constraints I cut this out of the project scope and put it on the TODO list for future improvement. https://github.com/tanaponpiti/google-search/blob/07f6fabc905b388f7d3859a950c95d73a4edc70f/README.md?plain=1#L103
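A minimal sketch of what a Redis-backed queue could look like, based on gocraft/work's documented Enqueuer/WorkerPool API; the namespace, job name, Redis address, and handler body are illustrative only:

```go
package main

import (
	"fmt"
	"os"
	"os/signal"

	"github.com/gocraft/work"
	"github.com/gomodule/redigo/redis"
)

// redisPool backs the job queue; the address and pool sizes are examples.
var redisPool = &redis.Pool{
	MaxActive: 5,
	MaxIdle:   5,
	Wait:      true,
	Dial: func() (redis.Conn, error) {
		return redis.Dial("tcp", "localhost:6379")
	},
}

// Context carries per-job dependencies; empty here for brevity.
type Context struct{}

// ScrapeKeyword is the worker-side handler; the real scraping logic would
// replace the Println.
func (c *Context) ScrapeKeyword(job *work.Job) error {
	keyword := job.ArgString("keyword")
	if err := job.ArgError(); err != nil {
		return err
	}
	fmt.Println("scraping:", keyword)
	return nil // returning an error lets the job be retried
}

func main() {
	// Producer side: the API service enqueues one job per keyword instead of
	// spawning an in-memory goroutine, so pending jobs survive a restart.
	enqueuer := work.NewEnqueuer("google_search", redisPool)
	if _, err := enqueuer.Enqueue("scrape_keyword", work.Q{"keyword": "golang"}); err != nil {
		fmt.Println("enqueue failed:", err)
	}

	// Consumer side: a worker pool (which could run in a separate process)
	// pulls jobs from Redis and processes them with bounded concurrency.
	pool := work.NewWorkerPool(Context{}, 10, "google_search", redisPool)
	pool.Job("scrape_keyword", (*Context).ScrapeKeyword)
	pool.Start()
	defer pool.Stop()

	// Keep the worker running until Ctrl+C in this demo.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt)
	<-sig
}
```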

Scalability: goroutines are limited by the resources of a single machine, but background jobs can be scaled horizontally across multiple machines or services, making them better for heavy processing loads; i.e. when there are many users on the system, we only need to scale up the background job services, not the API service.

You are also right about this. Currently, the only component in the system that is already scaled is the HTML retriever, which is deployed to Google Cloud Run, so there are multiple Puppeteer instances performing the HTML fetching. However, the rest of the process after that, like HTML information extraction, is not scaled and is fixed to a single instance of the API service. If scalability of the HTML information extraction is a desirable feature, then using RabbitMQ and having additional services subscribe to each job would improve the scalability greatly.
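A minimal sketch of such a consumer service, using the streadway/amqp client; the queue name, connection URL, and message contents are illustrative. Scaling out then simply means running more instances of this process, and RabbitMQ distributes the jobs among them:

```go
package main

import (
	"log"

	"github.com/streadway/amqp"
)

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal("connect:", err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal("channel:", err)
	}
	defer ch.Close()

	// Durable queue so pending extraction jobs survive a broker restart.
	q, err := ch.QueueDeclare("html_extraction_jobs", true, false, false, false, nil)
	if err != nil {
		log.Fatal("declare queue:", err)
	}

	// Hand this worker one unacknowledged message at a time, so load spreads
	// evenly across all running consumer instances.
	if err := ch.Qos(1, 0, false); err != nil {
		log.Fatal("qos:", err)
	}

	msgs, err := ch.Consume(q.Name, "", false, false, false, false, nil)
	if err != nil {
		log.Fatal("consume:", err)
	}

	for msg := range msgs {
		// The message body would carry the fetched HTML (or a reference to it);
		// the actual extraction logic would go here.
		log.Printf("extracting results from %d bytes of HTML", len(msg.Body))
		msg.Ack(false) // acknowledge only after successful processing
	}
}
```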