tanaponpiti / google-search


The scraping process can be improved #19

Open longnd opened 4 months ago

longnd commented 4 months ago

Issue

A good decision was made to scrape the keywords concurrently using goroutines, as it speeds up the process. https://github.com/tanaponpiti/google-search/blob/07f6fabc905b388f7d3859a950c95d73a4edc70f/api/boothstrap/scraper.go#L73-L84

However, the entire flow still processes the keyword list synchronously.

In that flow, after the user uploads the CSV file, the request must wait for all scraping processes (handled by multiple goroutines) to complete before returning the response to the user, i.e. it blocks the application from quickly responding to the end user with the uploaded keyword list, rather than leaving the scraping processes to continue executing in the background.
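To make the concern concrete, here is a minimal Go sketch of the two patterns; the names (`handleUploadBlocking`, `scrapeKeyword`, etc.) are hypothetical and not taken from the repository:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// scrapeKeyword stands in for a single Google-search scrape; it is a
// placeholder, not the repository's real implementation.
func scrapeKeyword(keyword string) {
	time.Sleep(200 * time.Millisecond) // simulate network latency
	fmt.Println("scraped:", keyword)
}

// handleUploadBlocking illustrates the concern: the scrapes run
// concurrently, but the handler still waits for all of them before
// it can respond to the user.
func handleUploadBlocking(keywords []string) {
	var wg sync.WaitGroup
	for _, kw := range keywords {
		wg.Add(1)
		go func(kw string) {
			defer wg.Done()
			scrapeKeyword(kw)
		}(kw)
	}
	wg.Wait() // the response is delayed until every scrape finishes
	fmt.Println("response sent after all scraping completed")
}

// handleUploadAsync illustrates the suggested alternative: launch the
// scraping in the background and return the keyword list right away.
func handleUploadAsync(keywords []string) {
	go func() {
		for _, kw := range keywords {
			scrapeKeyword(kw)
		}
	}()
	fmt.Println("response sent immediately; scraping continues in background")
}

func main() {
	keywords := []string{"golang", "goroutine", "scraper"}
	handleUploadBlocking(keywords)
	handleUploadAsync(keywords)
	time.Sleep(time.Second) // give the background work time to finish in this demo
}
```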

Besides that, goroutines in this case have other limitations:

One of the possible improvements is to handle the scraping process asynchronously by:

Expected

tanaponpiti commented 4 months ago

I'm not quite sure if I understand your concern correctly. But what you suggest in "One of the possible improvements is to handle the scraping process asynchronously by:" is exactly how I believe my application works right now.

So when a user uploads their CSV to the server, it synchronously checks whether each keyword already exists as a job in Pending status. It then spawns a goroutine to queue this list of search jobs in the background of the application and returns those pending keywords to the user immediately. As you can see while using the website, if you upload 100 keywords, they will show as PENDING in the table and will later change to COMPLETED so the user can see their results.

These keywords, even when uploaded concurrently by multiple users, will continue to be processed in https://github.com/tanaponpiti/google-search/blob/07f6fabc905b388f7d3859a950c95d73a4edc70f/api/boothstrap/scraper.go#L73 It also limits the number of concurrent scrapes.
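For reference, a minimal sketch of the flow described above, assuming an in-memory job store and a buffered-channel semaphore for the concurrency limit; all names and the limit of 5 are illustrative, not the repository's actual code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Job statuses as described above.
const (
	StatusPending   = "PENDING"
	StatusCompleted = "COMPLETED"
)

// jobStore is a hypothetical in-memory stand-in for the keyword job table.
var (
	mu       sync.Mutex
	jobStore = map[string]string{} // keyword -> status
)

// semaphore limits how many scrapes run at once (the limited-concurrency
// feature mentioned above); the size 5 is an arbitrary example.
var semaphore = make(chan struct{}, 5)

// UploadKeywords mirrors the described flow: mark new keywords as PENDING,
// kick off background scraping, and return the pending list immediately.
func UploadKeywords(keywords []string) []string {
	mu.Lock()
	var pending []string
	for _, kw := range keywords {
		if _, exists := jobStore[kw]; !exists {
			jobStore[kw] = StatusPending
			pending = append(pending, kw)
		}
	}
	mu.Unlock()

	go func(kws []string) {
		for _, kw := range kws {
			semaphore <- struct{}{} // acquire a concurrency slot
			go func(kw string) {
				defer func() { <-semaphore }()     // release the slot
				time.Sleep(100 * time.Millisecond) // placeholder for the real scrape
				mu.Lock()
				jobStore[kw] = StatusCompleted
				mu.Unlock()
			}(kw)
		}
	}(pending)

	return pending // the caller gets the PENDING keywords right away
}

func main() {
	fmt.Println("queued:", UploadKeywords([]string{"golang", "redis", "rabbitmq"}))
	time.Sleep(time.Second)
	mu.Lock()
	fmt.Println("statuses:", jobStore)
	mu.Unlock()
}
```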

longnd commented 4 months ago

I rechecked the implementation: when the controller calls ScrapeFromGoogleSearch() in the Keyword Service, it does not pass an external wait group, and ScrapeFromGoogleSearch() starts the scraping process in a separate goroutine and returns immediately after the goroutine has been launched, so there is no blocking.

I'm sorry for the confusion raised in the comment, as I am not familiar with Golang or goroutines (I have never used them before).

However, I also want to point out the pros & cons of using goroutines vs background jobs in this case:

tanaponpiti commented 4 months ago

Reliability: If the application crashes or restarts, any in-progress goroutines (either in this Keyword Service or boothstrap/scraper.go; the directory should be named bootstrap, not boothstrap) are lost (correct me if I'm wrong). Background jobs offer persistence, meaning if the worker process crashes, the job can be retried or picked up by another worker without losing progress.

You are right about using background jobs. My implementation is only an in-memory job queue, so if the service shuts down, all pending jobs will be lost. Normally, I would use a queue like RabbitMQ to handle the jobs and have an additional consumer subscribe to that queue and perform the scraping process from the jobs only. If I understand correctly, there is a Golang package named gocraft/work that allows us to store job details in Redis and process them as a queue as well (I have no experience with it yet). However, due to time constraints I cut this out of the project scope and put it on the TODO list for future improvement. https://github.com/tanaponpiti/google-search/blob/07f6fabc905b388f7d3859a950c95d73a4edc70f/README.md?plain=1#L103
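A minimal sketch of what a Redis-backed queue could look like, based on gocraft/work's documented Enqueuer/WorkerPool API; the namespace, job name, Redis address, and handler body are illustrative only:

```go
package main

import (
	"fmt"
	"os"
	"os/signal"

	"github.com/gocraft/work"
	"github.com/gomodule/redigo/redis"
)

// redisPool backs the job queue; the address and pool sizes are examples.
var redisPool = &redis.Pool{
	MaxActive: 5,
	MaxIdle:   5,
	Wait:      true,
	Dial: func() (redis.Conn, error) {
		return redis.Dial("tcp", "localhost:6379")
	},
}

// Context carries per-job dependencies; empty here for brevity.
type Context struct{}

// ScrapeKeyword is the worker-side handler; the real scraping logic would
// replace the Println.
func (c *Context) ScrapeKeyword(job *work.Job) error {
	keyword := job.ArgString("keyword")
	if err := job.ArgError(); err != nil {
		return err
	}
	fmt.Println("scraping:", keyword)
	return nil // returning an error lets the job be retried
}

func main() {
	// Producer side: the API service enqueues one job per keyword instead of
	// spawning an in-memory goroutine, so pending jobs survive a restart.
	enqueuer := work.NewEnqueuer("google_search", redisPool)
	if _, err := enqueuer.Enqueue("scrape_keyword", work.Q{"keyword": "golang"}); err != nil {
		fmt.Println("enqueue failed:", err)
	}

	// Consumer side: a worker pool (which could run in a separate process)
	// pulls jobs from Redis and processes them with bounded concurrency.
	pool := work.NewWorkerPool(Context{}, 10, "google_search", redisPool)
	pool.Job("scrape_keyword", (*Context).ScrapeKeyword)
	pool.Start()
	defer pool.Stop()

	// Keep the worker running until Ctrl+C in this demo.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt)
	<-sig
}
```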

Scalability: goroutines are limited by the resources of a single machine, but background jobs can be scaled horizontally across multiple machines or services, making them better for heavy processing loads; i.e. when there are many users on the system, we only need to scale up the background job services, not the API service.

You are also right about this. Currently, the only component in the system that is already scaled is the HTML retriever, which is deployed to Google Cloud Run, so there are multiple Puppeteer instances performing the HTML fetching. However, the rest of the process after that, like HTML information extraction, is not scaled and is fixed to a single instance of the API service. If scalability of the HTML information extraction is a desirable feature, then using RabbitMQ and having additional services subscribe to each job would improve the scalability greatly.
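A minimal sketch of such a consumer service, using the streadway/amqp client; the queue name, connection URL, and message contents are illustrative. Scaling out then simply means running more instances of this process, and RabbitMQ distributes the jobs among them:

```go
package main

import (
	"log"

	"github.com/streadway/amqp"
)

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal("connect:", err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal("channel:", err)
	}
	defer ch.Close()

	// Durable queue so pending extraction jobs survive a broker restart.
	q, err := ch.QueueDeclare("html_extraction_jobs", true, false, false, false, nil)
	if err != nil {
		log.Fatal("declare queue:", err)
	}

	// Hand this worker one unacknowledged message at a time, so load spreads
	// evenly across all running consumer instances.
	if err := ch.Qos(1, 0, false); err != nil {
		log.Fatal("qos:", err)
	}

	msgs, err := ch.Consume(q.Name, "", false, false, false, false, nil)
	if err != nil {
		log.Fatal("consume:", err)
	}

	for msg := range msgs {
		// The message body would carry the fetched HTML (or a reference to it);
		// the actual extraction logic would go here.
		log.Printf("extracting results from %d bytes of HTML", len(msg.Body))
		msg.Ack(false) // acknowledge only after successful processing
	}
}
```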