tmphu / nimble-be


Insufficient approach for scraping process #10

Open · longnd opened 6 months ago

longnd commented 6 months ago

Issue

Asynchronous processing using async function

The submission has the right idea to handle the scraping process asynchronously

https://github.com/tmphu/nimble-be/blob/395b608063698ee5137db89a4f2536161fee673d/src/google-scraper/google-scraper.service.ts#L64-L66

but the implementation, which calls the async searchAndSaveResult() method without awaiting it, is error-prone and has some limitations, e.g. if one of the keywords in the chunk crashes, the remaining keywords will not be processed.
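For illustration, a minimal sketch of the difference (the names here are placeholders, not the submission's exact code): a fire-and-forget call discards the returned promise, so a rejection becomes an unhandled rejection, whereas awaiting per-keyword results, e.g. with Promise.allSettled, keeps one failure from silently affecting the rest.

```typescript
// Hypothetical sketch; names are illustrative, not the submission's code.

// Fire-and-forget: the returned promise is discarded, so a rejection
// becomes an unhandled rejection and can take down the process.
// this.searchAndSaveResult(keywords); // error-prone

// Safer: process each keyword independently and collect outcomes,
// so one failure cannot prevent the others from completing.
async function processKeywords(keywords: string[]): Promise<void> {
  const results = await Promise.allSettled(
    keywords.map((keyword) => searchAndSaveResult(keyword)),
  );
  for (const [i, result] of results.entries()) {
    if (result.status === 'rejected') {
      console.error(`keyword "${keywords[i]}" failed:`, result.reason);
    }
  }
}

// Placeholder matching the method referenced in the issue.
declare function searchAndSaveResult(keyword: string): Promise<void>;
```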

Puppeteer is resource intensive

The solution of searching on Google and scraping the results with Puppeteer isn't optimal: it requires spinning up a Chrome browser (in headless mode), which consumes server resources.

A simpler solution is to use an HTTP client, e.g. Axios, to make the search request directly and parse the response, e.g. with Cheerio. A simple cURL command demonstrates the idea:

```sh
curl -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3" https://www.google.com/search\?q\=nodejs
```
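A minimal sketch of that idea in TypeScript, assuming the axios and cheerio packages are installed (the h3 selector is an assumption about Google's current markup and will likely need adjusting over time):

```typescript
import axios from 'axios';
import * as cheerio from 'cheerio';

const USER_AGENT =
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
  '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3';

// Fetch a Google results page over plain HTTP and extract result titles.
async function searchGoogle(query: string): Promise<string[]> {
  const response = await axios.get('https://www.google.com/search', {
    params: { q: query },
    headers: { 'User-Agent': USER_AGENT },
  });
  const $ = cheerio.load(response.data);
  // 'h3' is an assumed selector for result titles; Google's markup changes.
  return $('h3')
    .map((_, el) => $(el).text())
    .get();
}

searchGoogle('nodejs').then((titles) => console.log(titles));
```

This approach avoids launching a headless browser entirely, trading Puppeteer's full page rendering for a single lightweight HTTP request.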
tmphu commented 6 months ago

Thanks @longnd for your detailed feedback. However, I would like to debate one point: "if one of the keywords in the chunk crashes, the remaining keywords will not be processed" is not correct. Once a keyword fails to retrieve data, it is saved to the DB with an error status. The object with the error status is returned at https://github.com/tmphu/nimble-be/blob/395b608063698ee5137db89a4f2536161fee673d/src/google-scraper/google-scraper.service.ts#L180. The other keywords can continue to be processed normally.

I had planned to improve this with a separate background job that picks up the error records and re-processes them; however, due to time constraints this was not implemented.
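For illustration, a minimal sketch of such a retry job, assuming @nestjs/schedule and hypothetical repository/scraper interfaces (none of these names come from the repo):

```typescript
import { Injectable } from '@nestjs/common';
import { Cron, CronExpression } from '@nestjs/schedule';

// Hypothetical shapes; the real entity and service in the repo may differ.
interface KeywordRecord {
  id: number;
  keyword: string;
  status: 'pending' | 'done' | 'error';
}

interface KeywordRepo {
  findByStatus(status: KeywordRecord['status']): Promise<KeywordRecord[]>;
}

interface Scraper {
  searchAndSaveResult(keyword: string): Promise<void>;
}

@Injectable()
export class RetryErroredKeywordsJob {
  constructor(
    private readonly repo: KeywordRepo,
    private readonly scraper: Scraper,
  ) {}

  // Every 10 minutes, pick up records saved with error status and retry.
  @Cron(CronExpression.EVERY_10_MINUTES)
  async retryErrored(): Promise<void> {
    for (const record of await this.repo.findByStatus('error')) {
      try {
        await this.scraper.searchAndSaveResult(record.keyword);
      } catch (err) {
        // Keep the record in error status; the next run will retry it.
        console.error(`retry failed for "${record.keyword}"`, err);
      }
    }
  }
}
```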

longnd commented 6 months ago

Thank you for the feedback. I am aware that the function has a try...catch block, but it only catches exceptions. What I mean is that if anything makes that function crash, the remaining keywords will not be handled. One potential risk is the Puppeteer instance, which can crash for any reason and break that function. I experienced this while processing a long list of keywords but did not capture it for your reference.
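To make the risk concrete, a hedged sketch (assuming puppeteer; the structure is illustrative, not the submission's code): if the browser process dies mid-loop, every subsequent call fails unless each keyword is isolated and the browser is relaunched.

```typescript
import puppeteer, { Browser } from 'puppeteer';

// Process keywords one by one, recovering from a browser crash by
// relaunching Puppeteer instead of letting the whole loop die.
async function scrapeAll(keywords: string[]): Promise<void> {
  let browser: Browser = await puppeteer.launch({ headless: true });
  for (const keyword of keywords) {
    try {
      const page = await browser.newPage();
      await page.goto(
        `https://www.google.com/search?q=${encodeURIComponent(keyword)}`,
      );
      // ... extract and save results here ...
      await page.close();
    } catch (err) {
      console.error(`keyword "${keyword}" failed:`, err);
      // If the browser itself crashed, relaunch it before continuing,
      // so the remaining keywords are still processed.
      if (!browser.isConnected()) {
        browser = await puppeteer.launch({ headless: true });
      }
    }
  }
  await browser.close();
}
```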