Open longnd opened 6 months ago
Thanks @longnd for your detailed feedback. However, I would like to debate one point: the statement "if one of the keywords in the chunk crashes, the remaining keywords will not be processed" is not correct. Once a keyword fails to retrieve data, it is saved to the DB with an error status. The object with the error status is returned at https://github.com/tmphu/nimble-be/blob/395b608063698ee5137db89a4f2536161fee673d/src/google-scraper/google-scraper.service.ts#L180. The other keywords continue to be processed normally.
I planned to improve this with a separate background job that picks up error records and re-processes them; however, due to time constraints this was not implemented.
Thank you for the feedback. I am aware that the function has a try...catch block, but it only catches exceptions. What I mean is that if anything makes that function crash, the remaining keywords will not be handled. One potential risk is the Puppeteer instance, which can crash for any reason and break that function. I experienced this while processing a long list of keywords but did not capture it for your reference.
Issue
Asynchronous processing using async function
The submission has the right idea of handling the scraping process asynchronously
https://github.com/tmphu/nimble-be/blob/395b608063698ee5137db89a4f2536161fee673d/src/google-scraper/google-scraper.service.ts#L64-L66
but the implementation, which uses an async function to run the searchAndSaveResult() method asynchronously, is error-prone and has some limitations: if one of the keywords in the chunk crashes, the remaining keywords will not be processed.
A better way is to enqueue each keyword as a separate background job and have a worker pick them up for processing. If a job fails, it can be retried. We can also spin up multiple workers to handle the jobs in parallel.
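To make the suggestion concrete, here is a minimal in-memory sketch of the "one job per keyword, retried on failure, several workers" idea. It is not the submission's code: `Job`, `Handler`, `QueueResult`, `worker` and `processKeywords` are hypothetical names, and the handler stands in for whatever scrapes a single keyword. A real setup would likely use a persistent queue so that jobs survive a process crash, which an in-memory array cannot do.

```typescript
// Hypothetical sketch: one job per keyword, with retries and multiple workers.
// The handler is whatever processes a single keyword (assumed, not from the repo).
type Handler = (keyword: string) => Promise<void>;

interface Job {
  keyword: string;
  attempts: number;
}

interface QueueResult {
  done: string[];
  failed: string[];
}

// A worker keeps pulling jobs from the shared queue until it is empty.
async function worker(
  pending: Job[],
  handler: Handler,
  maxAttempts: number,
  result: QueueResult,
): Promise<void> {
  for (let job = pending.shift(); job !== undefined; job = pending.shift()) {
    job.attempts += 1;
    try {
      await handler(job.keyword);
      result.done.push(job.keyword);
    } catch {
      // A failing job does not stop the worker: it is re-queued for a retry,
      // or marked as failed once the attempt budget is used up.
      if (job.attempts < maxAttempts) {
        pending.push(job);
      } else {
        result.failed.push(job.keyword);
      }
    }
  }
}

// Enqueue every keyword as its own job and run several workers concurrently.
async function processKeywords(
  keywords: string[],
  handler: Handler,
  workerCount = 2,
  maxAttempts = 3,
): Promise<QueueResult> {
  const pending: Job[] = keywords.map((keyword) => ({ keyword, attempts: 0 }));
  const result: QueueResult = { done: [], failed: [] };
  const workers = Array.from({ length: workerCount }, () =>
    worker(pending, handler, maxAttempts, result),
  );
  await Promise.all(workers);
  return result;
}
```

With this shape, a keyword whose handler throws only affects its own job; the other keywords are still picked up, and the `failed` list shows which ones exhausted their retries.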
Puppeteer is resource intensive
The solution of searching on Google and scraping the results using Puppeteer isn't optimal. It requires spinning up a Chrome browser (in headless mode), which consumes server resources.
A simpler solution is to use an HTTP client, e.g. Axios, to make the search request directly and parse the response, e.g. with Cheerio. A simple curl command can demonstrate the idea.
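As a rough TypeScript counterpart to that idea, the sketch below requests the search page over plain HTTP and extracts something from the returned HTML without a browser. Everything in it is an assumption for illustration: `Fetcher`, `buildSearchUrl`, `countLinks` and `scrapeKeyword` are hypothetical names, the HTTP call is injected so that Axios (or any other client) can be plugged in, a naive regular expression stands in for Cheerio, and counting links is just an example of a value to extract.

```typescript
// Hypothetical sketch: scrape a search page over plain HTTP, no headless browser.
// The fetcher is injected so an HTTP client such as Axios can be plugged in.
type Fetcher = (url: string, headers: Record<string, string>) => Promise<string>;

interface PageStats {
  keyword: string;
  totalLinks: number;
  htmlLength: number;
}

// Build the search URL for a keyword, escaping spaces and special characters.
function buildSearchUrl(keyword: string): string {
  return `https://www.google.com/search?q=${encodeURIComponent(keyword)}`;
}

// Naive stand-in for an HTML parser: count anchor tags that carry an href.
function countLinks(html: string): number {
  const matches = html.match(/<a\s[^>]*href=/gi);
  return matches ? matches.length : 0;
}

// Fetch the page for one keyword and derive a few values from the raw HTML.
async function scrapeKeyword(keyword: string, fetchHtml: Fetcher): Promise<PageStats> {
  const html = await fetchHtml(buildSearchUrl(keyword), {
    // Assumed header; a browser-like User-Agent is commonly sent with such requests.
    "User-Agent": "Mozilla/5.0 (compatible; example)",
  });
  return { keyword, totalLinks: countLinks(html), htmlLength: html.length };
}
```

Because no browser is started, each keyword costs one HTTP request and some string processing, which is much lighter on the server than a Puppeteer instance.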