php1301 / NBTechInterview


[Feature] Processing scraping of keywords asynchronously #5

Open · olivierobert opened 1 year ago

olivierobert commented 1 year ago

Issue

Upon uploading the file with keywords, the keywords are processed synchronously in a loop, i.e., the code loops through each keyword, scrapes the Google page for its content, then inserts records into the database:

https://github.com/php1301/NBTechInterview/blob/26e25361c17bf4bab26e3b457be1d8d3d3e4e1ce/pages/api/search/index.ts#L27-L43

However, scraping Google is time-consuming and error-prone. In the current implementation, even when offloading the scraping to a Lambda function, the scraping of keywords is synchronous, which can lead to a long wait for users. In addition, if any error happens, the whole list of keywords fails.

Expected

The keywords should be inserted into the database right after the file is uploaded. There should be a status attribute to track the scraping status. Then, there should be a process that triggers the scraping of each keyword separately. The scraping outcome of one keyword should not affect the outcome of the others. As a result, some keywords could be processed successfully while others could fail. That is okay :-)
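A minimal sketch of what that could look like, with a hypothetical `KeywordRecord` model and upload handler (all names and the `db` interface are illustrative, not from the repo):

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical status values tracking each keyword's scraping lifecycle.
type ScrapeStatus = "pending" | "processing" | "succeeded" | "failed";

interface KeywordRecord {
  id: string;
  keyword: string;
  status: ScrapeStatus;
  errorMessage?: string; // set when this keyword's scrape fails
}

// On upload: persist every keyword as "pending" first; a separate process
// later triggers one scrape per record, so one failure cannot sink the batch.
async function handleUpload(
  keywords: string[],
  db: { insert(record: KeywordRecord): Promise<void> }
) {
  for (const keyword of keywords) {
    await db.insert({ id: randomUUID(), keyword, status: "pending" });
  }
}
```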

php1301 commented 1 year ago

Will try to implement asynchronous execution there. At first glance, I intended to use Lambda asynchronous invocation, but from what I've tried, all concurrent invocations share the same instance for execution, so I will see what I can do. Thanks
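For reference, asynchronous Lambda invocation means passing `InvocationType: "Event"`; a minimal sketch with the AWS SDK v3 (the region and function name are placeholders):

```typescript
import { LambdaClient, InvokeCommand } from "@aws-sdk/client-lambda";

const lambda = new LambdaClient({ region: "us-east-1" }); // region is illustrative

// InvocationType "Event" queues the invocation and returns immediately
// (HTTP 202), so each keyword can be scraped without blocking the caller.
async function invokeScraperAsync(keyword: string) {
  await lambda.send(
    new InvokeCommand({
      FunctionName: "scrape-keyword", // placeholder function name
      InvocationType: "Event",
      Payload: Buffer.from(JSON.stringify({ keyword })),
    })
  );
}
```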

php1301 commented 1 year ago

Above is my best effort at refactoring the code to be more asynchronous. Some problems appeared along the way; here are my solutions:

  1. The serverless logic for rotating the Lambda proxy: basically, I update the function configuration (e.g., env variables), but when processing in parallel with just one Lambda function, the function is sometimes in the `InProgress` state and not ready for invocation. Solution: I block for a while with a `while` loop and a status flag fetched via `aws-sdk` (see the first sketch after this list).
  2. Timeout for API Gateway: API Gateway has a maximum timeout of 29 seconds, which is not suitable for this parallel processing even in the background. For long-running workloads like this, I switched to direct Lambda function invocation, which raises the timeout to a maximum of 15 minutes.
  3. jsonblob returning 524 errors: this is used for generating a link to view `html_code`. I can't really control this, and it happens inconsistently. In an actual project, we might run a dedicated service for this or even render a mini `<canvas>` preview.
  4. `Promise.allSettled` for ignoring the failed keywords and filtering them out in the UI (I added a loading indicator too); see the second sketch after this list.
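A rough sketch of the wait described in item 1, assuming AWS SDK v3 (the region, function name, and polling delay are illustrative):

```typescript
import {
  LambdaClient,
  GetFunctionConfigurationCommand,
} from "@aws-sdk/client-lambda";

const lambda = new LambdaClient({ region: "us-east-1" }); // region is illustrative

// After updating the function's env variables (proxy rotation), poll
// LastUpdateStatus until the update is applied, so the next invocation
// does not hit a function that is still "InProgress".
async function waitUntilFunctionReady(functionName: string) {
  for (;;) {
    const { LastUpdateStatus } = await lambda.send(
      new GetFunctionConfigurationCommand({ FunctionName: functionName })
    );
    if (LastUpdateStatus !== "InProgress") return;
    await new Promise((resolve) => setTimeout(resolve, 1000)); // brief back-off
  }
}
```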
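And item 4 in sketch form: `Promise.allSettled` lets failed keywords be filtered out without aborting the whole batch (the `scrapeKeyword` helper is a stand-in for the real scraping logic):

```typescript
// Assumed helper: scrapes one keyword (stubbed here for the sketch).
async function scrapeKeyword(
  keyword: string
): Promise<{ keyword: string; html: string }> {
  const res = await fetch(
    `https://www.google.com/search?q=${encodeURIComponent(keyword)}`
  );
  if (!res.ok) throw new Error(`Scrape failed for "${keyword}": ${res.status}`);
  return { keyword, html: await res.text() };
}

async function scrapeAll(keywords: string[]) {
  const settled = await Promise.allSettled(keywords.map(scrapeKeyword));

  // Keep successes for the UI; count failures without failing the batch.
  const succeeded = settled
    .filter(
      (r): r is PromiseFulfilledResult<{ keyword: string; html: string }> =>
        r.status === "fulfilled"
    )
    .map((r) => r.value);
  const failedCount = settled.length - succeeded.length;

  return { succeeded, failedCount };
}
```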

From my perspective, I do think scraping is a daily, long-running CRON job, which means I guess we can go a little easy on processing speed and lean on UI/UX touches like a loading indicator for the user experience. Maybe we can boot a cheap, low-end Spot Instance for cost savings and terminate it after running for at most 1/3 to 1/2 of the day.

olivierobert commented 1 year ago

The background processing would require setting up a queue system. https://github.com/OptimalBits/bull is an oft-used solution in Node projects. Lambda could still handle how the scraping is processed, but the application code must handle queuing background jobs.
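A minimal Bull sketch of that idea, assuming a local Redis and one job per keyword (the queue name, Redis URL, and payload shape are illustrative):

```typescript
import Queue from "bull";

// One job per keyword, so each keyword succeeds or fails independently.
const scrapeQueue = new Queue("keyword-scraping", "redis://127.0.0.1:6379");

export async function enqueueKeywords(keywordIds: string[]) {
  await Promise.all(
    keywordIds.map((keywordId) =>
      scrapeQueue.add(
        { keywordId },
        { attempts: 3, backoff: { type: "exponential", delay: 5000 } }
      )
    )
  );
}

// Worker side: Bull calls this for each queued job; Lambda could still
// perform the actual scraping inside the handler.
scrapeQueue.process(async (job) => {
  const { keywordId } = job.data;
  // ...mark as "processing", scrape, then mark "succeeded" or "failed"...
});
```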

Implementation-wise, there could be a long-running worker that checks the queue and runs jobs immediately. Or there could be a CRON job that periodically checks for jobs to run. Either way would work.
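For the CRON variant, a sketch using node-cron (an assumption; any scheduler would do) that periodically picks up pending keywords; the two helpers are hypothetical stand-ins for real DB access and job logic:

```typescript
import cron from "node-cron";

// Hypothetical helpers standing in for real DB access and job logic.
async function fetchPendingKeywords(): Promise<{ id: string; keyword: string }[]> {
  return []; // e.g., SELECT * FROM keywords WHERE status = 'pending'
}
async function runScrapeJob(id: string): Promise<void> {
  // mark "processing", scrape, then mark "succeeded" or "failed"
}

// Every five minutes, pick up whatever is still pending; allSettled keeps
// one failed keyword from aborting the rest of the batch.
cron.schedule("*/5 * * * *", async () => {
  const pending = await fetchPendingKeywords();
  await Promise.allSettled(pending.map((k) => runScrapeJob(k.id)));
});
```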