mgmgpyaesonewin / web-crawler-assignment


[Question] Scalability of spider-api #11

Open malparty opened 3 months ago

malparty commented 3 months ago

Using Puppeteer to scrape data is interesting for the stealth approach, but I'm concerned about how scalable this is. Could you share the cost of running it on AWS for 10K keywords per hour (which is still a small amount)?

Can you think of another approach that would consume fewer resources?

mgmgpyaesonewin commented 3 months ago

With the current setup, 10K keywords would take around 3 hours to process. We are running on an EC2 t2.large instance. A test batch of 100 keywords took 1 minute 57 seconds to process, including a random delay of 2 to 4 seconds per request to avoid bot detection. The headless Chromium process uses 32.8% CPU, and a second Chromium instance uses 19.0%. We currently run with a max concurrency of 2. We could increase this to a max concurrency of 4 with two EC2 instances; CPU usage would be around 70% and roughly 1.2 GB of memory would be required. This would cost 2 instances × $0.0928 per hour = $0.1856 per hour.

[Screenshot attached: Screen Shot 2024-03-15 at 11 43 25 PM]
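
To make the setup above concrete, here is a minimal sketch of a worker pool with a max concurrency of 2 and a 2–4 second random delay per keyword. The `processKeyword` helper, the search URL, and the plain-Puppeteer worker loop are assumptions for illustration; the actual spider-api may structure this differently (e.g. with puppeteer-cluster).

```typescript
import puppeteer, { Browser } from "puppeteer";

// Hypothetical keyword scraper: opens a page, runs the search, returns the HTML.
// The real spider-api presumably extracts structured results instead.
async function processKeyword(browser: Browser, keyword: string): Promise<string> {
  const page = await browser.newPage();
  try {
    await page.goto(
      `https://www.google.com/search?q=${encodeURIComponent(keyword)}`,
      { waitUntil: "domcontentloaded" }
    );
    return await page.content();
  } finally {
    await page.close();
  }
}

// 2–4 s random delay between requests, as described above, to reduce bot detection.
const randomDelay = () =>
  new Promise((resolve) => setTimeout(resolve, 2000 + Math.random() * 2000));

// Run keywords through N workers sharing one headless Chromium process
// (max concurrency 2 in the current setup; 4 with two EC2 instances).
async function crawl(keywords: string[], concurrency = 2): Promise<string[]> {
  const browser = await puppeteer.launch({ headless: true });
  const results: string[] = new Array(keywords.length);
  let next = 0;

  const worker = async () => {
    while (next < keywords.length) {
      const i = next++;
      results[i] = await processKeyword(browser, keywords[i]);
      await randomDelay();
    }
  };

  await Promise.all(Array.from({ length: concurrency }, () => worker()));
  await browser.close();
  return results;
}
```

At ~2 minutes per 100 keywords, this kind of loop lines up with the roughly 3-hour estimate for 10K keywords on a single instance.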

To consume fewer resources, we could deploy the spider API with the following setup:

Moving from EC2 to AWS Lambda and running the crawler as a serverless function would eliminate the resources consumed during idle periods, since we would only pay for the time each crawl actually runs.
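
Below is a hedged sketch of what that could look like, assuming puppeteer-core with the @sparticuz/chromium Lambda-compatible Chromium build and one keyword per invocation (fan-out via SQS or a similar queue). The package choice, event shape, and return value are assumptions, not what the repo currently does.

```typescript
import chromium from "@sparticuz/chromium";
import puppeteer from "puppeteer-core";

// Hypothetical Lambda handler: each invocation scrapes one keyword, so we pay
// only for the seconds the crawl runs instead of keeping an EC2 instance idle.
export const handler = async (event: { keyword: string }) => {
  const browser = await puppeteer.launch({
    args: chromium.args,
    defaultViewport: chromium.defaultViewport,
    executablePath: await chromium.executablePath(),
    headless: chromium.headless,
  });

  try {
    const page = await browser.newPage();
    await page.goto(
      `https://www.google.com/search?q=${encodeURIComponent(event.keyword)}`,
      { waitUntil: "domcontentloaded" }
    );
    const html = await page.content();
    // Persist or return the scraped results as needed; length used as a placeholder.
    return { statusCode: 200, body: html.length };
  } finally {
    await browser.close();
  }
};
```

The trade-off is per-invocation cold starts and Lambda's memory/time limits for headless Chromium, versus paying for EC2 capacity even when no keywords are being processed.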