mgmgpyaesonewin / web-crawler-assignment


[Question] Scalability of spider-api #11

Open malparty opened 3 months ago

malparty commented 3 months ago

Using Puppeteer to scrape data is interesting for the stealth approach, but I'm concerned about how scalable this is. Could you share the cost of running it on AWS for 10K keywords per hour (which is still a small amount)?

Can you think of another approach that would consume fewer resources?

mgmgpyaesonewin commented 3 months ago

With the current setup, 10K keywords would take around 3 hours to process. We are running on an EC2 t2.large instance. A test batch of 100 keywords took 1 minute 57 seconds to process, including a random delay of 2 to 4 seconds per request to avoid bot detection. The headless Chromium process uses 32.8% CPU, and a second Chromium instance uses 19.0%. We currently run with a max concurrency of 2. We could increase this to a max concurrency of 4 with two EC2 instances; CPU usage would be around 70% and roughly 1.2 GB of memory would be required. This would cost 2 instances × $0.0928 per hour = $0.1856 per hour.

[Screenshot attached: Screen Shot 2024-03-15 at 11 43 25 PM]
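
To make the setup above concrete, here is a minimal sketch of a worker pool with a max concurrency of 2 and a 2–4 second random delay per keyword. The `processKeyword` helper, the search URL, and the plain-Puppeteer worker loop are assumptions for illustration; the actual spider-api may structure this differently (e.g. with puppeteer-cluster).

```typescript
import puppeteer, { Browser } from "puppeteer";

// Hypothetical keyword scraper: opens a page, runs the search, returns the HTML.
// The real spider-api presumably extracts structured results instead.
async function processKeyword(browser: Browser, keyword: string): Promise<string> {
  const page = await browser.newPage();
  try {
    await page.goto(
      `https://www.google.com/search?q=${encodeURIComponent(keyword)}`,
      { waitUntil: "domcontentloaded" }
    );
    return await page.content();
  } finally {
    await page.close();
  }
}

// 2–4 s random delay between requests, as described above, to reduce bot detection.
const randomDelay = () =>
  new Promise((resolve) => setTimeout(resolve, 2000 + Math.random() * 2000));

// Run keywords through N workers sharing one headless Chromium process
// (max concurrency 2 in the current setup; 4 with two EC2 instances).
async function crawl(keywords: string[], concurrency = 2): Promise<string[]> {
  const browser = await puppeteer.launch({ headless: true });
  const results: string[] = new Array(keywords.length);
  let next = 0;

  const worker = async () => {
    while (next < keywords.length) {
      const i = next++;
      results[i] = await processKeyword(browser, keywords[i]);
      await randomDelay();
    }
  };

  await Promise.all(Array.from({ length: concurrency }, () => worker()));
  await browser.close();
  return results;
}
```

At ~2 minutes per 100 keywords, this kind of loop lines up with the roughly 3-hour estimate for 10K keywords on a single instance.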

To consume fewer resources, we could deploy the spider API with the following setup:

Moving from EC2 to AWS Lambda and running the crawler as a serverless function would eliminate the resources consumed during idle periods, since we would only pay for the time each crawl actually runs.
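
Below is a hedged sketch of what that could look like, assuming puppeteer-core with the @sparticuz/chromium Lambda-compatible Chromium build and one keyword per invocation (fan-out via SQS or a similar queue). The package choice, event shape, and return value are assumptions, not what the repo currently does.

```typescript
import chromium from "@sparticuz/chromium";
import puppeteer from "puppeteer-core";

// Hypothetical Lambda handler: each invocation scrapes one keyword, so we pay
// only for the seconds the crawl runs instead of keeping an EC2 instance idle.
export const handler = async (event: { keyword: string }) => {
  const browser = await puppeteer.launch({
    args: chromium.args,
    defaultViewport: chromium.defaultViewport,
    executablePath: await chromium.executablePath(),
    headless: chromium.headless,
  });

  try {
    const page = await browser.newPage();
    await page.goto(
      `https://www.google.com/search?q=${encodeURIComponent(event.keyword)}`,
      { waitUntil: "domcontentloaded" }
    );
    const html = await page.content();
    // Persist or return the scraped results as needed; length used as a placeholder.
    return { statusCode: 200, body: html.length };
  } finally {
    await browser.close();
  }
};
```

The trade-off is per-invocation cold starts and Lambda's memory/time limits for headless Chromium, versus paying for EC2 capacity even when no keywords are being processed.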