mgmgpyaesonewin / web-crawler-assignment


[Question] Project planning #9

Open malparty opened 3 months ago

malparty commented 3 months ago

I could not find the issues/stories related to the different features to be implemented.

This is not a problem, but I'd be happy to learn more about it: Could you share how the work was prioritized and followed up? How did you split the work and why did you start with the spider-api? :)

mgmgpyaesonewin commented 3 months ago

Hi @malparty,

The work was prioritized based on the most critical items first, followed by their related dependencies. I noted down all the features, the scope of work, and the dependencies of each feature based on the requirements and the given time frame. Once I had mapped those out, I started working on the spider API, since it is the main requirement and the core feature.

To build the spider, I needed to test and decide which approach, language, and framework would give me the best outcome. The crawler is the most important item, so I did some research and analysis before choosing Puppeteer; I also tested https://roach-php.dev/. Scalability, performance, avoiding bot detection, respecting robots.txt, and controlling the crawl frequency and behaviour (user agents, browser viewport, stealth mode) all had to be decided ahead of time. So I chose Puppeteer and started working with it. The rest of the tickets are supporting tools for the project.
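As a rough illustration of that setup, here is a minimal sketch of a Puppeteer crawl configured with a stealth plugin, a custom user agent, and a fixed viewport. The package names (puppeteer-extra, puppeteer-extra-plugin-stealth), the user agent string, and the target URL are assumptions for illustration, not the exact implementation.

```ts
import puppeteer from "puppeteer-extra";
import StealthPlugin from "puppeteer-extra-plugin-stealth";

// The stealth plugin patches common headless-browser fingerprints.
puppeteer.use(StealthPlugin());

async function crawl(url: string): Promise<string> {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Present a realistic user agent and viewport to reduce bot-detection signals.
  await page.setUserAgent(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
  );
  await page.setViewport({ width: 1366, height: 768 });

  await page.goto(url, { waitUntil: "networkidle2" });
  const html = await page.content();

  await browser.close();
  return html;
}

crawl("https://www.google.com/search?q=example").then((html) =>
  console.log(`fetched ${html.length} bytes`)
);
```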

I also split the work not only by feature but also by separate services. The core (our spider API) should be independent and able to scale up accordingly. So I started with Puppeteer in single-browser mode. Once crawling was working, I worked on the bot-detection part. After that, I moved it to cluster mode so that we can crawl multiple items at once. That covers the spider API part.
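For the cluster-mode step, a minimal sketch along these lines (assuming the puppeteer-cluster package; the concurrency settings, keywords, and task body are illustrative) shows how several keywords can be crawled concurrently:

```ts
import { Cluster } from "puppeteer-cluster";

async function main() {
  const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_CONTEXT, // isolated browser contexts
    maxConcurrency: 4,                        // crawl up to 4 keywords at once
  });

  // Each task receives one keyword, loads the results page, and reads the HTML.
  await cluster.task(async ({ page, data: keyword }) => {
    await page.goto(
      `https://www.google.com/search?q=${encodeURIComponent(keyword)}`,
      { waitUntil: "networkidle2" }
    );
    const html = await page.content();
    console.log(`crawled "${keyword}" (${html.length} bytes)`);
  });

  ["web crawler", "puppeteer stealth", "laravel queue"].forEach((kw) =>
    cluster.queue(kw)
  );

  await cluster.idle();
  await cluster.close();
}

main();
```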

Then I designed the flow for sending keywords to crawl and for storing the crawled results. I chose the Laravel framework for the backend API since I needed a queue mechanism, using AWS SQS, for dispatching the keywords. I chose SQS because, instead of calling the Node.js spider API directly, it is much better in terms of performance: SQS queues each request and delivers the keywords to the Puppeteer cluster queue, allowing the system to run asynchronously.
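On the Node side, the consumer could look roughly like this: it long-polls the SQS queue that Laravel pushes keywords into and hands each keyword to the crawler. The queue URL, region, and the enqueueKeyword hook are assumptions for illustration.

```ts
import {
  SQSClient,
  ReceiveMessageCommand,
  DeleteMessageCommand,
} from "@aws-sdk/client-sqs";

const sqs = new SQSClient({ region: "ap-southeast-1" });
const queueUrl =
  "https://sqs.ap-southeast-1.amazonaws.com/123456789012/keywords"; // assumed

// Hypothetical hook into the Puppeteer cluster from the previous sketch.
function enqueueKeyword(keyword: string): void {
  console.log(`queueing keyword for crawling: ${keyword}`);
}

async function poll(): Promise<void> {
  const { Messages } = await sqs.send(
    new ReceiveMessageCommand({
      QueueUrl: queueUrl,
      MaxNumberOfMessages: 10,
      WaitTimeSeconds: 20, // long polling keeps the consumer cheap
    })
  );

  for (const message of Messages ?? []) {
    enqueueKeyword(message.Body ?? "");
    // Delete only after the keyword has been handed off to the crawler queue.
    await sqs.send(
      new DeleteMessageCommand({
        QueueUrl: queueUrl,
        ReceiptHandle: message.ReceiptHandle!,
      })
    );
  }

  setImmediate(poll); // keep polling
}

poll();
```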

After that, I worked on a callback API in Laravel to store the results. I chose a callback rather than pushing results back through AWS SQS because of the payload size: an SQS message payload is limited to 256 KB, which could fail when the cached DOM content is large.
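The callback step could be as simple as an HTTP POST from the crawler to the Laravel endpoint once a keyword finishes. The endpoint URL, the auth header, and the payload shape below are assumptions, not the actual API contract.

```ts
interface CrawlResult {
  keyword: string;
  totalLinks: number;
  html: string; // full cached DOM, potentially far larger than 256 KB
}

async function sendCallback(result: CrawlResult): Promise<void> {
  const response = await fetch("https://api.example.com/api/crawls/callback", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.CALLBACK_TOKEN ?? ""}`,
    },
    body: JSON.stringify(result),
  });

  if (!response.ok) {
    throw new Error(`Callback failed with status ${response.status}`);
  }
}
```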

Once I had set up and wrapped up the core requirements, I moved on to the application features. I used Next.js and Tailwind CSS for the frontend, with the auth session and data storage handled by the Laravel backend API.
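For completeness, a minimal sketch of how the frontend might call the Laravel API while relying on a session cookie for authentication; the endpoint and the cookie-based session flow are assumptions for illustration.

```ts
export async function fetchKeywords(): Promise<unknown> {
  const response = await fetch("https://api.example.com/api/keywords", {
    credentials: "include", // send the Laravel session cookie with the request
    headers: { Accept: "application/json" },
  });

  if (!response.ok) {
    throw new Error(`Request failed: ${response.status}`);
  }
  return response.json();
}
```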

Overall, I started with the spider API since it is the most critical part of the application. :)