indrajithi / tiny-web-crawler

A simple and easy to use web crawler for Python
MIT License

Feature: Add a feature to only crawl the given list of urls #12

Open indrajithi opened 2 weeks ago

indrajithi commented 2 weeks ago
lodenrogue commented 2 weeks ago

Wouldn't that be set by the Spider.max_links value?

indrajithi commented 2 weeks ago

@lodenrogue max_links is basically the maximum number of hops the crawler will make. Let us say we start from github.com as the root URL. In the first crawl we fetch all the links on github.com, and then recursively crawl the links we fetched until the max_links count is reached.

E.g.: say we found three links from the root URL: [URL1, URL2, URL3]. If max_links is set to 2, we will only crawl [URL1, URL2] and fetch the links on those pages.
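The behaviour described above can be sketched as a breadth-first crawl capped by max_links. This is a minimal, hypothetical sketch (not the library's actual implementation); fetch_links stands in for the real fetch-and-extract step so the traversal logic is visible on its own:

```python
from collections import deque
from typing import Callable, Iterable, List

def crawl(root_url: str,
          fetch_links: Callable[[str], Iterable[str]],
          max_links: int) -> List[str]:
    """Breadth-first crawl from root_url, stopping once max_links
    pages have been crawled. fetch_links is a hypothetical stand-in
    for downloading a page and extracting its links."""
    crawled: List[str] = []
    seen = {root_url}
    queue = deque([root_url])
    while queue and len(crawled) < max_links:
        url = queue.popleft()
        crawled.append(url)           # count this page against max_links
        for link in fetch_links(url):
            if link not in seen:      # avoid re-queueing known URLs
                seen.add(link)
                queue.append(link)
    return crawled

# Usage with a fake link graph instead of real HTTP requests:
graph = {"root": ["URL1", "URL2", "URL3"], "URL1": ["URL4"]}
print(crawl("root", lambda u: graph.get(u, []), max_links=2))
```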

For this feature, we expect the crawler to fetch the URLs provided by the user and nothing more. The list of URLs to crawl will be a custom set supplied by the user as input. There will be no root URL, no recursive crawling, and no hops.
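One way the requested feature could look, as a minimal sketch (the function name crawl_urls and the fetch_links helper are assumptions for illustration, not existing API): each user-supplied URL is fetched exactly once, and links found on those pages are recorded but never followed.

```python
from typing import Callable, Dict, Iterable, List

def crawl_urls(url_list: List[str],
               fetch_links: Callable[[str], Iterable[str]]) -> Dict[str, List[str]]:
    """Fetch only the URLs the user supplied; no root URL, no hops.
    Links discovered on each page are returned but never crawled."""
    results: Dict[str, List[str]] = {}
    for url in url_list:
        results[url] = list(fetch_links(url))  # one fetch per input URL
    return results

# Usage with a fake fetcher in place of real HTTP requests:
pages = {"URL1": ["a"], "URL2": [], "URL3": ["b", "c"]}
print(crawl_urls(["URL1", "URL3"], lambda u: pages[u]))
```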

C0DE-SLAYER commented 1 week ago

For example, with url_list = [URL1, URL2, URL3], we would loop through url_list and fetch each link, but there would be no root URL. If I am understanding it right, I would love to solve it; please assign this issue to me.

indrajithi commented 1 week ago

Hi @C0DE-SLAYER.

Please let us know if you are still working on this.

C0DE-SLAYER commented 1 week ago

@indrajithi Yes, I will open a PR today.