I have a question: isn't this already achievable through max_links=0 in the Spider class? And if not, does this mean adding an argument to Spider.__init__ which, when set to true, makes it crawl only the root website?
If we set max_links=0 it will crawl only the root_url once. Say, for example, we pass the root_url as https://github.com. It will crawl only this page and fetch all the links on it. It will not crawl https://github.com/indrajithi/tiny-web-crawler and fetch the links on that page. max_links is the number of urls/links crawled.

What we want to achieve in this issue is that it should only crawl internal links, i.e. every link that has https://github.com/ in it, and should not crawl external links. This will be useful in creating a sitemap for a website. LMK if you have any more questions. @Mews
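For context, a minimal sketch of how that internal/external check could work, using a plain urllib.parse host comparison; the helper name is_internal is hypothetical and not part of tiny-web-crawler:

```python
from urllib.parse import urlparse

def is_internal(root_url: str, link: str) -> bool:
    """True if link points at the same host as root_url."""
    root_host = urlparse(root_url).netloc
    link_host = urlparse(link).netloc
    # Relative links like "/indrajithi/tiny-web-crawler" have no
    # netloc, so treat them as internal.
    return link_host in ("", root_host)

print(is_internal("https://github.com", "https://github.com/indrajithi/tiny-web-crawler"))  # True
print(is_internal("https://github.com", "https://example.com"))  # False
```

One open design question: a strict netloc comparison like this treats subdomains (e.g. gist.github.com) as external.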
Alright, makes sense. What should I call the argument then, something like crawl_external_links? And the default would be true?
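For illustration, a hypothetical sketch of that signature, using the argument name and default floated above rather than any merged API:

```python
class Spider:
    def __init__(self, root_url: str, max_links: int,
                 crawl_external_links: bool = True) -> None:
        # crawl_external_links=True keeps today's behaviour;
        # False would restrict the crawl to root_url's own domain.
        self.root_url = root_url
        self.max_links = max_links
        self.crawl_external_links = crawl_external_links
```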
Oh wait there's already a pr open for this
Would you like to pick this up? This is very similar to what we discussed.
Sure!
@devavinothm Are you working on this? https://github.com/indrajithi/tiny-web-crawler/pull/14
@indrajithi I can complete his pr if you want
@Mews I have updated the description. Assigning to you. 🥇
Thanks, I'm going to sleep right now but I'll get to it tomorrow morning :)
A very straightforward feature: add a flag to crawl only the root website and not crawl external links.
e.g. if the root url provided is https://github.com, it should crawl pages in this domain only. It should not crawl https://example.com.
(Optional) Can we also support an option to crawl only external links and no internal links? There could be some use cases for that.
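A minimal sketch of how both modes could combine into a single link filter, assuming hypothetical internal_only/external_only flags rather than whatever the final PR names them:

```python
from urllib.parse import urlparse

def should_crawl(root_url: str, link: str,
                 internal_only: bool = False,
                 external_only: bool = False) -> bool:
    """Decide whether a discovered link should be followed."""
    same_host = urlparse(link).netloc in ("", urlparse(root_url).netloc)
    if internal_only:
        return same_host
    if external_only:
        return not same_host
    return True  # neither flag set: crawl everything

print(should_crawl("https://github.com", "https://example.com", internal_only=True))   # False
print(should_crawl("https://github.com", "https://example.com", external_only=True))  # True
```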