indrajithi / tiny-web-crawler

A simple and easy to use web crawler for Python
MIT License
55 stars 11 forks

Feature: Support flag to crawl only the root website. Do not hop to external links #11

Closed indrajithi closed 1 week ago

indrajithi commented 2 weeks ago
Mews commented 1 week ago

I have a question: isn't this already achievable through max_links=0 in the Spider class? And if not, does this mean adding an argument to Spider.__init__ which, when set to true, makes it crawl only the root website?

indrajithi commented 1 week ago

If we set max_links=0, it will crawl only the root_url once.

For example, say we pass https://github.com as the root_url. It will crawl only that page and fetch all the links on it. It will not go on to crawl https://github.com/indrajithi/tiny-web-crawler and fetch the links there. max_links is the number of URLs/links crawled.

What we want to achieve in this issue is that it should crawl only internal links.

That is, it should crawl every link that has https://github.com/ in it, and should not crawl external links.
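A minimal sketch of one way to make the internal/external distinction, not the project's actual implementation. The helper name is_internal is hypothetical. It compares hosts with urllib.parse instead of checking whether the root URL appears as a substring, on the assumption that a link such as https://example.com/?ref=https://github.com/ should count as external even though it contains the root URL:

```python
from urllib.parse import urljoin, urlparse


def is_internal(link: str, root_url: str) -> bool:
    """Return True if `link` resolves to the same host as `root_url`.

    Hypothetical helper for illustration; not part of tiny-web-crawler.
    """
    # Resolve relative links such as "/features" against the root first.
    absolute = urljoin(root_url, link)
    return urlparse(absolute).netloc.lower() == urlparse(root_url).netloc.lower()


root = "https://github.com"
links = [
    "https://github.com/indrajithi/tiny-web-crawler",  # same host
    "/features",  # relative link, resolves to the same host
    "https://example.com/?ref=https://github.com/",  # other host
]

# Keep only the links that stay on the root host.
internal = [link for link in links if is_internal(link, root)]
```

Comparing hosts is stricter than a substring match. Whether subdomains (for example docs.github.com) should count as internal is a separate design decision that this sketch does not make; as written, they are treated as external.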

This will be useful for creating a sitemap for a website. LMK if you have any more questions. @Mews

Mews commented 1 week ago

Alright, makes sense. What should I call the argument then, something like crawl_external_links? And the default would be true?
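A rough sketch of how such a flag could behave, assuming the name crawl_external_links and a default of True as floated above. The crawl function and the in-memory PAGES graph are hypothetical stand-ins so the example runs without network access; the real Spider class may be structured quite differently:

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical link graph standing in for real HTTP fetches.
PAGES = {
    "https://github.com": ["https://github.com/about", "https://example.com"],
    "https://github.com/about": ["https://github.com"],
    "https://example.com": ["https://example.com/docs"],
    "https://example.com/docs": [],
}


def crawl(root_url, crawl_external_links=True):
    """Breadth-first crawl over PAGES; returns the visited URLs in order."""
    root_host = urlparse(root_url).netloc
    visited = []
    seen = {root_url}
    queue = deque([root_url])
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in PAGES.get(url, []):
            if link in seen:
                continue
            if not crawl_external_links and urlparse(link).netloc != root_host:
                continue  # skip links that leave the root host
            seen.add(link)
            queue.append(link)
    return visited


everything = crawl("https://github.com")
internal_only = crawl("https://github.com", crawl_external_links=False)
```

Defaulting the flag to True keeps the existing behaviour for current users, and a sitemap-style crawl would pass crawl_external_links=False.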

Mews commented 1 week ago

Oh wait there's already a pr open for this

indrajithi commented 1 week ago

> Oh wait there's already a pr open for this

Would you like to pick this up? This is very similar to what we discussed.

Mews commented 1 week ago

Sure!

indrajithi commented 1 week ago

@devavinothm Are you working on this? https://github.com/indrajithi/tiny-web-crawler/pull/14

Mews commented 1 week ago

@indrajithi I can complete his pr if you want

indrajithi commented 1 week ago

@Mews I have updated the description. Assigning to you. 🥇

Mews commented 1 week ago

Thanks, I'm going to sleep right now but I'll get to it tomorrow morning :)