indrajithi / tiny-web-crawler

A simple and easy-to-use web crawler for Python
MIT License

Tiny Web Crawler


A simple and efficient web crawler for Python.

Features

Installation

Install using pip:

pip install tiny-web-crawler

Usage

from tiny_web_crawler import Spider, SpiderSettings

# Start at the root URL and limit the crawl to 2 links
settings = SpiderSettings(
    root_url='http://github.com',
    max_links=2,
)

spider = Spider(settings)
spider.start()

# Set the number of workers and the delay between requests
# (defaults: delay is 0.5 sec, verbose is True).
# To crawl without any delay, set delay=0.

settings = SpiderSettings(
    root_url='https://github.com',
    max_links=5,
    max_workers=5,
    delay=1,
    verbose=False,
)

spider = Spider(settings)
spider.start()

Output Format

Sample crawl output for https://github.com:

{
    "http://github.com": {
        "urls": [
            "http://github.com/",
            "https://githubuniverse.com/",
            "..."
        ]
    },
    "https://github.com/solutions/ci-cd": {
        "urls": [
            "https://github.com/solutions/ci-cd/",
            "https://githubuniverse.com/",
            "..."
        ]
    }
}
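
Once you have the crawl output, you may want to post-process it. The snippet below is a minimal sketch using only the standard library. It assumes the output is available as a JSON string in which each crawled page maps to an object with a "urls" list. The `sample_output` data and the `summarize_crawl` helper are hypothetical and are not part of tiny-web-crawler.

```python
import json

# Hypothetical sample: each crawled page maps to an object whose
# "urls" list holds the links found on that page (assumed shape).
sample_output = """
{
    "http://github.com": {
        "urls": ["http://github.com/", "https://githubuniverse.com/"]
    },
    "https://github.com/solutions/ci-cd": {
        "urls": ["https://github.com/solutions/ci-cd/", "https://githubuniverse.com/"]
    }
}
"""


def summarize_crawl(raw_json: str) -> dict:
    """Return the link count per page and the sorted unique links."""
    result = json.loads(raw_json)
    # Number of links recorded for each crawled page
    links_per_page = {
        page: len(entry.get("urls", [])) for page, entry in result.items()
    }
    # Every distinct link across all pages, de-duplicated with a set
    unique_links = sorted(
        {url for entry in result.values() for url in entry.get("urls", [])}
    )
    return {"links_per_page": links_per_page, "unique_links": unique_links}


summary = summarize_crawl(sample_output)
print(summary["links_per_page"])
print(summary["unique_links"])
```

The helper reads missing "urls" keys as empty lists, so a page with no recorded links does not raise an error.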

Contributing

Thank you for considering contributing.

Dev setup

Before raising a PR, please make sure you have these checks covered