A simple and efficient web crawler for Python.
Install using pip:
pip install tiny-web-crawler
from tiny_web_crawler import Spider, SpiderSettings

settings = SpiderSettings(
    root_url='http://github.com',
    max_links=2
)

spider = Spider(settings)
spider.start()
# Set workers and delay (defaults: delay is 0.5 seconds, verbose is True)
# To crawl without a delay, set delay=0
settings = SpiderSettings(
    root_url='https://github.com',
    max_links=5,
    max_workers=5,
    delay=1,
    verbose=False
)

spider = Spider(settings)
spider.start()
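To give a feel for what max_links, max_workers and delay control, here is a minimal standard-library sketch of a bounded, concurrent crawl. It is not tiny-web-crawler's implementation: FAKE_WEB, fetch_links and crawl are hypothetical names, the "web" is an in-memory dictionary so nothing touches the network, and the exact semantics of each setting in the library may differ.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "web": page -> links found on it (no network access).
FAKE_WEB = {
    "https://example.com": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": [],
    "https://example.com/c": [],
}


def fetch_links(url, delay):
    """Stand-in for an HTTP fetch: wait `delay` seconds, then return the page's links."""
    time.sleep(delay)
    return url, FAKE_WEB.get(url, [])


def crawl(root_url, max_links, max_workers, delay):
    """Breadth-first crawl that stops once `max_links` pages have been crawled."""
    result = {}
    frontier = [root_url]
    seen = {root_url}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier and len(result) < max_links:
            # Take only as many pages as the remaining budget allows.
            batch = frontier[: max_links - len(result)]
            next_frontier = frontier[len(batch):]
            # pool.map runs fetches concurrently but yields results in input order.
            for url, links in pool.map(lambda u: fetch_links(u, delay), batch):
                result[url] = {"urls": links}
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return result


result = crawl("https://example.com", max_links=3, max_workers=2, delay=0)
for page, entry in result.items():
    print(page, entry["urls"])
```

In this sketch a larger max_workers lets more pages be fetched at once, while a larger delay slows each fetch down, which is the usual trade-off between crawl speed and politeness toward the target site.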
Crawled output sample for https://github.com
{
    "http://github.com": {
        "urls": [
            "http://github.com/",
            "https://githubuniverse.com/",
            "..."
        ]
    },
    "https://github.com/solutions/ci-cd": {
        "urls": [
            "https://github.com/solutions/ci-cd/",
            "https://githubuniverse.com/",
            "..."
        ]
    }
}
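Once a result like the sample above is available as JSON, it can be post-processed with the standard library alone. The following is a small sketch, assuming the output keeps the "urls" lists shown in the sample; collect_links is a hypothetical helper, not part of tiny-web-crawler, and the inline sample is abbreviated. It walks the structure recursively, so it does not depend on exactly how the page entries are nested.

```python
import json

# Abbreviated sample shaped like the crawled output; real results may differ.
sample = """
{
    "http://github.com": {
        "urls": ["http://github.com/", "https://githubuniverse.com/"]
    },
    "https://github.com/solutions/ci-cd": {
        "urls": ["https://github.com/solutions/ci-cd/", "https://githubuniverse.com/"]
    }
}
"""


def collect_links(node, page=None, found=None):
    """Map each crawled page to the set of links listed under its "urls" key."""
    if found is None:
        found = {}
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "urls" and isinstance(value, list) and page is not None:
                found.setdefault(page, set()).update(value)
            else:
                # Any other key is treated as a page URL whose value is searched in turn.
                collect_links(value, key, found)
    return found


links = collect_links(json.loads(sample))
for page, urls in sorted(links.items()):
    print(f"{page}: {len(urls)} link(s)")

# Every distinct link seen anywhere in the crawl.
all_links = set().union(*links.values())
```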
Thank you for considering contributing.
Pick a good-first-issue and get started, or browse the issue list and see if anything interests you.
Set up a development environment with Poetry and pre-commit:
pipx install poetry
poetry shell
poetry install --with dev
pre-commit install
pre-commit install --hook-type pre-push