Robots.txt parsing in Python.

reppy provides:
- Helper utilities for fetching and parsing robots.txts, including checking
  cache-control and expires headers
- Support for the Crawl-Delay and Sitemaps directives
- Built-in caching of robots.txt responses

Installation
reppy is available on PyPI:
pip install reppy
When installing from source, there are submodule dependencies that must also be fetched:
git submodule update --init --recursive
make install
Two classes answer questions about whether a URL is allowed: Robots and Agent:
from reppy.robots import Robots
# This utility uses `requests` to fetch the content
robots = Robots.fetch('http://example.com/robots.txt')
robots.allowed('http://example.com/some/path/', 'my-user-agent')
# Get the rules for a specific agent
agent = robots.agent('my-user-agent')
agent.allowed('http://example.com/some/path/')
The Robots class also exposes the properties expired and ttl to describe how
long the response should be considered valid. A reppy.ttl policy is used to
determine what that should be:
from reppy.ttl import HeaderWithDefaultPolicy
# Use the `cache-control` or `expires` headers, defaulting to 30 minutes and
# ensuring it's at least 10 minutes
policy = HeaderWithDefaultPolicy(default=1800, minimum=600)
robots = Robots.fetch('http://example.com/robots.txt', ttl_policy=policy)
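To illustrate the idea behind such a header-based policy, here is a sketch of how a TTL might be derived from response headers. This is not reppy's implementation; the function name and the exact header handling are assumptions for illustration.

```python
import re

def ttl_from_headers(headers, default=1800, minimum=600):
    # Honor a `cache-control: max-age=N` header if present, fall back to
    # a default TTL, and never go below a configured minimum.
    match = re.search(r'max-age=(\d+)', headers.get('cache-control', ''))
    ttl = int(match.group(1)) if match else default
    return max(ttl, minimum)

ttl_from_headers({'cache-control': 'max-age=60'})  # clamped up to the 600s minimum
ttl_from_headers({})                               # falls back to the 1800s default
```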
The fetch method accepts *args and **kwargs that are passed on to requests.get,
allowing you to customize the way the fetch is executed:
robots = Robots.fetch('http://example.com/robots.txt', headers={...})
Both * and $ are supported for wildcard matching. This library follows the
matching described by the 1996 RFC. Where multiple rules match a query, the
longest rule wins, as it is presumed to be the most specific.
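As a rough sketch of how such wildcard rules can be interpreted (this is illustrative, not reppy's implementation; rule_to_regex is a hypothetical helper), a pattern can be translated into a regular expression where * matches any sequence of characters and a trailing $ anchors the end:

```python
import re

def rule_to_regex(pattern):
    # Translate a robots.txt path pattern into a regex: '*' matches any
    # sequence of characters, and a trailing '$' anchors the end of the path.
    anchored = pattern.endswith('$')
    if anchored:
        pattern = pattern[:-1]
    regex = '^' + '.*'.join(re.escape(part) for part in pattern.split('*'))
    if anchored:
        regex += '$'
    return re.compile(regex)

rule = rule_to_regex('/*.pdf$')
bool(rule.match('/report.pdf'))       # True
bool(rule.match('/report.pdf.html'))  # False: the '$' anchors the match
```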
The Robots class also lists the sitemaps declared in a robots.txt:
# This property holds a list of URL strings of all the sitemaps listed
robots.sitemaps
The Crawl-Delay directive is per-agent and can be accessed through that class.
If none was specified, it's None:
# What's the delay my-user-agent should use
robots.agent('my-user-agent').delay
Robots.txt URL
Given a URL, there's a utility to determine the URL of the corresponding
robots.txt. It preserves the scheme and hostname, and the port (if it's not
the default port for the scheme).
# Get robots.txt URL for http://userinfo@example.com:8080/path;params?query#fragment
# It's http://example.com:8080/robots.txt
Robots.robots_url('http://userinfo@example.com:8080/path;params?query#fragment')
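The behavior described above can be approximated with the standard library. This is a sketch under the stated rules, not reppy's implementation:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(url):
    # Keep the scheme and hostname, and the port only when it's not the
    # default for the scheme; drop userinfo, path, params, query, fragment.
    parts = urlsplit(url)
    host = parts.hostname or ''
    default_port = {'http': 80, 'https': 443}.get(parts.scheme)
    if parts.port and parts.port != default_port:
        host = '%s:%d' % (host, parts.port)
    return urlunsplit((parts.scheme, host, '/robots.txt', '', ''))

robots_url('http://userinfo@example.com:8080/path;params?query#fragment')
# 'http://example.com:8080/robots.txt'
```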
There are two cache classes provided -- RobotsCache, which caches entire
reppy.Robots objects, and AgentCache, which only caches the reppy.Agent
relevant to a client. These caches duck-type the class they cache for the
purposes of checking whether a URL is allowed:
from reppy.cache import RobotsCache
cache = RobotsCache(capacity=100)
cache.allowed('http://example.com/foo/bar', 'my-user-agent')
from reppy.cache import AgentCache
cache = AgentCache(agent='my-user-agent', capacity=100)
cache.allowed('http://example.com/foo/bar')
Like reppy.Robots.fetch, the cache constructors accept a ttl_policy to inform
the expiration of the fetched Robots objects, as well as *args and **kwargs to
be passed to reppy.Robots.fetch.
There's a piece of classic caching advice: "don't cache failures." However, it is not always appropriate. For example, if the failure is a timeout, clients may want to cache that result so that every subsequent check doesn't incur a long wait.
To this end, the cache module provides a notion of a cache policy, which
determines what to do in the case of an exception. The default is to cache a
form of a disallowed response for 10 minutes, but you can configure it as you
see fit:
# Do not cache failures (note the `ttl=0`):
from reppy.cache.policy import ReraiseExceptionPolicy
cache = AgentCache('my-user-agent', cache_policy=ReraiseExceptionPolicy(ttl=0))
# Cache and reraise failures for 10 minutes (note the `ttl=600`):
cache = AgentCache('my-user-agent', cache_policy=ReraiseExceptionPolicy(ttl=600))
# Treat failures as being disallowed
from reppy.cache.policy import DefaultObjectPolicy
from reppy.robots import Agent
cache = AgentCache(
    'my-user-agent',
    cache_policy=DefaultObjectPolicy(ttl=600, factory=lambda _: Agent().disallow('/')))
A Vagrantfile is provided to bootstrap a development environment:
vagrant up
Alternatively, development can be conducted using a virtualenv:
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
To launch the vagrant image, we only need to vagrant up (though you may have
to provide a --provider flag):
vagrant up
With a running vagrant instance, you can log in and run tests:
vagrant ssh
make test
Tests are run with the top-level Makefile:
make test
These are not all hard-and-fast rules, but in general PRs have the following expectations:
- PR reviews consider the design, organization, and functionality of the submitted code.
- Certain types of changes should be made in their own commits to improve readability.
  When too many different types of changes land in a single commit, the purpose of each
  change is muddled. Giving each commit a single logical purpose makes it implicitly
  clear why the changes in that commit took place. For example, dependency updates
  (bundle update or berks update) should be their own commits.
- Small new features (where small refers to the size and complexity of the change, not
  the impact) are often introduced in a single commit. Larger features or components
  might be built up piecewise, with each commit containing a single part of it (and its
  corresponding tests).
- In general, bug fixes should come in two-commit pairs: a commit adding a failing test
  demonstrating the bug, and a commit making that failing test pass.
Whenever the version included in setup.py is changed (and it should be changed
when appropriate, following http://semver.org/), a corresponding tag should be
created with the same version number (formatted v<version>).
git tag -a v0.1.0 -m 'Version 0.1.0
This release contains an initial working version of the `crawl` and `parse`
utilities.'
git push --tags origin