url-cpp
A C++ port of the URL parsing and sanitization provided by url-py.
Goals
- Performance -- to be faster than url-py for existing Python projects.
- Standardization -- of how we interpret and sanitize URLs across projects.
- Relaxed parsing -- to accept and work with the "URLs" we see in crawling, even when
malformed.
RFC Compliance
Rather than accept only RFC-compliant URLs, this library's parsing of URLs is based almost
exclusively on Python's own urllib. It is relatively permissive, but we've used it quite
extensively through url-py and have come to understand its interpretations.
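As a rough illustration of what "permissive" means here, the following is a minimal
sketch of a splitter that never rejects its input. It is loosely modeled on the style of
splitting that urllib's urlsplit performs; it is not url-cpp's API or its actual
algorithm. The names SplitUrl and split_permissive are hypothetical and exist only for
this example.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch; not a type provided by url-cpp.
struct SplitUrl {
    std::string scheme, netloc, path, query, fragment;
};

// Permissively split `input` into components. Malformed input is never
// rejected: whatever cannot be classified simply ends up in `path`.
SplitUrl split_permissive(const std::string& input) {
    SplitUrl out;
    std::string rest = input;
    const std::size_t npos = std::string::npos;

    // Scheme: the text before the first ':' if it starts with a letter and
    // contains only letters, digits, '+', '-' or '.'. Lowercased on output.
    std::size_t colon = rest.find(':');
    if (colon != npos && colon > 0 &&
        std::isalpha(static_cast<unsigned char>(rest[0]))) {
        bool valid = true;
        for (std::size_t i = 0; i < colon; ++i) {
            unsigned char c = static_cast<unsigned char>(rest[i]);
            if (!std::isalnum(c) && c != '+' && c != '-' && c != '.') {
                valid = false;
                break;
            }
        }
        if (valid) {
            for (std::size_t i = 0; i < colon; ++i) {
                unsigned char c = static_cast<unsigned char>(rest[i]);
                out.scheme += static_cast<char>(std::tolower(c));
            }
            rest = rest.substr(colon + 1);
        }
    }

    // Network location: present only when the remainder starts with "//".
    if (rest.compare(0, 2, "//") == 0) {
        std::size_t end = rest.find_first_of("/?#", 2);
        out.netloc = rest.substr(2, end == npos ? npos : end - 2);
        rest = (end == npos) ? std::string() : rest.substr(end);
    }

    // Fragment first, then query; whatever is left is the path.
    std::size_t hash = rest.find('#');
    if (hash != npos) {
        out.fragment = rest.substr(hash + 1);
        rest = rest.substr(0, hash);
    }
    std::size_t question = rest.find('?');
    if (question != npos) {
        out.query = rest.substr(question + 1);
        rest = rest.substr(0, question);
    }
    out.path = rest;
    return out;
}
```

With this sketch, split_permissive("HTTP://Example.com/a b?x=1#frag") yields the scheme
"http", the netloc "Example.com", the path "/a b" (unescaped space and all), the query
"x=1", and the fragment "frag". A bare "example.com/path" has no scheme or netloc and is
returned entirely as the path rather than raising an error.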
Development
Environment
To launch the vagrant image, we only need to vagrant up (though you may have to provide a
--provider flag):
vagrant up
With a running vagrant instance, you can log in and run tests:
vagrant ssh
cd /vagrant
make test
Running Tests
Tests are run with the top-level Makefile:
make test
PRs
These are not all hard-and-fast rules, but in general PRs have the following expectations:
- pass Travis -- or more generally, whatever CI is used for the particular project
- be a complete unit -- whether a bug fix or feature, it should appear as a complete
unit before consideration.
- maintain code coverage -- some projects may include code coverage requirements as
part of the build as well
- maintain the established style -- this means the existing style of established
projects, the established conventions of the team for a given language on new
projects, and the guidelines of the community of the relevant languages and
frameworks.
- include failing tests -- in the case of bugs, failing tests demonstrating the bug
should be included as one commit, followed by a commit making the test succeed. This
allows us to jump to a world with a bug included, and prove that our test in fact
exercises the bug.
- be reviewed by one or more developers -- not all feedback has to be accepted, but
it should all be considered.
- avoid 'addressed PR feedback' commits -- in general, PR feedback should be rebased
back into the appropriate commits that introduced the change. In cases where this
is burdensome, PR feedback commits may be used but should still describe the changes
contained therein.
PR reviews consider the design, organization, and functionality of the submitted code.
Commits
Certain types of changes should be made in their own commits to improve readability. When
too many different types of changes happen simultaneously in a single commit, the purpose of
each change is muddled. By giving each commit a single logical purpose, it is implicitly
clear why changes in that commit took place.
- updating / upgrading dependencies -- this is especially true for invocations like
bundle update or berks update.
- introducing a new dependency -- often preceded by a commit updating existing
dependencies, this should only include the changes for the new dependency.
- refactoring -- these commits should preserve all the existing functionality and
merely update how it's done.
- utility components to be used by a new feature -- if introducing an auxiliary class
in support of a subsequent commit, add this new class (and its tests) in its own
commit.
- config changes -- when adjusting configuration in isolation.
- formatting / whitespace commits -- when adjusting code only for stylistic purposes.
New Features
Small new features (where small refers to the size and complexity of the change, not the
impact) are often introduced in a single commit. Larger features or components might be
built up piecewise, with each commit containing a single part of it (and its corresponding
tests).
Bug Fixes
In general, bug fixes should come in two-commit pairs: a commit adding a failing test
demonstrating the bug, and a commit making that failing test pass.
Tagging and Versioning
Whenever the version included in setup.py is changed (and it should be changed when
appropriate using http://semver.org/), a corresponding tag should be created with the
same version number (formatted v<version>).
git tag -a v0.1.0 -m 'Version 0.1.0
This release contains an initial working version of the `crawl` and `parse`
utilities.'
git push origin v0.1.0