seomoz / url-cpp

C++ bindings for url parsing and sanitization
MIT License

url-cpp

Status: Production | Team: Big Data | Scope: External | Open Source: No | Critical: Yes

A C++ port of the URL parsing and sanitization provided by url-py

Goals

  1. Performance -- to be faster than url-py for existing Python projects.
  2. Standardization -- of how we interpret and sanitize URLs across projects.
  3. Relaxed parsing -- to accept and work with the "URLs" we see in crawling, even when malformed.

RFC Compliance

Rather than accept only RFC-compliant URLs, this library's parsing of URLs is based almost exclusively on Python's own urllib. It is relatively permissive, but we've used it quite extensively through url-py and have come to understand its interpretations.
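To give a flavor of what "permissive" means here, the following is a minimal, self-contained sketch of urllib-style scheme splitting. It is not url-cpp's implementation or API: the names `split_scheme` and `split_scheme_examples` are hypothetical, and the logic only loosely follows what Python's `urlsplit` does. The point is that input without a recognizable scheme is kept and passed along rather than rejected.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <utility>

// Hypothetical helper, NOT part of url-cpp: split "scheme:rest" permissively,
// loosely modeled on Python's urllib. Returns {scheme, rest}; the scheme is
// empty (and rest is the whole input) when no valid scheme prefix is found.
std::pair<std::string, std::string> split_scheme(const std::string& url)
{
    const std::size_t colon = url.find(':');
    if (colon == std::string::npos || colon == 0)
    {
        return {"", url};
    }

    // A scheme must start with a letter...
    if (!std::isalpha(static_cast<unsigned char>(url[0])))
    {
        return {"", url};
    }

    std::string scheme;
    for (std::size_t i = 0; i < colon; ++i)
    {
        const unsigned char c = static_cast<unsigned char>(url[i]);
        // ...and may continue with letters, digits, '+', '-' or '.'.
        if (!std::isalnum(c) && c != '+' && c != '-' && c != '.')
        {
            return {"", url};
        }
        // Schemes are case-insensitive, so normalize to lowercase.
        scheme.push_back(static_cast<char>(std::tolower(c)));
    }
    return {scheme, url.substr(colon + 1)};
}

// Usage examples, written as assertions.
void split_scheme_examples()
{
    assert(split_scheme("HTTP://Example.com/").first == "http");
    assert(split_scheme("HTTP://Example.com/").second == "//Example.com/");
    // Malformed input is not an error; it simply has no scheme.
    assert(split_scheme("not a url: at all").first.empty());
    assert(split_scheme("/a:b").second == "/a:b");
}
```

Returning an empty scheme instead of throwing is the design choice that matters in this sketch: a crawler can still work with whatever text it was handed.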

Development

Environment

To launch the Vagrant image, run vagrant up (you may need to pass a --provider flag):

vagrant up

With a running vagrant instance, you can log in and run tests:

vagrant ssh
cd /vagrant

make test

Running Tests

Tests are run with the top-level Makefile:

make test

PRs

These are not all hard-and-fast rules, but in general PRs have the following expectations:

PR reviews consider the design, organization, and functionality of the submitted code.

Commits

Certain types of changes should be made in their own commits to improve readability. When too many different types of changes happen simultaneously in a single commit, the purpose of each change is muddled. By giving each commit a single logical purpose, it is implicitly clear why the changes in that commit took place.

New Features

Small new features (where small refers to the size and complexity of the change, not the impact) are often introduced in a single commit. Larger features or components might be built up piecewise, with each commit containing a single part of it (and its corresponding tests).

Bug Fixes

In general, bug fixes should come in two-commit pairs: a commit adding a failing test demonstrating the bug, and a commit making that failing test pass.
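As an illustration of the two-commit pattern, the sketch below shows what such a pair might contain for a made-up bug. The function `strip_fragment` and the test name are hypothetical and not part of url-cpp, and plain `assert` is used only to keep the sketch self-contained; the project's real test framework may differ. The first commit would add only the test, which fails against the buggy code; the second would change the function so the test passes.

```cpp
#include <cassert>
#include <string>

// Commit 2 (the fix): hypothetical helper that drops the "#fragment" part.
// The imagined buggy version used rfind('#') and therefore cut at the LAST
// '#', leaving part of the fragment behind.
std::string strip_fragment(const std::string& url)
{
    const std::size_t hash = url.find('#');  // the first '#' starts the fragment
    return hash == std::string::npos ? url : url.substr(0, hash);
}

// Commit 1 (the failing test): committed on its own, before the fix, so the
// history shows the bug being reproduced.
void test_strip_fragment_with_multiple_hashes()
{
    assert(strip_fragment("http://example.com/a#b#c") == "http://example.com/a");
    assert(strip_fragment("http://example.com/a") == "http://example.com/a");
}
```

Keeping the test in its own commit lets a reviewer check out that commit, watch it fail, and then confirm that the following commit is what makes it pass.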

Tagging and Versioning

Whenever the version included in setup.py is changed (and it should be changed when appropriate, following semantic versioning as described at http://semver.org/), a corresponding tag should be created with the same version number (formatted v<version>).

git tag -a v0.1.0 -m 'Version 0.1.0

This release contains an initial working version of the `crawl` and `parse`
utilities.'

git push --tags origin