Based on the excellent work of @christinelovett and @abeatrix, we need to write a script that allows CVS repositories to be imported into Sourcegraph.
Requirements
- At least partial history
- Able to scale to repositories in the hundreds of gigabytes with reasonable resource usage (run time measured in hours or days, not weeks; no exotic memory or storage requirements)
- Able to be re-run to sync new commits incrementally, rather than performing a full reimport
- Able to filter included content based on file name
- Able to filter branches
Prior art
There are three tools/workflows I know of that allow for CVS-to-Git conversion:
- `git-cvsimport`: this tool is bundled with Git itself, but depends on `cvsps` version 2 being available to do patchset detection. It supports incremental updates, but is slow (as it shells out to run `git` for each commit) and suffers from `cvsps`'s limitations around what it considers a valid repository. The performance concerns make it a non-starter.
- `cvs-fast-export`: this tool is based on earlier tools to parse and export CVS repositories, and generates a stream of data that can be imported by `git-fast-import`. In practice, while faster than `git-cvsimport`, it also suffers from significant scaling issues due to the way it detects patchsets and branches, tends to error out on CVS repositories with messier branch/tag histories that can't be represented sensibly in Git, and doesn't support incremental updates.
- Combining `cvs2svn` with one of the many tools that convert a Subversion repository to Git (most likely `svn2git`). `cvs2svn` has no concept of incremental updates, although practically speaking it might be acceptable to redo the conversion each time. However, this relies on us being able to retain enough history and structure to be useful across two leaky abstractions, which feels dangerous.
Preferred method
I believe there's a path to implementing a tool with the desired properties. The key is that we can use `git-fast-import` to perform a full import with only one pass over the CVSROOT:
1. Add all file revisions as blobs, without bothering to note whether we actually need them.
2. Simultaneously build an ordered map of file commits, keyed as (author, commit message) => (file, time, revision), with the order taken from the RCS revision IDs.
3. Split the map values into buckets based on the closeness of their commit times, as `cvsps` does.
4. Retrieve the patchsets in order, each with its author, commit message, and files; tracking the previous patchset as the parent commit, we can use `filedeleteall` when constructing `git-fast-import` commits to allow Git to figure out the history.
5. If we store and retrieve marks and inferred patchsets, we can also do this incrementally.
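A minimal sketch of the bucketing and emission steps, assuming the per-file revisions have already been extracted from the RCS `,v` files and their blobs emitted with marks in the same pass. `FileRev`, the 300-second window, and the mark-numbering scheme are illustrative assumptions, not the actual tool:

```python
import sys
from dataclasses import dataclass

# Revisions by the same (author, message) within this window are treated
# as one patchset, mimicking cvsps's time-closeness heuristic.
WINDOW = 300  # seconds (assumed value)

@dataclass
class FileRev:
    path: str
    revision: str   # RCS revision ID, e.g. "1.4"
    time: int       # commit time, epoch seconds
    blob_mark: int  # mark assigned when the blob was emitted in pass one
    state: str = "Exp"  # CVS revision state; "dead" marks a deletion

def bucket_patchsets(revs_by_key):
    """Split each (author, message) group into time-ordered buckets."""
    patchsets = []
    for (author, message), revs in revs_by_key.items():
        revs.sort(key=lambda r: r.time)
        bucket = []
        for rev in revs:
            if bucket and rev.time - bucket[-1].time > WINDOW:
                patchsets.append((author, message, bucket))
                bucket = []
            bucket.append(rev)
        if bucket:
            patchsets.append((author, message, bucket))
    # Order patchsets globally by the time of their first revision.
    patchsets.sort(key=lambda p: p[2][0].time)
    return patchsets

def emit_commits(patchsets, out=sys.stdout):
    """Emit a git-fast-import stream: each commit issues deleteall (the
    filedeleteall command) and then restates the full tree, so Git can
    infer deletions and history on its own."""
    tree = {}          # path -> blob mark of the file's latest revision
    mark = 1_000_000   # commit marks, assumed disjoint from blob marks
    for author, message, revs in patchsets:
        for rev in revs:
            if rev.state == "dead":
                tree.pop(rev.path, None)
            else:
                tree[rev.path] = rev.blob_mark
        when = revs[0].time
        out.write("commit refs/heads/master\n")
        out.write(f"mark :{mark}\n")
        out.write(f"committer {author} {when} +0000\n")
        out.write(f"data {len(message.encode('utf-8'))}\n{message}\n")
        out.write("deleteall\n")
        for path, blob in sorted(tree.items()):
            out.write(f"M 100644 :{blob} {path}\n")
        out.write("\n")
        mark += 1
```

For the incremental case, `git fast-import --export-marks`/`--import-marks` can persist the mark table between runs; the inferred patchsets would be stored alongside it.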
Milestones
To be turned into issues:
- `HEAD`-only import
  - `git-fast-import` support: done
  - testing: 1d