Automatically search for multiple copies of the same document, and find equivalences

Discussion moved from https://github.com/hypothesis/h/issues/772.

This is an alternative approach to #7.

Problem statement (which is already explained in #7):

There are cases when the same document appears online under many different URLs, and we need a way to make annotations transparent between those copies. For this, we need to declare the equivalence of those documents in our DB.

Currently, we only detect document equivalence based on various metadata, but there are cases when this does not work. We would like to cover those cases, too.

One way is to do this manually; see #7. The other way is to do it automatically; that's what this issue is for.

We could discover these equivalences automatically by collecting all visited URLs, examining their content, and looking for equivalences in the background.

The first question is: how do we detect equivalence? There might be different headers/footers around the same content, so we need to be somewhat flexible; simple checksumming is not sufficient. But maybe we could be smart about it. I have a friend who used to work on something very similar. Their KOPI service has an API that we could probably use for this, if we wanted to.
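
To illustrate what "more flexible than checksumming" could mean, here is a minimal sketch that compares two documents by their overlapping word windows (shingles) and the Jaccard similarity of those sets. This is only one possible technique, not what KOPI does and not an agreed design; the function names and the window size `k=5` are assumptions made up for this example.

```python
import re


def shingles(text, k=5):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Very short texts become a single shingle (or nothing if empty).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(doc_a, doc_b, k=5):
    """Score in [0, 1]; values near 1 suggest the same underlying content."""
    return jaccard(shingles(doc_a, k), shingles(doc_b, k))
```

Two copies of a long text with different short headers and footers would still score close to 1 here, while a checksum would treat them as completely different. Comparing every pair of visited documents this way would not scale, so a real background job would probably need some indexing trick on top.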

The other question is, what should we do with the automatically found equivalences?

We could just declare them automatically. That would be convenient, but could theoretically result in some false positives.
We could also randomly ask users to review them; that could reduce the rate of false positives, but would be somewhat annoying.
We could also use paid labor for verifying the results (for example via mturk or something similar), but currently we don't have the funds for that.
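
These options do not have to be exclusive. As a rough sketch, assuming we have some similarity score between 0 and 1 for each candidate pair, we could declare only the near-certain matches automatically and send the uncertain ones to review. The function name and both thresholds below are hypothetical, picked only for illustration.

```python
def triage(score, auto_threshold=0.95, review_threshold=0.7):
    """Decide what to do with a candidate equivalence.

    score: similarity in [0, 1] from whatever detector we end up using.
    The thresholds are illustrative guesses, not tuned values.
    """
    if score >= auto_threshold:
        return "declare"  # confident enough to declare automatically
    if score >= review_threshold:
        return "review"   # uncertain: ask a user (or paid reviewer) to check
    return "ignore"       # too dissimilar to bother anyone with
```

For example, `triage(0.99)` returns `"declare"`, `triage(0.8)` returns `"review"`, and `triage(0.2)` returns `"ignore"`.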

For more background information (and some ranting), see #7.
