hypothesis / vision

Envisioning the future of the Hypothesis.
https://github.com/hypothesis/vision/issues/

Automatically search for multiple copies of the same document, and find equivalences #8

Closed csillag closed 5 years ago

csillag commented 10 years ago

Discussion moved from https://github.com/hypothesis/h/issues/772.


This is an alternative approach to #7.

Problem statement (which is already explained in #7):

There are cases when the same document appears online under many different URLs, and we need a way to make annotations transparent between those copies. For this, we need to declare the equivalence of those documents in our DB.

Currently, we only detect document equivalence based on various metadata, but there are cases when this does not work. We would like to cover those cases, too.

One way is to do this manually; see #7. The other way is to do it automatically; that's what this issue is for.

We could discover these equivalences automatically by collecting all visited URLs, examining their content, and looking for equivalences in the background.

The first question is: how do we detect equivalence? There might be different headers/footers around the same content, so we need to be somewhat flexible; simple checksumming is not sufficient. But maybe we could be smart about it. I have a friend who used to work on something very similar. Their KOPI service has an API that we could probably use for this, if we wanted to.
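To illustrate why a fuzzy comparison could tolerate differing headers/footers where a checksum cannot, here is a minimal sketch using word shingles and Jaccard similarity. This is only one possible technique, not what KOPI does and not anything that exists in our code; the function names, the shingle size `k=5`, and the `0.8` threshold are all assumptions picked for illustration.

```python
import re


def shingles(text, k=5):
    """Return the set of overlapping k-word sequences in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Too short to shingle: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    if not union:
        return 0.0  # two empty documents tell us nothing
    return len(a & b) / len(union)


def looks_equivalent(text_a, text_b, threshold=0.8, k=5):
    """Hypothetical helper: True if the texts share most of their shingles.

    The threshold is an arbitrary assumption and would need tuning.
    """
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold


# Same body wrapped in different site chrome vs. an unrelated document.
body = " ".join(f"word{i}" for i in range(200))
copy_a = "Site A header " + body + " footer copyright"
copy_b = "Mirror B navigation " + body
unrelated = " ".join(f"other{i}" for i in range(200))

print(looks_equivalent(copy_a, copy_b))
print(looks_equivalent(copy_a, unrelated))
```

A small header or footer only changes a handful of shingles, so the similarity stays high, whereas a checksum would differ completely. Comparing every pair of visited documents this way would not scale, so a real background job would presumably need some indexing scheme on top of it.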

The other question is, what should we do with the automatically found equivalences?

For more background information (and some ranting), see #7.