hypothesis / h

Annotate with anyone, anywhere.
https://hypothes.is/
BSD 2-Clause "Simplified" License

Migrate incorrect Via 3 PDF document URLs #6566

Open robertknight opened 3 years ago

robertknight commented 3 years ago

As a result of https://github.com/hypothesis/via3/issues/372, all existing annotations created on PDFs using Via3 before 2021-04-06 have an incorrect target_uri field. Instead of being the original URL of the PDF, it is the Via3 URL which proxies the PDF. In other words, rather than https://example.com/file.pdf, the target_uri is https://via3.hypothes.is/proxy/static/https://example.com/file.pdf or similar. We went through several different domains while working on a replacement for Via, including at least via3.hypothes.is and via4.hypothes.is.
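To make the shape of the required rewrite concrete, here is a minimal Python sketch of mapping a proxied URL back to the original PDF URL. It is an illustration only, not the code used in h or Via: the function name `strip_via_proxy` is hypothetical, the host list contains only the two domains named above, and the `/proxy/static/` prefix is taken from the single example URL, so other prefixes ("or similar") and any Via-specific query parameters are not handled.

```python
from urllib.parse import urlsplit

# Assumption: only the two hosts named in this issue. Other interim
# domains may have been used and would need to be added here.
VIA_HOSTS = {"via3.hypothes.is", "via4.hypothes.is"}

# Assumption: path prefix copied from the example URL in this issue.
PROXY_PREFIX = "/proxy/static/"


def strip_via_proxy(uri: str) -> str:
    """Return the original URL embedded in a Via proxy URL.

    URLs that do not match the expected host and prefix are returned
    unchanged, so the function is safe to apply to every row.
    """
    parts = urlsplit(uri)
    if parts.hostname not in VIA_HOSTS:
        return uri
    if not parts.path.startswith(PROXY_PREFIX):
        return uri
    # The original URL follows the prefix verbatim, so slice the raw
    # string rather than re-assembling it from parsed components.
    marker = parts.netloc + PROXY_PREFIX
    return uri.split(marker, 1)[1]
```

Returning non-matching URLs unchanged is a deliberate choice in this sketch: it makes the function idempotent, so a migration could be re-run without corrupting rows that were already fixed.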

The bug itself is being fixed by https://github.com/hypothesis/via3/pull/427, but that only affects new annotations. Existing annotations will need to be updated via a data migration or some other mechanism.

Fixing this issue will resolve a problem where the document details show up incorrectly in the Notebook and activity pages (https://hypothes.is/search) for annotations created with Via 3:

[Image: Notebook wrong PDF URL]

Be aware that it is not just the annotation.target_uri field which matters here. There are corresponding fields in the document_uri and document tables as well, and these need to be migrated too.

robertknight commented 3 years ago

CC @esanzgar as we talked about this on Slack recently.

robertknight commented 3 years ago

One way that you might think we could fix this is by updating the URI normalization algorithm, which already does de-Via-fication for legacy Via URLs. However, changing the algorithm doesn't update existing document URIs, and more generally we don't have a procedure to do that. See https://github.com/hypothesis/h/issues/6552.

robertknight commented 3 years ago

Querying a 1% random sample of the annotation table in production and scaling the count by 100 gives an estimate of ~4M annotations whose target_uri references via3.hypothes.is or via4.hypothes.is:

SELECT 100 * count(*)
FROM annotation TABLESAMPLE SYSTEM (1)
WHERE target_uri LIKE 'https://via3.hypothes.is/%'
   OR target_uri LIKE 'https://via4.hypothes.is/%';
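For anyone unfamiliar with the sampling trick, the arithmetic behind the query can be sketched in Python as below. The helper name `estimate_total` is hypothetical, and the sample count in the usage note is an illustrative number chosen to match the ~4M estimate, not a figure read from production.

```python
def estimate_total(sample_matches: int, sample_percent: float) -> int:
    """Scale a match count from a sample up to the whole table.

    sample_percent is the percentage of the table that was sampled,
    e.g. 1 for a 1% sample. The result is only an estimate, and its
    accuracy depends on how evenly matching rows are spread.
    """
    if not 0 < sample_percent <= 100:
        raise ValueError("sample_percent must be in the range (0, 100]")
    return round(sample_matches * 100 / sample_percent)
```

With a 1% sample, a hypothetical count of 40,000 matching rows would scale to `estimate_total(40_000, 1)`, i.e. 4,000,000. Because block-level sampling picks whole pages, rows created close together in time can be over- or under-represented, so the figure should be read as an order-of-magnitude estimate.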