Open robertknight opened 3 years ago
CC @esanzgar as we talked about this on Slack recently.
One way you might think we could fix this is by updating the URI normalization algorithm, which already does de-Via-fication for legacy Via URLs. However, changing the algorithm doesn't update existing document URIs, and more generally we don't have a procedure for doing that. See https://github.com/hypothesis/h/issues/6552.
Querying a random sample of the annotation table in production gives an estimate of ~4M annotations whose URLs reference via3.hypothes.is or via4.hypothes.is:
select 100 * count(*) from annotation tablesample system(1) where target_uri LIKE 'https://via3.hypothes.is/%' OR target_uri LIKE 'https://via4.hypothes.is/%'
As a result of https://github.com/hypothesis/via3/issues/372, all existing annotations created on PDFs using Via3 before 2021-04-06 have an incorrect target_uri field. Instead of being the original URL of the PDF, it is the Via3 URL which proxies the PDF. In other words, rather than https://example.com/file.pdf, the target_uri is https://via3.hypothes.is/proxy/static/https://example.com/file.pdf or similar. We went through several different domains while working on a replacement for Via, including at least via3.hypothes.is and via4.hypothes.is.
This issue was being fixed by https://github.com/hypothesis/via3/pull/427, but existing annotations will need to be updated via a data migration or some other mechanism.
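To make the required rewrite concrete, here is a minimal Python sketch of the URL transformation a migration would need to apply. It is an illustration only, not code from h or via3: the function name `strip_via_proxy` and the constants are hypothetical, the host list contains only the two domains named above, and only the `/proxy/static/` path shape from the example is handled (the "or similar" variants are not).

```python
from urllib.parse import urlsplit

# Hosts named in this issue; other Via 3 domains may have existed (assumption:
# this list is incomplete).
VIA_HOSTS = {"via3.hypothes.is", "via4.hypothes.is"}

# Path prefix taken from the example URL in this issue. Other proxy path
# shapes are deliberately not handled in this sketch.
PROXY_PREFIX = "/proxy/static/"


def strip_via_proxy(uri: str) -> str:
    """Return the proxied document URL if `uri` is a Via 3 proxy URL.

    Any URI that does not match the expected shape is returned unchanged.
    """
    parts = urlsplit(uri)
    if parts.scheme != "https" or parts.hostname not in VIA_HOSTS:
        return uri
    if not parts.path.startswith(PROXY_PREFIX):
        return uri

    # Slice the original string rather than rebuilding from `parts`, so the
    # embedded URL (including any query string) is preserved verbatim.
    marker = parts.netloc + PROXY_PREFIX
    original = uri[uri.index(marker) + len(marker):]

    # Only accept the result if it looks like an absolute web URL.
    if not original.startswith(("http://", "https://")):
        return uri
    return original
```

For example, `strip_via_proxy("https://via3.hypothes.is/proxy/static/https://example.com/file.pdf")` returns `"https://example.com/file.pdf"`, while a URL on any other host is left alone. Whether the proxy URLs carry extra Via-specific query parameters that would also need stripping is not covered here and would have to be checked against real data.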
Fixing this issue will resolve a problem where the document details show up incorrectly in the Notebook and activity pages (https://hypothes.is/search) for annotations created with Via 3.
Be aware that it is not just the annotation.target_uri field which matters here. There are corresponding fields in the document_uri and document tables as well.