robertknight opened this issue 3 years ago
I've been having a think through this. Given the total volume of data in the production database and Elasticsearch index, we can't recompute all normalized URIs in the database and Elasticsearch in a single step. We therefore need to allow for states where changes have been made to URI normalization, but some entries in the database and/or search index still reflect the old version. When searching for entries in the database by normalized URI, we will therefore need to find matches for the normalized URIs generated by both the previous and current versions of the algorithm, until the data migration is fully complete.
A sketch of how this could work (a rough code sketch follows the list):

1. Version `h.util.uri.normalize` and add an optional `version` arg to specify which version to use. If no version is specified, the current version is used.
2. `h.util.uri` would specify the range of live versions.
3. Add a variant of `h.util.uri.normalize` which can return all distinct versions of a normalized URI according to the current live versions.
4. When using `h.storage.expand_uri` to service a search request, consider all live normalized versions of a URI.
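Something like the following is what I have in mind. This is a minimal sketch only: the `version` argument, `CURRENT_VERSION`, `LIVE_VERSIONS` and `normalize_all_live` are proposed names rather than existing API, the example "v2" change is made up, and the real `normalize` does considerably more than this.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

CURRENT_VERSION = 2
# Versions whose output may still be present in the DB / search index
# while a data migration is in progress.
LIVE_VERSIONS = (1, 2)

# Made-up example change: version 2 starts stripping an extra query param.
_IGNORED_PARAMS_V2 = {"resourcekey"}

def normalize(uri, version=None):
    """Return `uri` normalized according to `version` of the algorithm."""
    if version is None:
        version = CURRENT_VERSION
    scheme, netloc, path, query, _fragment = urlsplit(uri)
    params = parse_qsl(query)
    if version >= 2:
        params = [(k, v) for k, v in params if k not in _IGNORED_PARAMS_V2]
    return urlunsplit((scheme, netloc, path, urlencode(params), ""))

def normalize_all_live(uri):
    """Return all distinct normalized forms of `uri` across live versions.

    Search code (eg. `h.storage.expand_uri`) would match against every one
    of these until the data migration is complete.
    """
    return {normalize(uri, version=v) for v in LIVE_VERSIONS}
```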
The migration process would then start in `h.util.uri.normalize`: add logic that applies the new normalization steps, guarded by `if version >= X` checks.

The above process assumes that we only have one production database to deal with, so we can keep a record of the "oldest live version" in the code. To support executing the migration in different instances at different times, we might need to persist the version in the database instead. This could be done as a single attribute for the whole database, or on a more fine-grained basis (eg. an attribute per `annotation` and per `document_uri` row).
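If we went the fine-grained route, it could look something like the sketch below. This is hypothetical: `uri_normalization_version` is a made-up column name, and the model here is heavily reduced from the real `Annotation`.

```python
import sqlalchemy as sa
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Annotation(Base):
    __tablename__ = "annotation"

    id = sa.Column(sa.Integer, primary_key=True)
    target_uri = sa.Column(sa.UnicodeText)
    target_uri_normalized = sa.Column(sa.UnicodeText)
    # Hypothetical: records which version of h.util.uri.normalize produced
    # target_uri_normalized, so a migration can cheaply find stale rows.
    uri_normalization_version = sa.Column(
        sa.Integer, nullable=False, server_default="1"
    )
```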
For working through the details of this procedure, I think that https://github.com/hypothesis/h/pull/6567 is a good use case because it is a logically simple change that affects a large but not huge number of URLs (a maximum of ~170,000). I have done a quick prototype of steps 1-3 locally. The biggest chunk of work here is to figure out the details of the data migration.
Some things we'd need to do in any data migration after changing the URI normalization algorithm (a sketch follows the list):

- Update the `target_uri_normalized` field of each annotation where `normalize(target_uri)` != the current value of `target_uri_normalized`.
- Update the `uri_normalized` and `claimant_normalized` fields of each `document_uri` where `normalize(uri)` != `uri_normalized` OR `normalize(claimant)` != `claimant_normalized`. There is a unique constraint on `(claimant_normalized, uri_normalized, type, content_type)`, so this update may fail if there are multiple `DocumentURI` entries which have different normalized representations with the old algorithm but the same normalized representation with the new algorithm.

For URIs which only appear as additional URIs for an annotation and not as the `target_uri` (eg. `doi:XYZ` URIs), ES reindexing is not needed because the search index only contains the normalized `target_uri`. At search time the query URI (eg. a `doi:XYZ` URI) is expanded to the corresponding `target_uri`s.
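A sketch of what the annotation pass could look like, assuming the versioned `normalize()` and the `Annotation` model sketched above; the batch size, keyset pagination and function name are illustrative only:

```python
BATCH_SIZE = 1000

def renormalize_annotations(session):
    """Recompute `target_uri_normalized` for rows the new algorithm changes."""
    last_id = None
    while True:
        query = session.query(Annotation).order_by(Annotation.id)
        if last_id is not None:
            query = query.filter(Annotation.id > last_id)
        batch = query.limit(BATCH_SIZE).all()
        if not batch:
            break
        for ann in batch:
            new_value = normalize(ann.target_uri)
            # Skip rows the algorithm change doesn't affect.
            if ann.target_uri_normalized != new_value:
                ann.target_uri_normalized = new_value
        # Commit per batch so the whole table isn't updated in one transaction.
        session.commit()
        last_id = batch[-1].id
```

The equivalent pass over `document_uri` would additionally need a strategy for handling the unique-constraint collisions described above.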
I just stumbled across a `normalize-uris` command in the h CLI tool which looks relevant. We probably can't use it as-is any more given the current size of the production database, but perhaps we can use it as a reference for an admin task that performs selective re-computation of the normalized URIs.
I reflected on this again recently and came to the conclusion that, while being able to change the URI normalization algorithm is useful, it would be better to design the system so that we can change which parts of a URI are considered semantically meaningful (ie. they affect the returned content) without having to change the URI normalization algorithm and reindex existing data.
The rough approach I have in mind is to move the logic for ignoring query parameters that do not affect the identity of the document from annotation storage/indexing time to query time (a sketch follows):

- At indexing time, split the URL into tokens, eg:

  `httpx://drive.google.com/uc?id=foobar&export=download&resourcekey=123` => `("httpx", "drive.google.com", "/uc", "id=foobar", "export=download", "resourcekey=123")`

- At query time, match only the *meaningful* tokens. Meaningful tokens are those parts of the URL which affect the identity of the document. Initially this will be all of the URL tokens minus any query parameters that are in the current ignore list. In future we could experiment with alternative policies for determining this set, such as switching from a blocklist to an allowlist of query parameters.

Because the set of meaningful tokens is determined at search time, we can change it much more easily, without needing to modify any already indexed data.
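A sketch of how the two halves could fit together. The function names and the contents of `IGNORED_QUERY_PARAMS` are placeholders (the real ignore list is `BLACKLISTED_QUERY_PARAMS`):

```python
from urllib.parse import parse_qsl, urlsplit

IGNORED_QUERY_PARAMS = {"export", "resourcekey"}  # illustrative values only

def url_tokens(url):
    """Split a URL into tokens. All of these get indexed."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    return [scheme, netloc, path] + [f"{k}={v}" for k, v in parse_qsl(query)]

def meaningful_tokens(url):
    """The subset of tokens that identify the document, computed at query time."""
    return [
        tok
        for tok in url_tokens(url)
        if "=" not in tok or tok.split("=", 1)[0] not in IGNORED_QUERY_PARAMS
    ]
```

With the illustrative ignore list above, the example URL's meaningful tokens would be `["httpx", "drive.google.com", "/uc", "id=foobar"]`.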
There is a widely used URI normalization algorithm implemented by `h.util.uri.normalize` which ensures that certain differences are ignored when comparing/searching by URL in many contexts. These differences include, for example, the presence of blocklisted query parameters.

Periodically we find reasons that we want to change this algorithm, for example to add new parameters to the `BLACKLISTED_QUERY_PARAMS` set. A concrete example is ignoring the `token` signed token that is added to Canvas file URLs. However, we can't currently do this because we have no procedure or tools to help us re-process the existing URIs that are stored in the database and in Elasticsearch. As a result, changing the normalization algorithm will break lookup of existing annotations unless we update references in the DB/Elasticsearch.
Normalized URLs are currently stored in the following database fields:

- `annotation.target_uri_normalized`
- `document.claimant_normalized`
- `document.uri_normalized`

They are also stored in the search index, in fields generated by `AnnotationSearchIndexPresenter`.

This issue covers implementing and documenting tools to facilitate updating the URI normalization algorithm.