hypothesis / h

Annotate with anyone, anywhere.
https://hypothes.is/

Implement a procedure for updating the URI normalization algorithm #6552

Open robertknight opened 3 years ago

robertknight commented 3 years ago

There is a widely used URI normalization algorithm, implemented by h.util.uri.normalize, which ensures that certain differences are ignored when comparing or searching by URL in many contexts. These differences include, for example, the presence of query parameters in the BLACKLISTED_QUERY_PARAMS ignore set.

Periodically we find reasons to change this algorithm, for example to add new parameters to the BLACKLISTED_QUERY_PARAMS set. A concrete example is ignoring the signed token query parameter that is added to Canvas file URLs.

However, we currently can't do this because we have no procedure or tools to help us re-process the existing normalized URIs that are stored in the database and in Elasticsearch. As a result, changing the normalization algorithm will break lookup of existing annotations unless we also update those stored references in the DB/Elasticsearch.

Normalized URLs are currently stored in the following database fields:

  - annotation.target_uri_normalized
  - document_uri.uri_normalized
  - document_uri.claimant_normalized

They are also stored in the search index in fields generated by AnnotationSearchIndexPresenter.

This issue covers implementing and documenting tools to facilitate updating the URI normalization algorithm.

robertknight commented 3 years ago

I've been having a think through this. Given the total volume of data in the production database and Elasticsearch index, we can't recompute all normalized URIs in a single step. Therefore we are going to need to allow for states where changes have been made to URI normalization, but some entries in the database and/or search index still reflect the old version. Until the data migration is fully complete, searching the database by normalized URI will therefore need to find matches for the normalized URIs generated by both the previous and current versions of the algorithm.

A sketch of how this could work:

  1. Introduce a version number for the normalization algorithm implemented by h.util.uri.normalize, and add an optional version arg to specify which version to use. If no version is specified, the current version is used (see the code sketch after this list).
  2. Introduce the concept of "live" versions of the normalization algorithm, which are all versions for which the data has not yet been fully migrated. Constants in h.util.uri would specify the range of live versions.
  3. Add a function in h.util.uri which can return all distinct normalized forms of a URI according to the current live versions.
  4. When expanding URIs in h.storage.expand_uri to service a search request, consider all live normalized versions of a URI.
  5. When persisting and indexing annotations, always use the current version.
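
A minimal sketch of how steps 1-3 could look. The names here (CURRENT_VERSION, OLDEST_LIVE_VERSION, _IGNORED_PARAMS_V2 and the extra normalization step guarded by the version check) are illustrative assumptions, not the real h.util.uri implementation:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Illustrative constants; in the real code these would live in h.util.uri.
CURRENT_VERSION = 2        # version used when persisting/indexing new annotations
OLDEST_LIVE_VERSION = 1    # oldest version whose stored data has not been fully migrated

# Hypothetical set of query params ignored from version 2 onwards.
_IGNORED_PARAMS_V2 = {"token"}


def normalize(uristr, version=None):
    """Normalize `uristr` using the given algorithm version (default: current)."""
    if version is None:
        version = CURRENT_VERSION

    scheme, netloc, path, query, _fragment = urlsplit(uristr)
    params = parse_qsl(query, keep_blank_values=True)

    if version >= 2:
        # Newer normalization steps are guarded by version checks so that older
        # versions remain reproducible during the migration window.
        params = [(key, value) for key, value in params if key not in _IGNORED_PARAMS_V2]

    return urlunsplit((scheme.lower(), netloc.lower(), path, urlencode(params), ""))


def normalize_all_live(uristr):
    """Return the distinct normalized forms of `uristr` across all live versions."""
    return {
        normalize(uristr, version=version)
        for version in range(OLDEST_LIVE_VERSION, CURRENT_VERSION + 1)
    }
```

h.storage.expand_uri (step 4) would then use something like normalize_all_live when building the set of URIs to search for, while persistence and indexing (step 5) would keep calling normalize with no version argument.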

The migration process would then work as follows:

  1. Increment the "current" version number for the normalization algorithm, but leave the "oldest live version" number the same (see the illustration after this list).
  2. In h.util.uri.normalize, add logic that applies the new normalization steps, guarded by if version >= X checks.
  3. Validate that the new normalization algorithm works as expected and deploy.
  4. Implement a data migration to update existing URIs in the database and search index which are affected by the change. There are currently ~22M annotations and ~5M document URIs in the production h DB, so scoping this migration to only update affected documents and annotations will generally make it run a lot faster.
  5. Validate, test and execute the data migration.
  6. Once the data migration is complete, increment the "oldest live version" to match the current version of the URI normalization algorithm.
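
In terms of the hypothetical constants from the earlier sketch, the migration window between steps 1 and 6 would look roughly like this:

```python
# Step 1: the new algorithm becomes the "current" version, but data written by
# the old version is still live in the database and search index.
CURRENT_VERSION = 2
OLDEST_LIVE_VERSION = 1   # searches must consider versions 1 and 2

# ... data migration (steps 4-5) runs ...

# Step 6: all stored URIs have been recomputed, so the old version is retired.
OLDEST_LIVE_VERSION = 2   # searches only need the current version again
```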

The above process assumes that we only have one production database to deal with, so we can keep a record of the "oldest live version" in the code. To support executing the migration in different instances at different times, we might need to persist the version in the database instead. This could be done as a single attribute for the whole database or on a more fine-grained basis (eg. an attribute per annotation and per document_uri row, as sketched below).
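
If we went the fine-grained route, a hypothetical Alembic migration could add a version column to each affected table; the column name and defaults here are illustrative, not an agreed design:

```python
"""Hypothetical migration adding a per-row URI normalization version."""
import sqlalchemy as sa
from alembic import op

revision = "add_uri_normalization_version"
down_revision = None  # would point at the previous revision in a real migration


def upgrade():
    for table in ("annotation", "document_uri"):
        op.add_column(
            table,
            sa.Column(
                "uri_normalization_version",
                sa.Integer,
                nullable=False,
                server_default="1",  # existing rows were normalized with version 1
            ),
        )


def downgrade():
    for table in ("annotation", "document_uri"):
        op.drop_column(table, "uri_normalization_version")
```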

For working through the details of this procedure, I think that https://github.com/hypothesis/h/pull/6567 is a good use case because it is a logically simple change that affects a large but not huge number of URLs (a maximum of ~170,000). I have done a quick prototype of steps 1-3 locally. The biggest chunk of work here is to figure out the details of the data migration.

robertknight commented 3 years ago

Some things we'd need to do in any data migration after changing the URI normalization algorithm:

  1. Update the target_uri_normalized field of each annotation where normalize(target_uri) != current value of target_uri_normalized (see the sketch at the end of this comment).
  2. Re-index all annotations changed as a result of (1).
  3. Update the uri_normalized and claimant_normalized fields of each document_uri where normalize(uri) != uri_normalized OR normalize(claimant) != claimant_normalized. There is a unique constraint on (claimant_normalized, uri_normalized, type, content_type), so this update may fail if there are multiple DocumentURI entries that have different normalized representations under the old algorithm but the same normalized representation under the new one.

For URIs which only appear as additional URIs for an annotation and not as the target_uri (eg. doi:XYZ URIs), ES reindexing is not needed, because the search index only contains the normalized target_uri. At search time the query URI (eg. a doi:XYZ URI) is expanded to the corresponding target_uris.
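
A rough sketch of step 1, assuming direct access to the SQLAlchemy Annotation model and the column names mentioned above; a production migration would batch the work and scope it to URLs actually affected by the algorithm change rather than scanning every row:

```python
from h.models import Annotation
from h.util.uri import normalize


def recompute_annotation_uris(session):
    """Recompute target_uri_normalized where the new algorithm gives a different value.

    Returns the ids of changed annotations, which then need re-indexing in
    Elasticsearch (step 2).
    """
    changed_ids = []

    # A real migration would iterate in batches and filter to affected URLs.
    for annotation in session.query(Annotation):
        new_value = normalize(annotation.target_uri)
        if new_value != annotation.target_uri_normalized:
            annotation.target_uri_normalized = new_value
            changed_ids.append(annotation.id)

    session.flush()
    return changed_ids
```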

robertknight commented 3 years ago

I just stumbled across a normalize-uris command in the h CLI tool which looks relevant. We probably can't use it as-is any more, given the current size of the production database, but perhaps we can use it as a reference for an admin task that performs selective re-computation of normalized URIs.

robertknight commented 3 years ago

I reflected on this again recently and came to the conclusion that, while being able to change the URI normalization algorithm is useful, it would be better to design the system so that we can change which parts of a URI are considered semantically meaningful (ie. they affect the returned content) without having to change the URI normalization algorithm and reindex existing data.

The rough approach I have in mind is to move the logic for ignoring query parameters that do not affect the identity of the document from annotation storage/indexing time to query time:

  1. When storing/indexing an annotation, split its normalized URL into tokens and index those. The scheme, host, each path segment and each query param would be separate tokens, eg. httpx://drive.google.com/uc?id=foobar&export=download&resourcekey=123 => ("httpx", "drive.google.com", "/uc", "id=foobar", "export=download", "resourcekey=123")
  2. At search time, normalize the query URL and tokenize it in the same way that was done during annotation indexing. Then extract the set of meaningful tokens and search for all indexed annotations whose URIs have the same set of meaningful tokens.

"Meaningful" tokens are those parts of the URL that affect the identity of the document. Initially this will be all of the URL tokens minus any query parameters that are in the current ignore list (see the sketch below). In future we could experiment with alternative policies for determining this set, such as switching from a blocklist to an allowlist of query parameters.

Because the set of meaningful tokens is determined at search time, we can change it much more easily, without needing to modify any already indexed data.
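
A minimal sketch of the tokenize-then-filter idea, assuming a hypothetical IGNORED_QUERY_PARAMS ignore list; the real implementation would plug into annotation indexing and the Elasticsearch query builder:

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical ignore list; initially this would mirror the existing
# BLACKLISTED_QUERY_PARAMS set.
IGNORED_QUERY_PARAMS = {"token", "utm_source"}


def tokenize(normalized_url):
    """Split a normalized URL into scheme, host, path-segment and query-param tokens.

    These are the tokens that would be indexed alongside each annotation, eg.
    tokenize("httpx://drive.google.com/uc?id=foobar&export=download&resourcekey=123")
    => ["httpx", "drive.google.com", "/uc", "id=foobar", "export=download", "resourcekey=123"]
    """
    scheme, netloc, path, query, _fragment = urlsplit(normalized_url)
    tokens = [scheme, netloc]
    tokens += ["/" + segment for segment in path.split("/") if segment]
    tokens += [f"{key}={value}" for key, value in parse_qsl(query, keep_blank_values=True)]
    return tokens


def meaningful_tokens(normalized_url):
    """Return the tokens that affect document identity (applied at query time)."""
    return {
        token
        for token in tokenize(normalized_url)
        if "=" not in token or token.split("=", 1)[0] not in IGNORED_QUERY_PARAMS
    }
```

Because only meaningful_tokens depends on the ignore list, changing that list (or switching it to an allowlist) only changes query behaviour; the tokens already stored in the index stay valid.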