hypothesis / support-legacy

a place for tracking support-related work and projects

[HTML & API] Incorrect title and incorrect URI lookup with https://www.rfc-editor.org URL #256

Open kael opened 2 years ago

kael commented 2 years ago

Describe the bug

To Reproduce

Steps to reproduce the behavior:

  1. Go to https://www.rfc-editor.org/rfc/rfc8984.html
  2. Click on the bookmarklet
  3. Notice orphan annotations
  4. Bookmark the RFC
  5. The bookmarked RFC is titled "RFC 8783: Distributed Denial-of-Service Open Threat Signaling (DOTS) Data Channel Specification" instead of "RFC 8984 JSCalendar: A JSON Representation of Calendar Data"
  6. Searching the API for annotations with the RFC URL returns annotations for other RFCs (see annotation.uri and annotation.target.source[0]): https://api.hypothes.is/api/search?uri=https://www.rfc-editor.org/rfc/rfc8984.html
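To make step 6 easier to check, here is a minimal Python sketch that builds the search URL above and counts how many returned annotations belong to each URI. The helper names (`build_search_url`, `count_uris`, `fetch_uri_counts`) are hypothetical, and the response shape (a `rows` list whose items carry a `uri` field) is an assumption based on the fields mentioned in step 6.

```python
import json
from collections import Counter
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the URL in step 6.
SEARCH_ENDPOINT = "https://api.hypothes.is/api/search"


def build_search_url(page_uri: str) -> str:
    """Return the search URL for a page, with the `uri` parameter percent-encoded."""
    return f"{SEARCH_ENDPOINT}?{urlencode({'uri': page_uri})}"


def count_uris(response: dict) -> Counter:
    """Count annotations per `uri` (assumes the response holds a `rows` list)."""
    return Counter(row.get("uri", "") for row in response.get("rows", []))


def fetch_uri_counts(page_uri: str) -> Counter:
    """Query the public search endpoint and summarise the URIs it returns."""
    with urlopen(build_search_url(page_uri)) as resp:
        return count_uris(json.load(resp))
```

If the lookup behaved as expected, `fetch_uri_counts("https://www.rfc-editor.org/rfc/rfc8984.html")` would contain only the RFC 8984 URL; any other key indicates annotations leaking in from another RFC.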

Expected behavior

Additional comment

The HTML code of the page doesn't contain a canonical URL but the following metadata:

<meta content="8984" name="rfc.number">
<!-- Generator version information:  ... -->
<link href="rfc8984.xml" rel="alternate" type="application/rfc+xml">
<link href="#copyright" rel="license">
<!-- ... -->
<link href="https://dx.doi.org/10.17487/rfc8984" rel="alternate">
<link href="urn:issn:2070-1721" rel="alternate">
<link href="https://datatracker.ietf.org/doc/draft-ietf-calext-jscalendar-32" rel="prev">

robertknight commented 2 years ago

From the above HTML snippet, it looks like this one is the same across different RFCs:

<link href="urn:issn:2070-1721" rel="alternate">

This ISSN refers to the whole collection of RFCs. However, I believe Hypothesis treats such links as alternate URIs for the page's own URI, and so it creates an equivalence relation between the specific RFC you are annotating and this ISSN. Since a link to this same identifier is created for all RFCs, fetching annotations for one RFC will return entries for others.

Since ISSNs in general refer to ongoing publications or collections rather than specific articles, we could treat them specially.
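A minimal sketch of what such special-casing could look like, written in Python with only the standard library. This is not Hypothesis's actual code: `AlternateLinkParser`, `is_collection_identifier`, and `equivalent_uris` are hypothetical names, and the rule "skip any `urn:issn:` alternate link" is an assumption about one possible fix.

```python
from html.parser import HTMLParser


class AlternateLinkParser(HTMLParser):
    """Collect the href of every <link rel="alternate"> element."""

    def __init__(self):
        super().__init__()
        self.alternates = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        href = attr.get("href")
        if "alternate" in rels and href:
            self.alternates.append(href)


def is_collection_identifier(uri):
    """ISSNs name a whole serial or collection, not a single document."""
    return uri.lower().startswith("urn:issn:")


def equivalent_uris(html):
    """Return alternate links that may safely be treated as document equivalents."""
    parser = AlternateLinkParser()
    parser.feed(html)
    return [u for u in parser.alternates if not is_collection_identifier(u)]
```

Run against the metadata quoted in the report, this would keep `rfc8984.xml` and the DOI link but drop `urn:issn:2070-1721`, so two different RFCs would no longer share an identifier.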

kael commented 2 years ago

> Since ISSNs in general refer to ongoing publications or collections rather than specific articles, we could treat them specially.

:+1:

As an aside, is there any internal tool for fixing wrong annotation metadata (like the document title) once a fix is available, or would current annotations stay the same, with the fix only being applied to future URIs?

robertknight commented 2 years ago

> As an aside, is there any internal tool for fixing wrong annotation metadata (like the document title) once a fix is available

No, I'm afraid not. We've been discussing internally that we need to do an overhaul of how document equivalence/metadata works in Hypothesis and create tools for this purpose.

kael commented 2 years ago

> No, I'm afraid not. We've been discussing internally that we need to do an overhaul of how document equivalence/metadata works in Hypothesis and create tools for this purpose.

Alright then.

Yes, some clarification of document uniqueness would be nice, along with the problem of the bookmarked page's title not being the expected one but apparently the one from the initial bookmark. If the first bookmark sends wrong metadata to the Hypothesis server (e.g. because of SPA client-side navigation and/or a canonical URL that wasn't updated correctly), that wrong metadata pollutes the DB forever and creates unexpected annotations.

Anyway, I'll be waiting for that ISSN-related bug to be fixed, 'cos I have some RFCs to bookmark. :smile:

Cheers!

kael commented 1 year ago

> Since ISSNs in general refer to ongoing publications or collections rather than specific articles, we could treat them specially.

I've just realized that Hypothesis used to index ISSNs the way it indexes DOIs, but it seems to have stopped; there are still some cases of ISSN-indexed content.

It'd be awesome if you could index ISSNs (and ISBNs as well). An IS[SB]N wildcard search would be neat too.