hypothesis / product-backlog

Where new feature ideas and current bugs for the Hypothesis product live

API searches of DOI should be case insensitive #1174

Open esanzgar opened 3 years ago

esanzgar commented 3 years ago

General explanation

API searches by DOI should return the same number of annotations regardless of the DOI's casing.

This issue is related to:

Steps to reproduce

  1. curl "https://hypothes.is/api/search?_separate_replies=false&group=NMb8iAjd&uri=doi:10.1097/JOM.0000000000000063"
  2. curl "https://hypothes.is/api/search?_separate_replies=false&group=NMb8iAjd&uri=doi:10.1097/jom.0000000000000063"

Expected behaviour

Requests 1 and 2 should return the same number of annotations.

Actual behaviour

Request 1 returns 3 annotations, while request 2 returns 0.
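The comparison above can be scripted. This is a minimal sketch, not part of the original report: the helper names (`search_url`, `case_variants`, `annotation_count`) are hypothetical, and it assumes the search response is a JSON object with a `total` field and that the group is readable without an API token.

```python
import json
import urllib.parse
import urllib.request

API = "https://hypothes.is/api/search"


def search_url(doi, group="NMb8iAjd"):
    """Build the same search URL used in the steps to reproduce."""
    params = {"_separate_replies": "false", "group": group, "uri": f"doi:{doi}"}
    # Keep ':' and '/' unescaped so the URL matches the curl commands.
    return f"{API}?{urllib.parse.urlencode(params, safe=':/')}"


def case_variants(doi):
    """Return the DOI as given, lowercased, and uppercased, without duplicates."""
    return list(dict.fromkeys([doi, doi.lower(), doi.upper()]))


def annotation_count(url):
    """Fetch a search URL and read the (assumed) `total` field of the response."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["total"]


def compare(doi):
    """Print the annotation count for each case variant (needs network access)."""
    for variant in case_variants(doi):
        print(variant, annotation_count(search_url(variant)))
```

Calling `compare("10.1097/JOM.0000000000000063")` should print the same count for every variant once the issue is fixed.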

Additional details

DOIs are case-insensitive according to the official documentation: https://www.doi.org/doi_handbook/2_Numbering.html#2.4

This is likely affecting ~350 publications in Europe PMC, and many more in PubMed and other publisher sites.

Example in PubMed: https://pubmed.ncbi.nlm.nih.gov/24806729/ (this page should display one annotation in the public group).

esanzgar commented 3 years ago

The `uri` property in the Elasticsearch schema:

https://github.com/hypothesis/h/blob/e92e7f23c117326b8fdc866555c456944528e65b/h/search/config.py#L56-L60

is associated with a custom uri analyzer:

https://github.com/hypothesis/h/blob/e92e7f23c117326b8fdc866555c456944528e65b/h/search/config.py#L67-L104

I don't know enough about Elasticsearch, but it seems plausible that the analyzer could be modified to treat DOIs (doi:.prefix/suffix) in a case-insensitive manner, while treating the rest of the URIs as it currently does.
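To illustrate the intended behaviour only, here is a minimal Python sketch of that rule: fold the case of `doi:` URIs and leave every other URI untouched. It is not the Elasticsearch analyzer configuration in `h`; `normalize_uri` is a hypothetical helper, and lowercasing is an arbitrary choice of canonical form (any consistent case folding would do).

```python
def normalize_uri(uri):
    """Lowercase DOI URIs; return any other URI unchanged."""
    scheme, sep, rest = uri.partition(":")
    if sep and scheme.lower() == "doi":
        # DOIs are case-insensitive, so fold the whole identifier.
        return "doi:" + rest.lower()
    # Paths and queries of other URIs can be case-sensitive: leave them alone.
    return uri


upper = normalize_uri("doi:10.1097/JOM.0000000000000063")
lower = normalize_uri("doi:10.1097/jom.0000000000000063")
print(upper == lower)  # True
print(normalize_uri("https://example.com/Some/Path"))  # https://example.com/Some/Path
```

For the two requests to match, the same normalization would have to be applied both when annotations are indexed and when the search query is analyzed.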