hypothesis / support-legacy

a place for tracking support-related work and projects

Anchoring issues at Hybrid Pedagogy #164

Open dwhly opened 3 years ago

dwhly commented 3 years ago

Describe the bug A large percentage (> 30%) of the 800+ annotations are orphaned or misanchored.

To Reproduce Steps to reproduce the behavior:

  1. Go to https://hybridpedagogy.org/resisting-edtech/

Expected behavior Annotations will be properly anchored

Screenshots [Image: Screenshot 2020-12-09 at 7 26 24 PM]

Additional context It seems there are two problems here:

1- I suspect there is a document equivalence issue here, perhaps originating from this line: <link rel="alternate" type="application/rss+xml" title="Hybrid Pedagogy" href="https://hybridpedagogy.org/rss/" />

2- Even if many more annotations are being fetched than still apply to this page, they still shouldn't misanchor so completely. At worst, they should orphan.

robertknight commented 3 years ago

1- I suspect there is a document equivalence issue here, perhaps originating from this line:

There is definitely a document equivalence problem here.

Even if many more annotations are being fetched than still apply to this page, they still shouldn't misanchor so completely. At worst, they should orphan.

Currently the client's fuzzy anchoring is based on character-level edit distance, with no semantic understanding of the text. Once https://github.com/hypothesis/client/pull/2779 lands we'll have a knob that we can use to tune the maximum edit distance allowed. This will allow easy tuning of the recall of quote matching (the proportion of quotes that are anchored), but it won't directly help with precision (which of the anchored quotes are good semantic matches for the original text).

In future we could improve this by computing edit distance on units of text that have more semantic meaning (i.e. words or phrases) and taking better account of context.
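
To make the character-versus-word distinction concrete, here is a minimal sketch in Python. It is not the client's matching code: `edit_distance`, `within_budget` and the `max_error_ratio` parameter are hypothetical names, and the example sentences are made up. It only illustrates how the same pair of strings can look like a close match at the character level and a poor one at the word level.

```python
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(
                prev[j] + 1,         # delete item_a
                curr[j - 1] + 1,     # insert item_b
                prev[j - 1] + cost,  # substitute, or keep on a match
            ))
        prev = curr
    return prev[-1]


def within_budget(quote: Sequence, candidate: Sequence, max_error_ratio: float) -> bool:
    """Hypothetical acceptance rule: errors may be at most a fraction of the quote length."""
    return edit_distance(quote, candidate) <= max_error_ratio * len(quote)


quote = "the cat sat on the mat"
candidate = "the car sat on the man"

# Two of 22 characters differ, so the character-level match looks close.
char_errors = edit_distance(quote, candidate)
# Two of 6 words differ, so the word-level match looks much weaker.
word_errors = edit_distance(quote.split(), candidate.split())

accepted_by_chars = within_budget(quote, candidate, 0.2)
accepted_by_words = within_budget(quote.split(), candidate.split(), 0.2)
```

With the same 20% error budget, the character-level comparison accepts the candidate while the word-level comparison rejects it, which is the kind of precision gain the word-based approach is hoping for.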

dwhly commented 3 years ago

Actually it looks like we ignore rss+xml... https://github.com/hypothesis/client/blob/734e3a25318364819a8c38ef881e4788a2b06365/src/annotator/plugin/document.js#L180
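
For illustration only, the kind of filtering that line performs could be sketched like this in Python. This is not the client's code (which is JavaScript): `LinkCollector` and `FEED_TYPES` are hypothetical names, treating Atom feeds the same way as RSS is an assumption, and the canonical link in the sample markup is made up.

```python
from html.parser import HTMLParser

# Assumption: Atom feeds are skipped alongside RSS feeds.
FEED_TYPES = ("application/rss+xml", "application/atom+xml")


class LinkCollector(HTMLParser):
    """Collects (rel, href) pairs from <link> tags, skipping alternate links to feeds."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower()
        link_type = (attr.get("type") or "").lower()
        if rel == "alternate" and link_type.startswith(FEED_TYPES):
            return  # a site-wide feed says nothing about this page's identity
        if attr.get("href"):
            self.links.append((rel, attr["href"]))


html = (
    '<link rel="alternate" type="application/rss+xml" title="Hybrid Pedagogy" '
    'href="https://hybridpedagogy.org/rss/" />'
    # Hypothetical canonical link, added only so that something survives the filter.
    '<link rel="canonical" href="https://hybridpedagogy.org/resisting-edtech/" />'
)
collector = LinkCollector()
collector.feed(html)
links = collector.links
```

Only the canonical link ends up in `links`; the feed URL is never reported as another location of the page.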

judell commented 3 years ago

Our document_id for https://hybridpedagogy.org/resisting-edtech/ is 3304 (a very early number!).

There are a whole bunch of URLs bound to 3304, at both hybridpedagogy.org and its predecessor digitalpedagogylab.com.

This goes back a long way, and I suspect the bogus equivalences happened before we excluded the RSS feeds that led to these kinds of combinatorial explosions. It may even have been the case that prompted that exclusion.

As we realized in our last discussion about doc equiv, there may be no rational or easy way to untangle things, since equivalences destroy pre-existing document ids.
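
A toy model of how this can happen, as a sketch only: it is not how our document store is implemented, `DocumentIndex` and its methods are hypothetical, and the example.org URLs are made up. Each URI starts out as its own document, and any claim that two URIs are equivalent merges their documents.

```python
class DocumentIndex:
    """Toy union-find over URIs: equivalent URIs end up sharing one root."""

    def __init__(self):
        self._parent = {}

    def _find(self, uri):
        # Unknown URIs start out as their own single-URI document.
        self._parent.setdefault(uri, uri)
        while self._parent[uri] != uri:
            # Path halving: point at the grandparent while walking to the root.
            self._parent[uri] = self._parent[self._parent[uri]]
            uri = self._parent[uri]
        return uri

    def claim_equivalent(self, uri_a, uri_b):
        root_a, root_b = self._find(uri_a), self._find(uri_b)
        if root_a != root_b:
            # The merge overwrites root_b's identity; no history is kept.
            self._parent[root_b] = root_a

    def same_document(self, uri_a, uri_b):
        return self._find(uri_a) == self._find(uri_b)


index = DocumentIndex()
feed = "https://example.org/rss/"
articles = [f"https://example.org/article-{n}/" for n in range(1, 4)]

# Every article page advertises the same site-wide feed as an alternate.
for article in articles:
    index.claim_equivalent(article, feed)

merged = index.same_document(articles[0], articles[2])
```

After the loop, all three articles resolve to the same document even though no page ever claimed to be equivalent to another article. Because the merge keeps no record of which claim caused it, there is nothing to replay in order to split the documents apart again.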

More recent articles at hybridpedagogy.org don't exhibit the problem. For example, https://hybridpedagogy.org/syllabus-metaphor/ has the document_id 686787, and isn't connected to any other articles.