hypothesis / h

Annotate with anyone, anywhere.
https://hypothes.is/
BSD 2-Clause "Simplified" License

"Linus" matches "chines" on a digital book #2634

Closed dwhly closed 8 years ago

dwhly commented 8 years ago

Go here:

http://quod.lib.umich.edu/d/dh/13474172.0001.001/1:3/--ethical-programs-hospitality-and-the-rhetorics-of-software?g=dculture;rgn=div1;view=fulltext;xc=1

You'll see that there is an annotation on the string "chines", which is part of the sentence "Our computational machines are constantly engaging in conversations, extending and accepting invitations, deciding who or what gets to enter or not."

The selection on the original annotation is actually "Linus" which is from this annotation https://hypothes.is/a/qUfzE496S6mLWnDd4vlb1g, which originally was made on the "Dedication" chapter here: http://quod.lib.umich.edu/d/dh/13474172.0001.001/1:1/--ethical-programs-hospitality-and-the-rhetorics-of-software?g=dculture;rgn=div1;view=fulltext;xc=1

Our fuzzy anchoring tech seems overly aggressive on this short bit.

judell commented 8 years ago

I expect this will resolve when the doc equivalence work is completed.

Checking API results for:

http://quod.lib.umich.edu/d/dh/13474172.0001.001/1:3/--ethical-programs-hospitality-and-the-rhetorics-of-software?g=dculture;rgn=div1;view=fulltext;xc=1

Results include a mixture of URLs with /1:3/ and /1:1/

        "uri": "http://quod.lib.umich.edu/d/dh/13474172.0001.001/1:3/--ethical-programs-hospitality-and-the-rhetorics-of-software?g=dculture;rgn=div1;view=fulltext;xc=1",
        "uri": "http://quod.lib.umich.edu/d/dh/13474172.0001.001/1:3/--ethical-programs-hospitality-and-the-rhetorics-of-software?g=dculture;rgn=div1;view=fulltext;xc=1",
        "uri": "http://quod.lib.umich.edu/d/dh/13474172.0001.001/1:3/--ethical-programs-hospitality-and-the-rhetorics-of-software?g=dculture;rgn=div1;view=fulltext;xc=1",
        "uri": "http://quod.lib.umich.edu/d/dh/13474172.0001.001/1:3/--ethical-programs-hospitality-and-the-rhetorics-of-software?g=dculture;rgn=div1;view=fulltext;xc=1",
        "uri": "http://quod.lib.umich.edu/d/dh/13474172.0001.001/1:3/--ethical-programs-hospitality-and-the-rhetorics-of-software?g=dculture;rgn=div1;view=fulltext;xc=1",
        "uri": "http://quod.lib.umich.edu/d/dh/13474172.0001.001/1:3/--ethical-programs-hospitality-and-the-rhetorics-of-software?g=dculture;rgn=div1;view=fulltext;xc=1",
        "uri": "http://quod.lib.umich.edu/d/dh/13474172.0001.001/1:3/--ethical-programs-hospitality-and-the-rhetorics-of-software?g=dculture;rgn=div1;view=fulltext;xc=1",
        "uri": "http://quod.lib.umich.edu/d/dh/13474172.0001.001/1:3/--ethical-programs-hospitality-and-the-rhetorics-of-software?g=dculture;rgn=div1;view=fulltext;xc=1",
        "uri": "http://quod.lib.umich.edu/d/dh/13474172.0001.001/1:3/--ethical-programs-hospitality-and-the-rhetorics-of-software?g=dculture;rgn=div1;view=fulltext;xc=1",
        "uri": "http://quod.lib.umich.edu/d/dh/13474172.0001.001/1:3/--ethical-programs-hospitality-and-the-rhetorics-of-software?g=dculture;rgn=div1;view=fulltext;xc=1",
        "uri": "http://quod.lib.umich.edu/d/dh/13474172.0001.001/1:1/--ethical-programs-hospitality-and-the-rhetorics-of-software?g=dculture;rgn=div1;view=fulltext;xc=1",
        "uri": "http://quod.lib.umich.edu/d/dh/13474172.0001.001/1:1/--ethical-programs-hospitality-and-the-rhetorics-of-software?g=dculture;rgn=div1;view=fulltext;xc=1",
        "uri": "http://quod.lib.umich.edu/d/dh/13474172.0001.001/1:1/--ethical-programs-hospitality-and-the-rhetorics-of-software?g=dculture;rgn=div1;view=fulltext;xc=1",

Propose closing this now, reopening if it persists after doc equivalence work is completed.

jeremydean commented 8 years ago

this is a pretty bad look for the current annotation happening as it's page one of the intro for the book we are annotating this week.

can we archive the problematic annotation in the short term? or manually delete it from the intro?

http://quod.lib.umich.edu/d/dh/13474172.0001.001/1:3/--ethical-programs-hospitality-and-the-rhetorics-of-software?g=dculture;rgn=div1;view=fulltext;xc=1

it was originally made on the dedication:

http://quod.lib.umich.edu/d/dh/13474172.0001.001/1:1/--ethical-programs-hospitality-and-the-rhetorics-of-software?g=dculture;rgn=div1;view=fulltext;xc=1

dwhly commented 8 years ago

Regardless of doc equivalence, we should never falsely anchor like this. Can we let this card be a placeholder for this phenomenon?

tilgovi commented 8 years ago

@dwhly you can't "never falsely anchor" and have "fuzzy anchoring" as we've usually described it.

dwhly commented 8 years ago

I think at one point we'd discussed a minimum char length for matching just the selection.

tilgovi commented 8 years ago

It used to be that if the selection was short (I can't remember, maybe 32 or 64 characters) then the prefix and suffix had to match.

I removed that restriction because the test @judell did on each iteration of my no-dtm work showed that this caused an unacceptable number of old annotations to not anchor.
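
A minimal sketch of the restriction described above, assuming a quote selector with `exact`, `prefix`, and `suffix` strings. The function names and the 32-character cutoff are hypothetical (the comment only recalls "maybe 32 or 64"), and this is not the code that was removed.

```python
# Hypothetical cutoff; the thread recalls "maybe 32 or 64 characters".
MIN_UNAMBIGUOUS_LENGTH = 32


def context_matches(text, start, end, prefix, suffix):
    """True if the text around text[start:end] equals the stored context exactly."""
    return text[:start].endswith(prefix) and text[end:].startswith(suffix)


def accept_candidate(text, start, end, exact, prefix, suffix):
    """Accept a candidate span; short quotes must also match prefix and suffix."""
    if len(exact) >= MIN_UNAMBIGUOUS_LENGTH:
        # Long quotes are considered distinctive enough on their own.
        return True
    return context_matches(text, start, end, prefix, suffix)
```

The trade-off is the one noted above: a rule like this rejects the "Linus"/"chines" case, but it also orphans any short annotation whose surrounding text has been edited.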

This could be re-enabled, but it's not obvious to me that it's a good idea. To date, no framework for objectively evaluating the quality of anchoring has even been discussed.

My point stands, though. You cannot have perfect, fuzzy anchoring. There is no fix for this issue that I can imagine which does not cause other, similar issues.

tilgovi commented 8 years ago

So, if I can be more useful, what I should say is that the next step here is for you to expound on what you mean by "like this".

tilgovi commented 8 years ago

In other words, describe what about this anchoring makes this failure unacceptable as opposed to other failures.

tilgovi commented 8 years ago

I'm sorry. I shouldn't bait you into trying to describe how that failure is unacceptable.

Improving anchoring in Hypothesis is a difficult problem that has to balance subjective and objective measures of correctness. Individual issues regarding anchoring that are filed against this repository are not likely to be resolvable or actionable. More likely, finding better ways to do approximate matching on the Web needs to be an ongoing research project.

Mostly, I just don't think this issue is helping you. I think it's a distraction that follows from unrealistic expectations.

tilgovi commented 8 years ago

To offer one concrete direction for future work, though, there is one thing I feel confident would be useful.

Develop a JavaScript bitap implementation that treats locality and edit distance separately. The Google Diff-Match-Patch implementation used by tilgovi/dom-anchor-text-quote matches anything that scores above a given threshold parameter, and that score is a combination of distance from expected location and edit distance. For this reason, it's not possible to set a threshold that both allows large moves and disallows large edits.
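
To illustrate the point, here is a rough Python sketch (not the proposed JavaScript implementation). `combined_score` follows my reading of how Diff-Match-Patch scores a candidate: error ratio plus distance from the expected location divided by a match-distance setting, with lower scores being better and a candidate accepted when its score is within the threshold. `accept_separate` and its limits are hypothetical.

```python
def combined_score(errors, pattern_len, found_at, expected_at, match_distance):
    """Single score mixing edit errors and locality (lower is better)."""
    accuracy = errors / pattern_len
    proximity = abs(found_at - expected_at)
    return accuracy + proximity / match_distance


def accept_combined(errors, pattern_len, found_at, expected_at,
                    match_distance=1000, threshold=0.5):
    """Accept when the mixed score is within one shared threshold."""
    score = combined_score(errors, pattern_len, found_at, expected_at, match_distance)
    return score <= threshold


def accept_separate(errors, pattern_len, found_at, expected_at,
                    max_error_ratio=0.1, max_move=50000):
    """Hypothetical alternative: bound edits and movement independently."""
    small_edit = errors / pattern_len <= max_error_ratio
    small_move = abs(found_at - expected_at) <= max_move
    return small_edit and small_move
```

With the combined score, an exact quote that moved 2000 characters is rejected at `match_distance=1000`. Raising `match_distance` to 10000 lets it through, but a 5-character quote with 2 errors at the expected location scores 0.4 and is accepted under the same threshold. Separate limits can accept the first case and reject the second.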

So, that's what I mean when I say that you cannot just resolve this issue. You must implement whole new anchoring methods and evaluate them.

tilgovi commented 8 years ago

This would also be useful for the FindText API that @shepazu has proposed.

judell commented 8 years ago

Here's the metadata in every page of the series:

<meta name="DC.identifier" content="10.3998/dh.13474172.0001.001">
<meta name="citation_doi" content="10.3998/dh.13474172.0001.001">

So before we even consider fuzzy anchoring I think we need to sort out the aliasing issue for docs that alias to a single identifier but are annotated with distinct URLs.
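
As a sketch of how the aliasing shows up, the following uses Python's standard `html.parser` to pull the identifiers a page declares. The `document_identifiers` helper is hypothetical; how Hypothesis actually collects document metadata is not shown in this thread.

```python
from html.parser import HTMLParser


class IdentifierParser(HTMLParser):
    """Collect DC.identifier, citation_doi, and rel=canonical from a page."""

    def __init__(self):
        super().__init__()
        self.identifiers = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") in ("DC.identifier", "citation_doi"):
            self.identifiers[attrs["name"]] = attrs.get("content")
        elif tag == "link" and attrs.get("rel") == "canonical":
            self.identifiers["canonical"] = attrs.get("href")


def document_identifiers(html):
    """Return a dict of the identifiers declared in the given HTML."""
    parser = IdentifierParser()
    parser.feed(html)
    return parser.identifiers
```

Two chapters that declare the same `citation_doi` would compare equal on that key even though their URLs differ, which is the aliasing described above.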

tilgovi commented 8 years ago

Worth looking at this, perhaps: http://www.crossref.org/06members/best_practices_for_books.html

Deposit DOIs at the title and chapter/entry level.

It's hard to say what should be done when content claims to be the same as another one but isn't.

tilgovi commented 8 years ago

Although, in this case the rel=canonical is different.

judell commented 8 years ago

"Although, in this case the rel=canonical is different."

Exactly. Which suggests maybe prioritizing rel=canonical over, e.g., citation_doi.

tilgovi commented 8 years ago

What could "prioritize" mean? Either the URI is treated as equivalent for the purposes of fetching annotations or it isn't.

shepazu commented 8 years ago

The FindText API would not resolve issues like this. It could be used to more efficiently find possible matches, but the webapp (or browser) would still need to evaluate the matches for suitability and best match within the document. Strategies like the ones @tilgovi proposes would be useful at that evaluation level.

In this case, since these are 2 different URLs, I tend to agree that some refinement needs to be done in how doc equivalence, citation_doi, and so on, are handled. Perhaps if the DOI is the same but the URL (canonical or actual) is different between the current doc and the annotation's target, the edit distance tolerance should be decreased?

Or maybe Annotator should introduce the concept of a "partial match", which filters out "matches" which are too far past some edit-distance threshold to be included with any confidence, but which would allow users (or maybe only admins) to see the "mismatches" if they really wanted to. Perhaps even give the user/annotator/author/admin some tweaking features to let them try to match different results, and re-anchor orphaned annotations (which could then be data-mined to improve the algorithm overall). That could be done without undermining user confidence by mis-anchoring.
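
The two suggestions above could be sketched roughly as follows. All names and numeric values here are hypothetical placeholders, not tuned settings or existing Annotator features.

```python
def tolerance_for(same_url, same_doi, base=0.3, reduced=0.1):
    """Pick an edit-distance tolerance (as a ratio of the quote length)."""
    if same_url:
        return base      # annotation was made on this very URL
    if same_doi:
        return reduced   # same DOI, different URL: be stricter
    return 0.0           # unrelated document: exact matches only


def classify(error_ratio, tolerance, partial_margin=0.2):
    """Sort a candidate into 'anchored', 'partial', or 'orphan'."""
    if error_ratio <= tolerance:
        return "anchored"   # shown to everyone
    if error_ratio <= tolerance + partial_margin:
        return "partial"    # hidden by default, viewable on request
    return "orphan"
```

For example, a candidate with an error ratio of 0.25 would be anchored on the original URL, but only a partial match on a different URL that shares the DOI.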

Honestly, I'm surprised that this was marked as a match at all… not only is the edit distance relatively high (20-25% of the total selection length) on a short selection, but the prefix and suffix are wildly different, as is the character distance. Robust anchoring should not only take into account the idea that text can change or move, but also that it can be deleted. In a scenario where the edit distance percentage is high and the selection short, the prefix and suffix edit distance should be low, or you'll end up making too many matches.
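
A minimal sketch of that last rule, using a plain Levenshtein distance. The function names, the 32-character cutoff, and the 0.2 limits are hypothetical; the function encodes only this one check and leaves long or near-exact quotes to whatever other checks apply.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def error_ratio(stored, found):
    """Edit distance as a fraction of the stored string's length."""
    return levenshtein(stored, found) / max(len(stored), 1)


def context_check_passes(exact, found, prefix, found_prefix, suffix, found_suffix,
                         short_len=32, max_quote_error=0.2, max_context_error=0.2):
    """Short, heavily edited quotes must have closely matching context."""
    if len(exact) >= short_len or error_ratio(exact, found) <= max_quote_error:
        return True
    return (error_ratio(prefix, found_prefix) <= max_context_error
            and error_ratio(suffix, found_suffix) <= max_context_error)
```

A short quote that differs substantially from the candidate text is accepted here only when the surrounding prefix and suffix are close to the stored ones.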

BTW, in the longer-term, another technique that could be used is capturing the intent of the annotation, as in the Web Annotation Data Model roles/motives. If the user is "editing" (that is, with the copy-edit use case) rather than merely "commenting", then there might be more of an expectation that the selection will change (and thus have a higher edit-distance tolerance) than in normal scenarios.

nickstenning commented 8 years ago

https://github.com/tilgovi/dom-anchor-text-quote/pull/1 addresses this issue.