hypothesis / product-backlog

Where new feature ideas and current bugs for the Hypothesis product live

SPIKE: math expressions written with mathjax.js break annotation (L) #918

Open klemay opened 5 years ago

klemay commented 5 years ago

I think this is separate from https://github.com/hypothesis/product-backlog/issues/414, as #414 refers to writing math expressions in our text editor, whereas this issue is regarding highlighting math expressions as part of an annotation.

Steps to reproduce

  1. Go to https://lilianweng.github.io/lil-log/2018/10/13/flow-based-deep-generative-models.html
  2. Highlight text that contains a math expression, for example the paragraph under the heading "Jacobian Matrix and Determinant"
  3. Click the Annotate button on the adder

Expected behaviour

The highlighted text appears in a new annotation card and you are able to type your annotation in the text editor

Actual behaviour

It's inconsistent. Sometimes a new annotation card is created, but the sidebar doesn't pop open; sometimes nothing happens at all. Even when the annotation card is created, though, the sidebar is very slow to open and the math expressions are not rendered in an intelligible way in the quoted text.

Browser/system information

Reported by a user and replicated by me on Chrome, Firefox, and Safari for Mac.

Additional information

I assume for this to work properly, we'd somehow need to convert what is rendered by mathjax.js back to the original input text? This seems like a feature request rather than a bug to me.

robertknight commented 5 years ago

> I think this is separate from #414, as #414 refers to writing math expressions in our text editor,

I agree. This is a completely separate issue.

sean-fitzpatrick commented 4 years ago

I just came across this after discovering that highlighting mathematics doesn't work. Hypothes.is is new to me: it was suggested as a tool by our Teaching Centre, so I tried adding it to one of my books: http://www.cs.uleth.ca/~fitzpat/apex-hypothesis/sec_continuity.html

If I try to highlight any MathJax-rendered text, I get an annotation mark on the right-hand side, but no highlighting. This is probably because MathJax is also JavaScript, so now you have two competing JavaScript components trying to render that piece of the page.

If you try to highlight a paragraph containing math (which is almost every paragraph in a calculus book!) the highlighting extends only up to the first appearance of math.

judell commented 4 years ago

Thanks for pointing this out, @sean-fitzpatrick.

I'm curious how one selects anything at all in the latest v3 of MathJax. I tried their sample page and couldn't.

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width">
  <title>MathJax example</title>
  <script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
  <script id="MathJax-script" async
          src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js">
  </script>
</head>
<body>
<p>
  When \(a \ne 0\), there are two solutions to \(ax^2 + bx + c = 0\) and they are
  \[x = {-b \pm \sqrt{b^2-4ac} \over 2a}.\]
</p>
</body>
</html>

Here is the DOM:

[image: screenshot of the DOM]

The "a" in "2a" lives in the CSS content property of a ::before pseudo-element rather than in a text node. That seems not to be a thing you can select. But maybe there's a mode that makes it so?

sean-fitzpatrick commented 4 years ago

Books built with PreTeXt (like mine) are not yet using MathJax 3. (There are some things at risk of breaking that need to be addressed before PreTeXt can upgrade.) The reason things look so complicated when you inspect MathJax is that there is a lot of accessibility support built in. In particular, there are navigation tools that work with screen readers so that a blind reader can parse that content. MathJax also makes it possible to export your math to Nemeth Braille! :-)

If you right click on a MathJax element, you can choose to change the display mode to SVG, but that doesn't help with highlighting.

judell commented 4 years ago

> If you right click on a MathJax element, you can choose to change the display mode to SVG, but that doesn't help with highlighting.

Huh. Live and learn. It's a tricky challenge, to be sure. I'm not sure what the right answer is; maybe a LaTeX mode for selection?

sean-fitzpatrick commented 4 years ago

OK -- some success.

sean-fitzpatrick commented 4 years ago

Follow-up. I asked Davide Cervone of MathJax about this on their Google group. Here is his response:

OK, here's what's happening: apparently Hypothesis doesn't just mark the beginning and ending of the annotation, but tries to drill down into the containing HTML and mark individual text regions separately (probably to make it possible to have an annotation cross tag boundaries). For MathJax output, that means it wraps each symbol in a separate annotation tag. In CommonHTML output, the characters are actually set as 0 height and are contained in a surrounding tag that adds the proper height and depth (so that the bounding box of the character is tight rather than that of the line height as a whole). Because the annotation tag is inside the container that gives the character its height and depth, that means the annotation height is 0, and it doesn't show up (even though it is there).

The HTML-CSS output doesn't try to make the bounding boxes of the characters be correct, and so when Hypothesis inserts the annotations, they are not zero height, and so show up.

In trying to be smart about the annotations, Hypothesis is getting itself in trouble when dealing with MathJax output. If it were not to descend into tags that are completely within its annotation, for example, then it would be able to highlight CHTML output as well as HTML-CSS output.

Davide

Using HTML-CSS isn't an option: it's deprecated in MathJax v2, and gone in v3. For typical student use with things as they are now, I think we just tell them that they can't highlight math.
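
Davide's suggestion (do not descend into tags that lie completely within the annotation) can be sketched roughly as below. This is only an illustration on a toy tree, not the Hypothesis client's actual highlighter: `TreeNode`, `wrapTargets`, `skipContained`, and the simplified `math`/`mi`/`mo`/`mn` tags are all made-up names, and offsets simply count characters of the flattened text.

```typescript
// Minimal stand-in for a DOM subtree. NOT the Hypothesis client's data model.
type TreeNode =
  | { kind: "text"; text: string }
  | { kind: "element"; tag: string; children: TreeNode[] };

const text = (s: string): TreeNode => ({ kind: "text", text: s });
const el = (tag: string, ...children: TreeNode[]): TreeNode =>
  ({ kind: "element", tag, children });

function textLength(node: TreeNode): number {
  if (node.kind === "text") return node.text.length;
  return node.children.reduce((sum, child) => sum + textLength(child), 0);
}

function label(node: TreeNode): string {
  return node.kind === "text" ? JSON.stringify(node.text) : `<${node.tag}>`;
}

// Lists the nodes a highlighter would wrap for the selection [start, end).
// skipContained = false: every overlapping text node is wrapped separately.
// skipContained = true: an element wholly inside the selection is wrapped as
// one unit and its children are left alone.
// (Partially selected text nodes are listed whole; a real highlighter would split them.)
function wrapTargets(
  root: TreeNode, start: number, end: number, skipContained: boolean,
): string[] {
  const targets: string[] = [];
  const visit = (node: TreeNode, offset: number, isRoot: boolean): void => {
    const nodeEnd = offset + textLength(node);
    if (nodeEnd <= start || offset >= end) return; // outside the selection
    const contained = offset >= start && nodeEnd <= end;
    if (node.kind === "text" || (skipContained && contained && !isRoot)) {
      targets.push(label(node));
      return;
    }
    let childOffset = offset;
    for (const child of node.children) {
      visit(child, childOffset, false);
      childOffset += textLength(child);
    }
  };
  visit(root, 0, true);
  return targets;
}

// "Solve x=1 now", with the math rendered as one element per symbol.
const paragraph = el(
  "p",
  text("Solve "),
  el("math", el("mi", text("x")), el("mo", text("=")), el("mn", text("1"))),
  text(" now"),
);

const perTextNode = wrapTargets(paragraph, 0, textLength(paragraph), false);
const perElement = wrapTargets(paragraph, 0, textLength(paragraph), true);
console.log(perTextNode.join(" | ")); // "Solve " | "x" | "=" | "1" | " now"
console.log(perElement.join(" | "));  // "Solve " | <math> | " now"
```

In the second strategy the highlight wraps the whole math container from the outside, so it would no longer sit inside the zero-height character boxes described above. Whether that works for selections that start or end partway through an expression is left open here.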

LMS007 commented 3 years ago

Much of this involves how we do text anchoring and selection. On the page discussed in this ticket, we can see that MathJax creates a <script> element next to each expression. Inside that element lives the clean (original) expression, which in theory can be used to reconstruct the math. But in addition to that expression, the DOM also contains styled Unicode characters nested in HTML (primarily <span> tags), rendered by MathJax.

The whole thing for something as simple as just an "x" looks like this

<span class="MathJax" id="MathJax-Element-8-Frame" tabindex="0" data-mathml="<math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;><mrow class=&quot;MJX-TeXAtom-ORD&quot;><mi mathvariant=&quot;bold&quot;>x</mi></mrow></math>" role="presentation" style="position: relative;">
   <nobr aria-hidden="true"><span class="math" id="MathJax-Span-72" style="width: 0.658em; display: inline-block;"><span style="display: inline-block; position: relative; width: 0.507em; height: 0px; font-size: 124%;"><span style="position: absolute; clip: rect(1.918em, 1000.51em, 2.674em, -999.997em); top: -2.518em; left: 0em;"><span class="mrow" id="MathJax-Span-73"><span class="texatom" id="MathJax-Span-74"><span class="mrow" id="MathJax-Span-75"><span class="mi" id="MathJax-Span-76" style="font-family: STIXGeneral; font-weight: bold;">x</span></span></span></span><span style="display: inline-block; width: 0px; height: 2.523em;"></span></span></span><span style="display: inline-block; overflow: hidden; vertical-align: -0.059em; border-left: 0px solid; width: 0px; height: 0.691em;"></span></span></nobr>
   <span class="MJX_Assistive_MathML" role="presentation">
      <math xmlns="http://www.w3.org/1998/Math/MathML">
         <mrow class="MJX-TeXAtom-ORD">
            <mi mathvariant="bold">x</mi>
         </mrow>
      </math>
   </span>
</span>
<script type="math/tex" id="MathJax-Element-8">\mathbf{x}</script>

The resulting textContent of the parent node ends up looking like this:

"xx\mathbf{x}"

This does not make much sense, because the hidden expression (which we do want) is merged with the visually styled rendered result, which is useless without the encapsulating inline styling. To be clear, we're not going to lift and shift HTML into the sidebar.
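
To make the concatenation concrete, here is a small sketch that mimics how textContent flattens a subtree. It uses a hand-built toy tree rather than the real DOM (the `TreeNode` type and the `el`/`text` helpers are invented for the example), trimmed down from the markup above and without the whitespace between tags.

```typescript
// Toy stand-in for DOM nodes; only what is needed to mimic textContent.
type TreeNode =
  | { kind: "text"; text: string }
  | { kind: "element"; tag: string; className: string; children: TreeNode[] };

const text = (s: string): TreeNode => ({ kind: "text", text: s });
const el = (tag: string, className: string, children: TreeNode[]): TreeNode =>
  ({ kind: "element", tag, className, children });

// Same rule as the DOM's textContent on an element: concatenate every
// descendant text node in document order. <script> content is included too.
function textContent(node: TreeNode): string {
  return node.kind === "text"
    ? node.text
    : node.children.map(textContent).join("");
}

const parent = el("p", "", [
  el("span", "MathJax", [
    // Visual rendering (the styled spans inside <nobr>).
    el("nobr", "", [el("span", "mi", [text("x")])]),
    // Assistive MathML copy for screen readers.
    el("span", "MJX_Assistive_MathML", [
      el("math", "", [el("mi", "", [text("x")])]),
    ]),
  ]),
  // Original TeX source kept by MathJax next to the rendered output.
  el("script", "", [text("\\mathbf{x}")]),
]);

console.log(textContent(parent)); // xx\mathbf{x}
```

The three copies of the same "x" (visual, assistive, and TeX source) come out back to back, which is exactly the string quoted above.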

Now let's look at something a bit more complex...

[screenshot: Screen Shot 2020-12-23 at 2.42.54 PM]

Part of the textContent looks like this: "output vector, f:ℝn↦ℝmf:Rn↦Rm\mathbf{f}: \mathbb{R}^n \mapsto \mathbb{R}^m, the matrix of all first-order"

And the script tag's content (for the expression) looks like this: \mathbf{f}: \mathbb{R}^n \mapsto \mathbb{R}^m

Where does the readable text start and end relative to the rendered expression's text? The only reliable way I can see to do this is to use the HTML structure, make assumptions about class names, and toss away all the text/Unicode characters inside anything with the .MathJax class. We would then save the raw expression from the <script> tag as the "thing" we want to re-render into an expression in the sidebar, alongside any captured plain text. But then we also have to save the correct offsets relative to the raw textContent so we can place the highlight in the correct spot again.

So in this example...

"output vector, f:ℝn↦ℝmf:Rn↦Rm\mathbf{f}: \mathbb{R}^n \mapsto \mathbb{R}^m, the matrix of all first-order"

We need to throw away the MathJax-rendered part (f:ℝn↦ℝmf:Rn↦Rm) for the sidebar markup, but keep it around for re-anchoring so that the offset counts stay correct. There may be other ways to accomplish this and we should discuss further, but what I am suggesting here is perhaps a 4th type of Anchor/Selector at least, or possibly major modifications to other similar places in our annotator.

Also, this would only work for MathJax, which makes it specific and fragile. I'm ignoring any larger, more general use cases.
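
A rough sketch of that idea, again on a toy tree rather than the real DOM: walk the nodes once, build the full text used for anchoring, and in parallel build a sidebar quote that drops anything under a .MathJax element and keeps the TeX from the math/tex script. Everything named here (`splitQuote`, `Segment`, the `rendered`/`tex` roles, and the `\(...\)` delimiters for the sidebar) is an assumption made for illustration, not an existing selector type.

```typescript
// Toy stand-in for DOM nodes; attrs holds plain HTML attributes.
type TreeNode =
  | { kind: "text"; text: string }
  | { kind: "element"; tag: string; attrs: Record<string, string | undefined>; children: TreeNode[] };

// One run of anchorText, tagged with where it came from.
type Segment = { start: number; end: number; role: "plain" | "rendered" | "tex" };

interface SplitResult {
  anchorText: string; // identical to textContent, so offsets stay valid
  quote: string;      // what the sidebar would show
  segments: Segment[];
}

const text = (s: string): TreeNode => ({ kind: "text", text: s });
const el = (tag: string, attrs: Record<string, string | undefined>, children: TreeNode[]): TreeNode =>
  ({ kind: "element", tag, attrs, children });

function flatten(node: TreeNode): string {
  return node.kind === "text" ? node.text : node.children.map(flatten).join("");
}

function splitQuote(root: TreeNode): SplitResult {
  let anchorText = "";
  let quote = "";
  const segments: Segment[] = [];

  const emit = (value: string, role: Segment["role"]): void => {
    if (value.length === 0) return;
    segments.push({ start: anchorText.length, end: anchorText.length + value.length, role });
    anchorText += value;                          // always kept for re-anchoring
    if (role === "plain") quote += value;         // ordinary prose
    if (role === "tex") quote += `\\(${value}\\)`; // assumed sidebar delimiters
    // role === "rendered": dropped from the quote on purpose
  };

  const visit = (node: TreeNode): void => {
    if (node.kind === "text") { emit(node.text, "plain"); return; }
    const classes = (node.attrs["class"] ?? "").split(/\s+/);
    if (classes.includes("MathJax")) { emit(flatten(node), "rendered"); return; }
    if (node.tag === "script" && node.attrs["type"] === "math/tex") {
      emit(flatten(node), "tex");
      return;
    }
    node.children.forEach((child) => visit(child));
  };

  visit(root);
  return { anchorText, quote, segments };
}

const sample = el("p", {}, [
  text("output vector, "),
  el("span", { class: "MathJax" }, [
    el("span", {}, [text("f:ℝn↦ℝm")]),
    el("span", { class: "MJX_Assistive_MathML" }, [text("f:Rn↦Rm")]),
  ]),
  el("script", { type: "math/tex" }, [text("\\mathbf{f}: \\mathbb{R}^n \\mapsto \\mathbb{R}^m")]),
  text(", the matrix of all first-order"),
]);

const result = splitQuote(sample);
console.log(result.quote);
// output vector, \(\mathbf{f}: \mathbb{R}^n \mapsto \mathbb{R}^m\), the matrix of all first-order
```

Because `anchorText` is untouched, existing offset-based anchoring could in principle keep working, while `segments` records which character ranges were hidden from the quote. As noted above, this leans entirely on MathJax-specific class names and script types.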

LMS007 commented 3 years ago

One more small fact worth mentioning here: type="math/tex" is ignored by the browser and treated as just a "data block", which simply means it does not get executed as JS (as far as I know). Our annotator also captures JS types such as type="text/javascript" and will highlight them just the same. This means that our sidebar annotation blockquotes will always contain the text inside a script tag if that tag is in the captured range. In the vast majority of cases, script tags do not sit in between content, so this is rarely a problem; but when they do, folks would be inadvertently capturing and quoting code that they would not otherwise see in the content.

So perhaps there are 2 classes of issues here:

  1. Don't capture script content if it is in fact js code.
  2. Have some special exceptions for data block types that we recognize, such as "math/tex", that we can re-render in the sidebar. This also means we have to rip out the original unstyled rendered HTML that MathJax (or similar) produced.
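
The two classes above could be told apart with a check along these lines. This is a simplified sketch: `classifyScript`, `ScriptPolicy`, and the allow-list are hypothetical names, the list of JavaScript types is not exhaustive, and the browser's real rules (for example the legacy language attribute, or newer non-data types such as import maps) are not modelled.

```typescript
// What a quote-capturing step might do with the text inside a <script> tag.
type ScriptPolicy = "exclude" | "render-math" | "unrecognized-data";

// Common JavaScript MIME types (not exhaustive).
const JAVASCRIPT_TYPES = new Set([
  "text/javascript",
  "application/javascript",
  "text/ecmascript",
  "application/ecmascript",
  "text/x-javascript",
  "application/x-javascript",
]);

// Hypothetical allow-list of data block types we know how to re-render.
const MATH_DATA_TYPES = new Set(["math/tex"]);

// typeAttr is the raw value of the type attribute, or null when it is absent.
function classifyScript(typeAttr: string | null): ScriptPolicy {
  const type = (typeAttr ?? "").trim().toLowerCase();
  // A missing or empty type, "module", or a JavaScript type means real code.
  if (type === "" || type === "module" || JAVASCRIPT_TYPES.has(type)) {
    return "exclude";
  }
  if (MATH_DATA_TYPES.has(type)) return "render-math";
  // Any other type is a data block that we have no special handling for.
  return "unrecognized-data";
}

console.log(classifyScript("text/javascript")); // exclude
console.log(classifyScript(null));              // exclude
console.log(classifyScript("math/tex"));        // render-math
console.log(classifyScript("text/template"));   // unrecognized-data
```

What to do with the "unrecognized-data" case (quote it as today, or drop it) is a separate decision that this sketch does not make.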

nise commented 2 years ago

Are there any plans to fix these problems?

robertknight commented 2 years ago

A lot of code in potentially relevant areas has changed since the issue was filed, so it will need re-evaluating to figure out which problems still exist and make sure the steps to reproduce are still valid. Nobody has been planning to do that as far as I know.