siefkenj / unified-latex

Utilities for parsing and manipulating LaTeX ASTs with the Unified.js framework
MIT License

Method to leave `math` and `inlineMath` sections as is #95

Closed (dmca-glasgow closed 3 months ago)

dmca-glasgow commented 4 months ago

Hi there, I've been messing with this package for a few days, trying to get my head round it!

So far, I've been able to achieve more of my objectives with this package than anything else I've tried, so thank you!

I should start by saying I'm new to LaTeX, I'm a TypeScript dev who has been tasked with making accessible HTML versions of some LaTeX documents. I have already made a Markdown to accessible HTML tool, so what I'm trying to achieve here is to convert LaTeX to Markdown, then feed it through my existing pipeline. I like converting LaTeX to Markdown as it strips presentation commands from the content in a way that's sort of straightforward to explain to mathematicians 😂. What I'm aiming for is essentially Markdown with embedded TeX.

Now to the problem I'm stuck on...

Given this LaTeX to Markdown processor:

import { unified } from 'unified';
import { unifiedLatexFromString } from '@unified-latex/unified-latex-util-parse';
import { unifiedLatexToHast } from '@unified-latex/unified-latex-to-hast';
import rehypeRemark from 'rehype-remark';
import remarkMath from 'remark-math';
import remarkStringify from 'remark-stringify';
import type { State } from 'hast-util-to-mdast';
import type { Element, Text } from 'hast';
import type { InlineMath } from 'mdast-util-math';

const file = await unified()
  .use(unifiedLatexFromString)
  .use(unifiedLatexToHast)
  .use(rehypeRemark, {
    handlers: {
      span: customSpanHandler
    },
  })
  .use(remarkMath)
  .use(remarkStringify)
  .process(`
    \\documentclass{article}

    \\begin{document}

    Hello $ C_L $ you.

    \\end{document}
  `);

function customSpanHandler(state: State, node: Element) {
  const { className } = node.properties;

  if (Array.isArray(className)) {
    if (className.includes('inline-math')) {
      const math = node.children[0] as Text;

      const result: InlineMath = {
        type: 'inlineMath',
        value: math.value,
      };

      state.patch(node, result);
      return result;
    }
  }

  return state.all(node);
}

console.log(String(file))
// Hello $C_{L}$ you.

You'll see I've fixed a problem already, where special characters like _ are being escaped by remark-stringify (I can create a PR for this if you are interested).

However you'll notice the curly braces have been added to the inline-math node. I've checked this with MathJax and it seems quite happy with it, but is there a way to keep the math nodes as "verbatim" as possible?

Thanks.

siefkenj commented 4 months ago

unified-latex parses to an AST. Since C_L and C_{L} produce the same AST, they are indistinguishable to unified-latex. If you really want to keep the original string, you can look at the .position attribute on math nodes and fetch the corresponding slice of the original source. You can then turn that slice into a unified-latex string node, which won't be touched during further processing. (You probably don't want to do this, though, unless you really trust the authors of the original source. unifiedLatexToHast prepares math strings for KaTeX, which is very similar to MathJax. Many people's TeX code won't actually render in MathJax, and you need to do preprocessing to get it ready for MathJax.)

I don't have an editor in front of me right now, but something like this (again, I don't promise this code works; it should give you an idea though.):

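import { visit } from '@unified-latex/unified-latex-util-visit';
import { match } from '@unified-latex/unified-latex-util-match';

const source = `
    \\documentclass{article}

    \\begin{document}

    Hello $ C_L $ you.

    \\end{document}
  `;
const file = await unified()
  .use(unifiedLatexFromString)
  .use(function () {
    return (tree) => {
      visit(tree, (node) => {
        // Replace the parsed contents of each math node with its original source text.
        // Untested: if the position span includes the math delimiters, trim them first.
        if (match.math(node) && node.position) {
          node.content = [{
            type: "string",
            content: source.slice(node.position.start.offset, node.position.end.offset)
          }]
        }
      })
    }
  })
  .use(unifiedLatexToHast)
  .use(rehypeRemark, {
    handlers: {
      span: customSpanHandler
    },
  })
  .use(remarkMath)
  .use(remarkStringify)
  .process(source);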
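
The slicing step above can be tried in isolation. The following is a minimal sketch, assuming the node's position span may include the math delimiters (worth checking against real parser output); `stripMathDelimiters` and `originalMath` are hypothetical helper names and are not part of unified-latex.

```typescript
// Hypothetical helpers (not part of unified-latex): recover the author's
// original math text from a source string and a start/end offset pair.

// Longer delimiters come first so that "$$" is tried before "$".
const DELIMITERS: Array<[string, string]> = [
  ['$$', '$$'],
  ['\\[', '\\]'],
  ['\\(', '\\)'],
  ['$', '$'],
];

// Remove one matching pair of math delimiters, if present.
function stripMathDelimiters(raw: string): string {
  for (const [open, close] of DELIMITERS) {
    const longEnough = raw.length >= open.length + close.length;
    if (longEnough && raw.startsWith(open) && raw.endsWith(close)) {
      return raw.slice(open.length, raw.length - close.length);
    }
  }
  return raw; // no recognised delimiters: return the slice unchanged
}

// Slice the source by offsets, drop delimiters, and trim outer whitespace.
function originalMath(source: string, start: number, end: number): string {
  return stripMathDelimiters(source.slice(start, end)).trim();
}

const example = 'Hello $ C_L $ you.';
const start = example.indexOf('$');
const end = example.lastIndexOf('$') + 1;
console.log(originalMath(example, start, end));
```

Trimming the outer whitespace is a design choice here; drop the `.trim()` call if the spacing inside the delimiters must be preserved exactly.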

dmca-glasgow commented 3 months ago

Thanks, I'll experiment with this!

I am expanding macros (and I intend to roll my own plugin to expand \DeclareMathOperator as well), so I assume this won't work, but I will have a play to find out the true limitations.

I'm struggling to find the right keywords to search for here.. why are the curly braces added? Are they macro parameters? Why does the original maths work without them?

Just to explain why I care: my colleagues and I have talked to visually impaired people who studied or are studying subjects with maths coursework, and found it's common that they prefer to just get hold of the LaTeX source files, because they are compatible with a screen reader or other assistive device. So in my accessible HTML I'm planning to render the "spatial" maths with MathJax, but have an option to swap that out for inline code/fenced code blocks with syntax-highlighted LaTeX, if the user prefers. So in this case, the LaTeX will be visible to the consumer, which I realise is unusual!

siefkenj commented 3 months ago

The brackets are added during printing. You can look at the printRaw function which does printing without any formatting.

The reason the brackets are added is so that the resulting code is always valid. For instance, consider $_\text a$. This is interpreted by LaTeX the same as $_{\text{a}}$. However, if we do something like \mathrm \text a, this is interpreted as \mathrm{\text}a, which is invalid code. In actual TeX, _ and ^ and a named macro are all treated differently from each other. In unified-latex, they are all treated as macros (though ^ and _ do not have the \ escape character). The moral in the end is that it is very hard to tell when you can omit the curly braces and still have valid LaTeX. However, you can always include the braces and have valid LaTeX.

I can see how the extra braces might be annoying for some blind users to read, but you might decide it is worth it. For example x_10 is actually x_{1} 0, not x_{10}. Many authors write things like x_ab (or, the one that I HATE is when they write \frac12 instead of \frac{1}{2}). And, even being experienced myself, I have misinterpreted LaTeX source before!

So, like I said, if you trust the authors, you can accept their source, but it might be better for your users to have consistent source over what authors tend to write.
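
The x_10 and \frac12 examples above can be reproduced with a small toy. This is only a sketch of the "an unbraced argument is a single token" rule and is not how unified-latex prints: `tokenize`, `braceArguments`, and the `ARITY` table are hypothetical names invented for illustration, nested groups are left untouched, and a control sequence in argument position (the subtle \text case described above) is deliberately left unbraced.

```typescript
// Toy illustration only (NOT unified-latex's printer).
// Hypothetical arity table: how many arguments each "macro" takes.
const ARITY = new Map<string, number>([
  ['_', 1],
  ['^', 1],
  ['\\frac', 2],
]);

// Split into control sequences, balanced {...} groups, and single characters.
function tokenize(src: string): string[] {
  const tokens: string[] = [];
  let i = 0;
  while (i < src.length) {
    const ch = src.charAt(i);
    let j = i + 1;
    if (ch === '\\') {
      // Control word (letters) or control symbol (one character).
      while (j < src.length && /[a-zA-Z]/.test(src.charAt(j))) j++;
      if (j === i + 1) j = Math.min(j + 1, src.length);
    } else if (ch === '{') {
      // Scan to the matching close brace, skipping escaped characters.
      let depth = 1;
      while (j < src.length && depth > 0) {
        const c = src.charAt(j);
        if (c === '\\') j++;
        else if (c === '{') depth++;
        else if (c === '}') depth--;
        j++;
      }
      j = Math.min(j, src.length);
    }
    tokens.push(src.slice(i, j));
    i = j;
  }
  return tokens;
}

// Wrap every unbraced single-character argument in braces.
function braceArguments(src: string): string {
  let out = '';
  let pending = 0; // arguments still owed to the most recent macro
  for (const tok of tokenize(src)) {
    if (pending > 0 && tok.trim() === '') continue; // skip spaces before an argument
    if (pending > 0 && !tok.startsWith('\\')) {
      out += tok.startsWith('{') ? tok : `{${tok}}`;
      pending--;
    } else {
      // Ordinary token, or a control sequence in argument position (left as-is).
      out += tok;
      pending = ARITY.get(tok) ?? 0;
    }
  }
  return out;
}

console.log(braceArguments('x_10'));     // x_{1}0
console.log(braceArguments('\\frac12')); // \frac{1}{2}
console.log(braceArguments('x_{10}'));   // x_{10}
```

The first output makes the misreading visible: only the 1 lands in the subscript, which is exactly what a reader of the unbraced source has to work out in their head.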

siefkenj commented 3 months ago

> I am expanding macros (and I intend to roll my own plugin to expand \DeclareMathOperator as well), so I assume this won't work, but I will have a play to find out the true limitations.

Look in unified-latex/unified-latex for an example of expanding declared commands. Most of the framework should be there for you.

dmca-glasgow commented 3 months ago

Thanks @siefkenj for the detailed answers!

> The moral in the end is that it is very hard to tell when you can omit the curly braces and still have valid LaTeX. However, you can always include the braces and have valid LaTeX.

That's great to know, so writing without braces is more of a shorthand. I'll leave the braces in that case, as the braced form is probably easier for someone learning: it's more consistent, logical and straightforward. If coursework authors complain, I'll see what can be done in the printRaw function.

I'll close this issue 👍

> Look in unified-latex/unified-latex for an example of expanding declared commands. Most of the framework should be there for you.

Yes, I have already found it and I'm using it. I'm going to duplicate that functionality for \DeclareMathOperator today!

Would you like a PR prepared for unified-latex-to-mdast which fixes the incorrect escaping of math in Markdown?

I have also patched something else. It may be less clear-cut as a PR, but just in case it helps...

This test explains what I'm trying to do:

test('latex to markdown with sidenotes', async () => {
  const md = await processLatex(`
    Some \\textbf{bold} text.

    \\begin{framed}
    My content \\sidenote{and \\textbf{sidenote}}.
    \\end{framed}
  `);

  const expected = unindentString(`
    Some **bold** text.

    My content :sidenote[and **sidenote**].
  `);

  expect(md).toBe(expected);
});

I found that to accomplish the bold text inside the sidenote I needed to change this line to:

node.content === 'sidenote'
  ? (node.args || [])[0].content.map(toHast).flat()
  : (node.args || []).map(toHast).flat()

I am happy to create a PR for this too if you like, or please tell me if what I'm doing is ill-advised!

dmca-glasgow commented 3 months ago

No longer needed (at least for now)

siefkenj commented 3 months ago

Any PRs welcome :-). \DeclareMathOperator especially :-)