siefkenj / unified-latex

Utilities for parsing and manipulating LaTeX ASTs with the Unified.js framework
MIT License

Index tokens ('^', '_') are parsed as string #114

Open slava-arapov opened 2 months ago

slava-arapov commented 2 months ago

Hi Jason.

Thanks for your parsing utilities. We are building a LaTeX formula editor, and your work helps us a lot in development.

Unfortunately, I can't find a solution to one problem.

The problem

Superscript and subscript tokens (^, _) are often used in mathematical expressions. They are recognized correctly at the top level of math mode, but when ^ and _ appear inside a group or at a deeper nesting level (an index of an index of an index), the parser treats them as text.

It seems that math mode stops being inherited inside a group.

I generated some ASTs in the Playground. Here are some examples:

  1. $a_{b}$ No groups. Works correctly:

    {
      "type": "root",
      "content": [
        {
          "type": "inlinemath",
          "content": [
            {
              "type": "string",
              "content": "a"
            },
            {
              "type": "macro",
              "content": "_",
              "escapeToken": "",
              "args": [
                {
                  "type": "argument",
                  "content": [
                    {
                      "type": "string",
                      "content": "b"
                    }
                  ],
                  "openMark": "{",
                  "closeMark": "}"
                }
              ]
            }
          ]
        }
      ]
    }
  2. ${a_b}$ Wrapped in a group. a_b is parsed as a string:

    {
      "type": "root",
      "content": [
        {
          "type": "inlinemath",
          "content": [
            {
              "type": "group",
              "content": [
                {
                  "type": "string",
                  "content": "a_b"
                }
              ]
            }
          ]
        }
      ]
    }
  3. $a_{b_{c}}$ The first level is fine, but b_ inside the subscript argument is parsed as a string, which is acceptable under the default _ macro settings:

    {
      "type": "root",
      "content": [
        {
          "type": "inlinemath",
          "content": [
            {
              "type": "string",
              "content": "a"
            },
            {
              "type": "macro",
              "content": "_",
              "escapeToken": "",
              "args": [
                {
                  "type": "argument",
                  "content": [
                    {
                      "type": "string",
                      "content": "b_"
                    },
                    {
                      "type": "group",
                      "content": [
                        {
                          "type": "string",
                          "content": "c"
                        }
                      ]
                    }
                  ],
                  "openMark": "{",
                  "closeMark": "}"
                }
              ]
            }
          ]
        }
      ]
    }
  4. ${a_{b_{c}}}$ The whole expression is wrapped in a group. a_ and b_ are parsed as strings:

    {
      "type": "root",
      "content": [
        {
          "type": "inlinemath",
          "content": [
            {
              "type": "group",
              "content": [
                {
                  "type": "string",
                  "content": "a_"
                },
                {
                  "type": "group",
                  "content": [
                    {
                      "type": "string",
                      "content": "b_"
                    },
                    {
                      "type": "group",
                      "content": [
                        {
                          "type": "string",
                          "content": "c"
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      ]
    }
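
The symptom in the four examples can be checked mechanically: any string node under math mode that still contains a raw ^ or _ is a token the parser failed to turn into a macro. Below is a minimal sketch of such a check. It works on plain JSON ASTs shaped like the ones above; `findUnparsedIndexTokens` is a hypothetical helper written for this illustration, not a unified-latex function, and it assumes math mode only starts at "inlinemath" nodes (pass `true` as the second argument for an AST parsed with `mode: 'math'`).

```javascript
// Hypothetical helper, not part of unified-latex. Node shapes
// ({ type, content, args }) are copied from the ASTs in this issue.
function findUnparsedIndexTokens(node, inMath = false, found = []) {
  if (Array.isArray(node)) {
    for (const child of node) findUnparsedIndexTokens(child, inMath, found);
    return found;
  }
  if (node === null || typeof node !== "object") return found;

  // Math mode starts at an "inlinemath" node and is inherited by everything below it.
  const mathHere = inMath || node.type === "inlinemath";
  if (mathHere && node.type === "string" && /[\^_]/.test(node.content)) {
    found.push(node.content);
  }
  // Descend into regular children and into macro arguments.
  if (Array.isArray(node.content)) findUnparsedIndexTokens(node.content, mathHere, found);
  if (Array.isArray(node.args)) findUnparsedIndexTokens(node.args, mathHere, found);
  return found;
}

// Example 2 from above: ${a_b}$
const wrappedInGroup = {
  type: "root",
  content: [
    {
      type: "inlinemath",
      content: [{ type: "group", content: [{ type: "string", content: "a_b" }] }],
    },
  ],
};
console.log(findUnparsedIndexTokens(wrappedInGroup)); // logs [ 'a_b' ]
```

An empty result means every index token in math mode became a macro, as in example 1.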

What I tried

In my project I tried to redefine macros this way:

... 
const processor = processLatexViaUnified({
  mode: 'math',
  macros: {
    '^': {
      renderInfo: {
        inMathMode: true,
      },
      signature: 'm',
      escapeToken: '',
    },
    '_': {
      renderInfo: {
        inMathMode: true,
      },
      signature: 'm',
      escapeToken: '',
    },
  },
});

const latexAst = processor.parse(latexString);
...

This handles the $a_{b_{c}}$ case with 3 levels, but not the $a_{b_{c_d}}$ case with 4 or more levels:

{
 "type": "root",
 "content": [
  {
   "type": "group",
   "content": [
    {
     "type": "string",
     "content": "a",
     "position": {
      "start": {
       "offset": 1,
       "line": 1,
       "column": 2
      },
      "end": {
       "offset": 2,
       "line": 1,
       "column": 3
      }
     }
    }
   ],
   "position": {
    "start": {
     "offset": 0,
     "line": 1,
     "column": 1
    },
    "end": {
     "offset": 3,
     "line": 1,
     "column": 4
    }
   }
  },
  {
   "type": "macro",
   "content": "_",
   "escapeToken": "",
   "position": {
    "start": {
     "offset": 3,
     "line": 1,
     "column": 4
    },
    "end": {
     "offset": 4,
     "line": 1,
     "column": 5
    }
   },
   "_renderInfo": {
    "inMathMode": true
   },
   "args": [
    {
     "type": "argument",
     "content": [
      {
       "type": "group",
       "content": [
        {
         "type": "string",
         "content": "b",
         "position": {
          "start": {
           "offset": 1,
           "line": 1,
           "column": 2
          },
          "end": {
           "offset": 2,
           "line": 1,
           "column": 3
          }
         }
        }
       ],
       "position": {
        "start": {
         "offset": 0,
         "line": 1,
         "column": 1
        },
        "end": {
         "offset": 3,
         "line": 1,
         "column": 4
        }
       }
      },
      {
       "type": "macro",
       "content": "_",
       "escapeToken": "",
       "position": {
        "start": {
         "offset": 3,
         "line": 1,
         "column": 4
        },
        "end": {
         "offset": 4,
         "line": 1,
         "column": 5
        }
       },
       "_renderInfo": {
        "inMathMode": true
       },
       "args": [
        {
         "type": "argument",
         "content": [
          {
           "type": "group",
           "content": [
            {
             "type": "string",
             "content": "c",
             "position": {
              "start": {
               "offset": 6,
               "line": 1,
               "column": 7
              },
              "end": {
               "offset": 7,
               "line": 1,
               "column": 8
              }
             }
            }
           ],
           "position": {
            "start": {
             "offset": 5,
             "line": 1,
             "column": 6
            },
            "end": {
             "offset": 8,
             "line": 1,
             "column": 9
            }
           }
          },
          {
           "type": "string",
           "content": "_d",
           "position": {
            "start": {
             "offset": 8,
             "line": 1,
             "column": 9
            },
            "end": {
             "offset": 10,
             "line": 1,
             "column": 11
            }
           }
          }
         ],
         "openMark": "{",
         "closeMark": "}"
        }
       ]
      }
     ],
     "openMark": "{",
     "closeMark": "}"
    }
   ]
  }
 ],
 "_renderInfo": {
  "inMathMode": true
 }
}

It also doesn't fix a group-wrapped expression such as ${a_b}$:

{
 "type": "root",
 "content": [
  {
   "type": "group",
   "content": [
    {
     "type": "string",
     "content": "a_b",
     "position": {
      "start": {
       "offset": 1,
       "line": 1,
       "column": 2
      },
      "end": {
       "offset": 4,
       "line": 1,
       "column": 5
      }
     }
    }
   ],
   "position": {
    "start": {
     "offset": 0,
     "line": 1,
     "column": 1
    },
    "end": {
     "offset": 5,
     "line": 1,
     "column": 6
    }
   }
  }
 ],
 "_renderInfo": {
  "inMathMode": true
 }
}

Questions

Is this behavior expected?

Are there any options or workarounds to parse ^ and _ as macros in these cases?

Thank you in advance.

siefkenj commented 2 months ago

The expected behavior is that inside math, _{} is parsed as a macro and not a string, so this sounds like a bug. It will be a little while before I have time to investigate this further.

siefkenj commented 2 months ago

This issue goes deeper than I thought. The information about whether to parse in a math environment or a regular environment isn't propagated through to groups. Getting this to work correctly will require a substantial rework of the parsing algorithm.
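
Until the parser propagates math mode into groups, one possible stopgap is a post-processing pass over an AST that is already known to be math. The sketch below is only an illustration under stated assumptions: `reparseIndexTokens` and `isIndexToken` are hypothetical names, not unified-latex APIs; node shapes are copied from the ASTs in this issue; the empty `openMark`/`closeMark` for an unbraced argument is an assumption; and `position` data on split strings is dropped. It also does not treat whitespace specially.

```javascript
// Hypothetical post-processing pass, not part of unified-latex.
const isIndexToken = (node) =>
  node.type === "string" && (node.content === "^" || node.content === "_");

function reparseIndexTokens(nodes) {
  // Pass 1: recurse into children and split strings around "^" and "_".
  const queue = [];
  for (const node of nodes) {
    if (node.type === "string") {
      for (const piece of node.content.split(/([\^_])/)) {
        if (piece !== "") queue.push({ type: "string", content: piece });
      }
    } else {
      const copy = { ...node };
      if (Array.isArray(copy.content)) copy.content = reparseIndexTokens(copy.content);
      if (Array.isArray(copy.args)) copy.args = reparseIndexTokens(copy.args);
      queue.push(copy);
    }
  }

  // Pass 2: turn each bare index token into a macro and attach what follows it.
  const out = [];
  while (queue.length > 0) {
    const node = queue.shift();
    if (!isIndexToken(node)) {
      out.push(node);
      continue;
    }
    const macro = { type: "macro", content: node.content, escapeToken: "" };
    const next = queue[0];
    if (next !== undefined && !isIndexToken(next)) {
      let content;
      let marks = ["", ""]; // assumption: unbraced arguments carry empty marks
      if (next.type === "group") {
        // Braced argument: the group's children become the argument.
        queue.shift();
        content = next.content;
        marks = ["{", "}"];
      } else if (next.type === "string" && next.content.length > 1) {
        // Unbraced text: only the first character is the argument.
        content = [{ type: "string", content: next.content[0] }];
        next.content = next.content.slice(1);
      } else {
        // Any other single node is taken whole.
        queue.shift();
        content = [next];
      }
      macro.args = [{ type: "argument", content, openMark: marks[0], closeMark: marks[1] }];
    }
    out.push(macro);
  }
  return out;
}

// The group-wrapped case from this issue: {a_b}
const fixed = reparseIndexTokens([
  { type: "group", content: [{ type: "string", content: "a_b" }] },
]);
console.log(JSON.stringify(fixed, null, 1));
```

Because the pass recurses before it rewrites siblings, nesting depth does not matter, so a case like a_{b_{c_d}} is handled the same way as {a_b}. It should only be applied to content known to be in math mode, since in text mode ^ and _ are not index tokens.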