siefkenj / unified-latex

Utilities for parsing and manipulating LaTeX ASTs with the Unified.js framework
MIT License

Comments after empty string do not have position #86

Closed pddg closed 6 months ago

pddg commented 6 months ago

Environment

Expected behavior

Comments should have positions.

Actual behavior

I use the following script to parse the TeX source.

#!/usr/bin/env node

import * as fs from "fs";
import { getParser } from "@unified-latex/unified-latex-util-parse";

// Read the TeX source as text and parse it with the default (full) parser.
const content = fs.readFileSync("image.tex", "utf8");
const parser = getParser();
const parsedAst = parser.parse(content);

console.log(JSON.stringify(parsedAst, undefined, "  "));

For the following input, comment1 and comment3 have no position in the parsed AST.

\begin{document}
% comment0
    % comment1
    abcd % comment2
    %comment3
% comment4
\end{document}
Parsed AST

```json
{
  "type": "root",
  "content": [
    {
      "type": "environment",
      "env": "document",
      "content": [
        {
          "type": "comment",
          "content": " comment0",
          "sameline": false,
          "leadingWhitespace": false,
          "position": {
            "start": { "offset": 16, "line": 1, "column": 17 },
            "end": { "offset": 28, "line": 3, "column": 1 }
          }
        },
        {
          "type": "comment",
          "content": " comment1",
          "sameline": false,
          "leadingWhitespace": false
        },
        {
          "type": "string",
          "content": "abcd",
          "position": {
            "start": { "offset": 47, "line": 4, "column": 5 },
            "end": { "offset": 51, "line": 4, "column": 9 }
          }
        },
        {
          "type": "comment",
          "content": " comment2",
          "sameline": true,
          "leadingWhitespace": true,
          "position": {
            "start": { "offset": 51, "line": 4, "column": 9 },
            "end": { "offset": 63, "line": 5, "column": 1 }
          }
        },
        {
          "type": "comment",
          "content": "comment3",
          "sameline": false,
          "leadingWhitespace": false
        },
        {
          "type": "comment",
          "content": " comment4",
          "sameline": false,
          "leadingWhitespace": false,
          "position": {
            "start": { "offset": 77, "line": 6, "column": 1 },
            "end": { "offset": 88, "line": 7, "column": 1 }
          }
        }
      ],
      "position": {
        "start": { "offset": 0, "line": 1, "column": 1 },
        "end": { "offset": 102, "line": 7, "column": 15 }
      }
    }
  ],
  "position": {
    "start": { "offset": 0, "line": 1, "column": 1 },
    "end": { "offset": 103, "line": 8, "column": 1 }
  }
}
```

Is this expected behavior?
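For anyone who wants to check their own tree, the nodes without a position can be listed mechanically. This is only a sketch: `findNodesWithoutPosition` is a hypothetical helper, not part of unified-latex, and the sample tree is a trimmed copy of the AST above. It only follows `content` arrays, so a real tree with macro arguments would need the walk extended.

```javascript
// Trimmed copy of the parsed AST from this issue (string node omitted for brevity).
const ast = {
  type: "root",
  content: [
    {
      type: "environment",
      env: "document",
      content: [
        {
          type: "comment",
          content: " comment0",
          position: {
            start: { offset: 16, line: 1, column: 17 },
            end: { offset: 28, line: 3, column: 1 },
          },
        },
        { type: "comment", content: " comment1" },
        {
          type: "comment",
          content: " comment2",
          position: {
            start: { offset: 51, line: 4, column: 9 },
            end: { offset: 63, line: 5, column: 1 },
          },
        },
        { type: "comment", content: "comment3" },
      ],
      position: {
        start: { offset: 0, line: 1, column: 1 },
        end: { offset: 102, line: 7, column: 15 },
      },
    },
  ],
  position: {
    start: { offset: 0, line: 1, column: 1 },
    end: { offset: 103, line: 8, column: 1 },
  },
};

// Hypothetical helper: depth-first walk that collects every node
// lacking a `position` property. Only `content` arrays are followed.
function findNodesWithoutPosition(node, found = []) {
  if (node === null || typeof node !== "object") {
    return found;
  }
  if (typeof node.type === "string" && node.position === undefined) {
    found.push(node);
  }
  if (Array.isArray(node.content)) {
    for (const child of node.content) {
      findNodesWithoutPosition(child, found);
    }
  }
  return found;
}

const missing = findNodesWithoutPosition(ast).map((n) => n.content);
console.log(missing); // [ ' comment1', 'comment3' ]
```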

siefkenj commented 6 months ago

This appears to be a bug. I'll look into it!

siefkenj commented 6 months ago

Apparently this was intentional: the comment nodes are edited after parsing, and their position information was deleted to signal that they are no longer the original nodes from the tree. (You'll notice that comment1 has leadingWhitespace: false in the parsed version when it actually does have leading whitespace.)
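To make that concrete, here is a rough sketch of the pattern, not the library's actual code. `editComment` is a hypothetical function and the position values are illustrative; it only shows how a post-processing step that edits a node might drop `position` to mark the result as modified.

```javascript
// Illustrative comment node as it might look straight out of the parser.
// The position values are made up for this sketch.
const parsedComment = {
  type: "comment",
  content: " comment1",
  sameline: false,
  leadingWhitespace: true,
  position: {
    start: { offset: 32, line: 3, column: 5 },
    end: { offset: 43, line: 4, column: 1 },
  },
};

// Hypothetical edit step: returns a modified copy of the node.
// When `keepPosition` is false, the position is dropped so that the
// result no longer claims to be the node written in the source.
function editComment(node, { keepPosition = false } = {}) {
  const { position, ...rest } = node;
  const edited = { ...rest, leadingWhitespace: false };
  if (keepPosition && position !== undefined) {
    edited.position = position;
  }
  return edited;
}

const withoutPosition = editComment(parsedComment);
const withPosition = editComment(parsedComment, { keepPosition: true });

console.log("position" in withoutPosition); // false
console.log("position" in withPosition); // true
```

The `keepPosition` variant corresponds to the alternative discussed in this thread: the node is still edited, but it stays traceable to its place in the source.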

I don't know if deleting the position is really needed, though. What is your use case?

If you use parseMinimal, no nodes are modified and you get the actual parse tree exactly as written in the source (additionally, no macro arguments are attached, etc., since it's a minimal parse). Is that the function you want to be using?

pddg commented 6 months ago

I maintain a plugin for LaTeX for textlint, a tool that primarily parses text and points out violations on a rule basis. This plugin converts LaTeX ASTs to the ASTs required by textlint.
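To show why the position matters for this conversion, here is a minimal sketch. It assumes textlint's node format as I understand it (`range` as 0-based offsets, `loc` with 1-based lines and 0-based columns) and unified-latex positions with 1-based lines and columns. `toTextlintLocation` is a hypothetical helper, not the plugin's real code.

```javascript
// Hypothetical helper: map a unified-latex style `position` to the
// `range` / `loc` pair that a textlint node carries.
// Returns null when the node has no position, which is exactly the
// case that cannot be converted.
function toTextlintLocation(node) {
  if (!node.position) {
    return null;
  }
  const { start, end } = node.position;
  return {
    range: [start.offset, end.offset],
    loc: {
      // Assumption: columns are 1-based on input and 0-based on output.
      start: { line: start.line, column: start.column - 1 },
      end: { line: end.line, column: end.column - 1 },
    },
  };
}

// The "abcd" string node from the AST in this issue.
const abcd = {
  type: "string",
  content: "abcd",
  position: {
    start: { offset: 47, line: 4, column: 5 },
    end: { offset: 51, line: 4, column: 9 },
  },
};

// The comment1 node from the AST in this issue, which has no position.
const comment1 = { type: "comment", content: " comment1" };

const abcdLocation = toTextlintLocation(abcd);
const comment1Location = toTextlintLocation(comment1);

console.log(abcdLocation.range); // [ 47, 51 ]
console.log(comment1Location); // null
```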

textlint needs to understand the semantics of structured text, such as bullets, headings, and formulas. With parseMinimal, a macro such as \item is not attached to its content: the \item macro and the string that follows it come out as separate nodes. We could do the same kind of post-processing in our own code that getParser().parse() does, but we would prefer to avoid it if possible.
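As a rough idea of the post-processing we would otherwise have to write ourselves, here is a simplified sketch. `attachItemContent` is a hypothetical function, and the node shapes are reduced to what the example needs; the real attachment logic in unified-latex handles far more cases.

```javascript
// Hypothetical post-processing step: given a flat list of sibling nodes,
// gather everything that follows each \item macro (up to the next \item)
// into an argument attached to that macro.
function attachItemContent(nodes) {
  const result = [];
  let currentItem = null;
  for (const node of nodes) {
    if (node.type === "macro" && node.content === "item") {
      // Start a new item with one empty argument to collect its content.
      currentItem = {
        ...node,
        args: [{ type: "argument", content: [], openMark: "", closeMark: "" }],
      };
      result.push(currentItem);
    } else if (currentItem) {
      currentItem.args[0].content.push(node);
    } else {
      // Nodes before the first \item are passed through unchanged.
      result.push(node);
    }
  }
  return result;
}

// Flat node list, roughly what a minimal parse of
// "\item first \item second" might contain.
const flatNodes = [
  { type: "macro", content: "item" },
  { type: "whitespace" },
  { type: "string", content: "first" },
  { type: "whitespace" },
  { type: "macro", content: "item" },
  { type: "whitespace" },
  { type: "string", content: "second" },
];

const items = attachItemContent(flatNodes);
console.log(items.length); // 2
console.log(items[0].args[0].content.map((n) => n.type));
// [ 'whitespace', 'string', 'whitespace' ]
```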

siefkenj commented 6 months ago

Okay. I don't see a great reason to delete the position info, so I will stop deleting it in v1.7.0.

Fixed in #87