siefkenj / unified-latex

Utilities for parsing and manipulating LaTeX ASTs with the Unified.js framework
MIT License
85 stars 20 forks source link

feat: support embellishments and non-punctuation delimiters #42

Closed theseanl closed 9 months ago

theseanl commented 1 year ago

Previously, it couldn't attach arguments for macros whose argspec

This PR adds support for it. The second use-case may be marginal, I initially tried to make embellishments work, then realized that the same code would enable the second feature and unifies some codes.

Since a single embellishment argspec can represent multiple arguments, such as in e{^_} where each token ^ and _ would represents 2 arguments, gobbleSingleArguments signature was changed to return Ast.Arguments | Ast.Arguments[] | null.

Currently, this just puts embellishment tokens as openMark in Ast. Would it be preferable to have a new separate property for embellishments?

Next goal would be to support/test macro expansions for such arguments. Once it's done, I'd like to work on #24.

Please take a look and provide comments, and feel free to make your own changes!

siefkenj commented 1 year ago

Thanks for the contribution!

It looks like you have some tests failing...

Here are some thoughts:

  1. It would be really nice not to change the signature of gobbleSingleArgument, but I cannot think of a good way around it. One possibility is to change it to always return [] | null
  2. I think more of the existing infrastructure can be reused. The algorithm for e{x} should be very similar to tx m. I think you can reuse the algorithms for grabbing those arguments. A possible idea for e{xyz} is that you keep a record of whether x,y,z has been captured. If one of them has been, you can re-call the function with y,z (or whatever hasn't yet been captured). This could potentially result in a way to keep gobbleSingleArgument returning a singleton. If it always returned the first argument it could gobble, then gobbleArgument would know that on a type=="embellishment" argument, it would need to call until nothing was gobbled. Then then it could rearrange the argument list to appear in the order specified by the oringinal macro spec.
  3. It would be nice to keep e{^} consistent with the parsing of ^. Currently ^ parses as a macro with escapeToken == "^". I think it's reasonable to parse e{x} as
    {
                    "type": "argument",
                    "content": [
              {
                    "type": "macro",
                    "content": "^",
                    "escapeToken": "",
                    "args": [
                        {
                            "type": "argument",
                            "content": [
                                {
                                    "type": "string",
                                    "content": "2"
                                }
                            ],
                            "openMark": "{",
                            "closeMark": "}"
                        }
                    ]
                }
                ]
                openMark: "",
                closeMark: ""

That solution is not exactly elegant, though...

theseanl commented 1 year ago

Regarding re-using existing algorithm for t

I wasn't aware that optionalToken shares similar traits. However, the way I see it is the other way around: Parsing optionalToken should use the new infrastructure findBracePositions. A convoluted example, but the current algorithm for t can't correctly figure out arguments for the following:

\NewDocumentCommand{\asdf}{ t^ r__ }{\IfBooleanTF{#1}{123}{456} #2}
\asdf^_789_

It should be expanded to 123 789. The new findBracePositions algorithm was implemented basically to solve cases like this. I will fix this and include it as a test case.

Signature of gobbleSingleArgument

One idea I have thought was to artificially decompose e{^-_} into e{^} e{-} e{_}. However it seemed that signatures are intimately tied to the raw .tex source and ArgSpec is assumed to be a faithful representation of it, there is even a warning https://github.com/siefkenj/unified-latex/blob/f2b8c107aed4c7ae585ebbd832ef9085cee35a46/packages/unified-latex-util-argspec/README.md?plain=1#L30

A possible idea for e{xyz} is that you keep a record of whether x,y,z has been captured. If one of them has been, you can re-call the function with y,z (or whatever hasn't yet been captured). This could potentially result in a way to keep gobbleSingleArgument returning a singleton

It seems doable, but then the caller of gobbleSingleArgument should maintain some kind of state, about embellishment tokens we have matched until now, its index, and so on. gobbleSingleArgument will need a 4th argument startPosInStartPos, which would indicate the starting position inside the current node's content (if it is Ast.String node). Maintaining state would be done inside the body of gobbleArguments, but it doesn't look like the right place for that logic, because any serious logic for argspec is in gobbleSingleArgument function body...

If you prefer to have its return type Ast.Arguments[] | null, then I'll refactor that.

Tests

I guess it should be passing now. For some reason, tests in unified-latex-cli fail due to timeout on my machine, so I can't test it but the rest of tests were passing on my end.

Besides, I have made some tests for until argspecs skip, because they were failing on my end in the current HEAD.

siefkenj commented 1 year ago

I think the real issue with changing e{xy} to e{x} e{y} is that, I believe that e{xy} allows x/y to occur in either order, where e{x} e{y} would not.

The mutation warning is just that: a warning against mutating. However, if you create a new ArgSpec or make a deep copy and mutate the deep copy, it will be fine.

I'm leaning more towards making gobbleSingleArgument no return an array and moving some looping logic into gobbleArgument.

theseanl commented 1 year ago

I think the real issue with changing e{xy} to e{x} e{y} is that, I believe that e{xy} allows x/y to occur in either order, where e{x} e{y} would not.

I wasn't aware of that! Thank you for pointing it out. I've updated code to implement this behavior. Now given with this order-agnostic behavior of embellishment, moving the looping logic to gobbleArguments feels slightly more natural. However it's still annoying to pass around matched embellishment token and its index between gobbleArguments and gobbleSingleArgument, so I kept the signature in the end. The distinction of gobbleArguments and gobbleSingleArgument may be unnecessary.

Also I've fixed optionalToken and added a test for \asdf^_789_ as promised.

siefkenj commented 1 year ago

FYI, I just merged #45 which switches the tests to vitest. They might run successfully on your computer now :-)

theseanl commented 1 year ago

until also manifests the same issue as optionalToken, so I went on to fix that as well :) If you look at the code, it should be obvious where I am going. We will be able to handle another edge case: #46. How does it look? What's your thought on gobbleSingleArgument? Also could you elaborate more on escapeToken == "^" (if this is still an issue)?

siefkenj commented 1 year ago

Thanks for this! I left a few comments. Also, please run Prettier on all modified files.

I'd still like to move some of the complexity to gobbleArguments and have gobbleSingleArugment only return one thing rather than an array.

theseanl commented 1 year ago

I've added eslint and added curly rule which enforces statements in if ... else to be enclosed in a separate block. I couldn't find a prettier config, and I doubt it's being consistently used in codes in this repository, for example prettier seems to have some peculiar rules for trailing commas and it is not consistently enforced. I've tried to add one that matches the overall coding style of the repository. It revealed some obvious "error"s (in the sense of prettier's) whose fix was trivial.

Regarding refactoring gobbleSingleArgument, I think it's doable and I'll try to do it soon.

siefkenj commented 1 year ago

Thanks for the linter updates! Let me know when this is ready for another review :-)

theseanl commented 12 months ago

Refactored gobbleSingleArgument. I don't really like the additional complexity introduced by artificially separating logic into two places, but I guess I can live with it. Now gobbleArguments always create a structuredClone of the argSpec because gobbleSingleArgument may mutate it. It may as well check for spec.type === 'embellishment' and create a clone only in that case, but I wanted to make gobbleArgument agnostic to individual argspec types.

theseanl commented 12 months ago

Also I think the current tsconfig is somewhat faulty,

cd packages/unified-latex
npm run build

fails with TS errors which shouldn't be an error, and describe and expect isn't resolved in vscode.

siefkenj commented 12 months ago

Also I think the current tsconfig is somewhat faulty,

cd packages/unified-latex
npm run build

fails with TS errors which shouldn't be an error, and describe and expect isn't resolved in vscode.

Did you do a general build first? If you haven't built the dependencies, it will fail.

I have vitest/global listed in tsconfig.build.json. That is supposed to explain to typescript that describe and it, etc.. all exist. I don't know why it doesn't work...

theseanl commented 12 months ago

Turned out that doing a general build fixes the issue with npm run build. It initially failed, so I fixed some type errors in structured-clone that were causing the issue.

Regarding tsconfig, tsconfig.build.json explicitly excludes any test files, and any other tsconfig inherits from it, so IDE think test files are not part of the project and doesn't see vitest/global defined in tsconfig while trying to resolve types.

IMO, tsconfig.build.json should inherit from tsconfig.json, and excluding test files should be done only in tsconfig.build.json file. Each package directory should contain two tsconfig.build.json and tsconfig.json, and npm run build command should invoke tsc with tsc -b tsconfig.build.json. I'm not confident that I'll be able to change every relevant part, so it'd be nice if you could fix it :)

siefkenj commented 12 months ago

IMO, tsconfig.build.json should inherit from tsconfig.json, and excluding test files should be done only in tsconfig.build.json file. Each package directory should contain two tsconfig.build.json and tsconfig.json, and npm run build command should invoke tsc with tsc -b tsconfig.build.json. I'm not confident that I'll be able to change every relevant part, so it'd be nice if you could fix it :)

Could you do a separate PR that shows what you're talking about in a couple of the workspaces? The tsconfig business is really a pain and I've been trying to figure out how to get it right for a long time...

Alternatively, importing describe, it and friends in each test file will get rid of the typescript warnings.

siefkenj commented 12 months ago

I would prefer to recreate it on each call rather than mutate it.

siefkenj commented 12 months ago

Agreed. Can you just make it throw an error for now if it encounters a group and we can fix it in another PR

siefkenj commented 11 months ago

@theseanl Were you planning on finishing the last changes?

theseanl commented 11 months ago

I need to eventually get it done, because it is needed for my other works, but I've been rather busy and less motivated to refactor an already working code to meet a goalpost that keeps moving. Conceptual distinction of singleton and multiple objects feel artificial to me, I even think that such a tendency may be influenced by one's linguistic backgrounds.. I'd be ok if you take over and address the points you've mentioned in reviews. I'll try to find some time within a few weeks.

siefkenj commented 9 months ago

@theseanl Thanks for your hard work! I have fixed the last of the issues. I couldn't figure out how to get pushing to your branch working, so I created PR #57 You can either pull from that branch, or I can close this PR and merge #57 instead.

theseanl commented 9 months ago

Thank you for picking up the slack!