feat(parser): reimplement parser in `langium` instead of `jison`

Yokozuna59 commented 1 year ago

From this conversion. It would be great to reimplement the parser in langium instead of jison the following reasons:

The current jison package is unmaintained.
Langium supports LSP.
Langium has cleaner and easier syntax.
Langium has typescript integration with a fully typed AST.
Langium has better error reporting.

And much more!

I'm currently working to create a separate package using it here Yokozuna59/mermaid-lint. Anyone is welcome to contribute and help me! Even thoughts would be fine.

Here is an example I've built using Langium for pie charts (toggle the icon next to the link to show the AST). playground

Note: It's not finished; it's just an example.

related issues:

4
873
2523
3261
3694
3100
2566
2208
1810
3530

sidharthv96 commented 1 year ago

This is exciting @Yokozuna59 ! Once you have a working parser, can you create a PR like https://github.com/mermaid-js/mermaid/pull/3432, so that we can see how much the size would increase with Langium?

I'd love to see the existing parsers modernised!

Yokozuna59 commented 1 year ago

Once you have a working parser, can you create a PR like #3432, so that we can see how much the size would increase with Langium?

Sure! I'm currently planning to support the pie chart since it's the easiest one. But I think I should support the common syntax and configuration first: comments, directives, accessible titles and descriptions, etc.

I've got to improve the current comments and directives regexes since they aren't covering some weird cases; e.g., in #1645, if text has been wrapped inside {} it won't be skipped and shown as text instead.

classDiagram
    class Cat {
        +int age %% not skipped comment 
    }

Yokozuna59 commented 1 year ago

@sidharthv96 It doesn't look like Langium or Chevrotain are customizable enough in our use case.

I was trying to match the diagram title (and accessible title and description in the same way), so I wanted to match everything after title: and set the value of it as title, but it doesn't seem that there is a clean way using Langium or Chevrotain from this discussion.

sidharthv96 commented 1 year ago

I think I was able to do that in chevrotain.

https://github.com/mermaid-js/mermaid/pull/3432/files#diff-d7ef899ed988dee0393145afe276d4821390b2d40f4bac039bea0f129e4338afR102-R106

Yokozuna59 commented 1 year ago

I think I was able to do that in chevrotain.

https://github.com/mermaid-js/mermaid/pull/3432/files#diff-d7ef899ed988dee0393145afe276d4821390b2d40f4bac039bea0f129e4338afR102-R106

Although not in the same way, I was able to do it too.

I've finished accTitle and single and multi-line accDesc with their unit tests. But I haven't committed yet; some minor changes might be needed. Then I'm planning to support directives, which might take a lot of time because of the weird style and number of valid values.

Yokozuna59 commented 1 year ago

I think accTitle and accDescr are done with their test cases. (Some weird test cases that are not handled have .todo).

Can some verify I haven't forgotten anything? link

aloisklink commented 1 year ago

Then I'm planning to support directives, which might take a lot of time because of the weird style and number of valid values.

Hmmmm, this is kinda a hack, so sticking it langium is probably the ideal way of doing this, but if you're feeling lazy, I wonder if it's easier to do this a pre-proccess step, e.g. something like:

const diagram = `%%{init: { **insert argument here**}}%%
  graph TD
    A-->B`;
const directiveRegex = /%%(\{[\w\W]*\})%%/gmU; // use U-flag to make `*` lazy
const directives = diagram.matchAll(directiveRegex);
const diagramWithoutDirectives = diagram.replace(directiveRegex, "");

const parsedDiagram = yourLangiumParser.parse(diagramWithoutDirectives);

It seems like directives are more like a metadata thing that applies to all diagrams, rather than something part of each diagram, so it might make sense to split it away from the diagram parser :shrug:

Edit:

Just thought of something else, all mermaid diagram definitions support using a YAML front-matter before the diagram, e.g. like:

```mermaid
---
# some YAML here
title: "Hi"
---
pie
    "Rats" : 15



The way this is currently implemented is by doing a pre-processing step on all diagrams, so doing the same thing with directives wouldn't be too weird.

> Can some verify I haven't forgotten anything? [link](https://github.com/Yokozuna59/mermaid-lint/tree/master/packages/mermaid-parser/tests/common)

From a quick look, it looks great! Great work :+1:! The `it.todo` tests are probably the most important unfortunately, otherwise it would break existing diagrams.

sidharthv96 commented 1 year ago

I agree with @aloisklink, yaml and init directives can be handled externally for now as they aren't super important for our core usecase which is formatting and parsing the actual diagrams.

Yokozuna59 commented 1 year ago

Hmmmm, this is kinda a hack, so sticking it langium is probably the ideal way of doing this...

I can create a general directive parser (still not easy), but it won't be strictly typed. So it won't suggest anything if the user tries to write some directives, unlike the strictly which would suggest according to what the user has typed using LSP. For example:

...but if you're feeling lazy, I wonder if it's easier to do this a pre-proccess step, e.g. something like:

I tried tbh, but it would hard to preform with my experience with langium. It can be removed using hidden terminal rule:

hidden terminal DIRECTIVE: "%%{" -> "}%%\r?\n";

It seems like directives are more like a metadata thing that applies to all diagrams, rather than something part of each diagram, so it might make sense to split it away from the diagram parser shrug

Not really sure tbh, but I won't think about it for now and will focus on other stuff.

Just thought of something else, all mermaid diagram definitions support using a YAML front-matter before the diagram, e.g. like:
---
# some YAML here
title: "Hi"
---
pie
    "Rats" : 15
The way this is currently implemented is by doing a pre-processing step on all diagrams, so doing the same thing with directives wouldn't be too weird.

I think it won't be that hard to implement this since it's wrapped differently from comments (%%) and only appears at the top of the file, but I'm not really sure what possible values can be there since I couldn't find them in the docs.

From a quick look, it looks great! Great work +1! The it.todo tests are probably the most important unfortunately, otherwise it would break existing diagrams.

I was trying to create a general regex that make sure that there is no accTitle or accDescr before title but appears that it won't be easy. You can try it here and please inform if you could come up with one. https://regex101.com/r/rGuOBM/1

Yokozuna59 commented 1 year ago

I figured out the solution for those it.todo test cases; it appears that I was using Langium in the wrong way. However, there is a limitation on the Langium side that I couldn't solve completely. If this PR merged langium/langium#1067, then I would be able to continue working on the parser.

sidharthv96 commented 1 year ago

@Yokozuna59 will be developing this package in their repo, and we'll move it into mermaid once it's stable for production use. This will replace mermaid's current JISON parser once the team decides.

I wonder if there's any way to verify that the new parser is backwards compatible with the syntax supported by the old parser?

Yokozuna59 commented 1 year ago

I wonder if there's any way to verify that the new parser is backwards compatible with the syntax supported by the old parser?

I'm currently creating unit tests for common stuff, for example, title, accTitle, accDescr, etc. And will add unit tests for each diagram. Currently, only pie chart and the above-listed common stuff have test cases. And I add more test cases for each edge case and valid syntax while developing. I could take the test cases for the current parser and add them to mine to validate that they're working fine.

What do you suggest?

Yokozuna59 commented 1 year ago

Should I support multiline comments #1281 with it? I came across it while a ago. Please mention me for similar stuff if you could since there isn't a parser label in mermaid issues. I would suggest adding one if you can! Maybe adding other label like renderer and lsp would be helpful for other contributors.

sidharthv96 commented 1 year ago

@Yokozuna59, let's keep your implementation as close to the current parser as possible. We can add new features and fix bugs once we make the switch. Otherwise it'll be difficult to test.

Regarding testing & verification, unit tests and visual tests will definitely be required and helpful, but they won't cover all the usecases.

Two ideas I had (which seems a bit over engineered) were,

Fuzzing the current parsers to find many valid edge case unit tests, and then running that suite on the new parser.
Verifying syntax diagrams. But this requires syntax diagrams to be generated from the current jison files, which might be difficult.

Anything else?

sidharthv96 commented 1 year ago

If your pie implementation is ready, can you raise a PR like https://github.com/mermaid-js/mermaid/pull/3432, so we can test the size increase?

Yokozuna59 commented 1 year ago

@Yokozuna59, let's keep your implementation as close to the current parser as possible. We can add new features and fix bugs once we make the switch. Otherwise it'll be difficult to test.

Okay, I'll remove the multi line comment for now.

Regarding testing & verification, unit tests and visual tests will definitely be required and helpful, but they won't cover all the usecases.

Two ideas I had (which seems a bit over engineered) were,

Fuzzing the current parsers to find many valid edge case unit tests, and then running that suite on the new parser.

Verifying syntax diagrams. But this requires syntax diagrams to be generated from the current jison files, which might be difficult.

Anything else?

I'm not really sure how I can help with that. Before implementing the parser in langium, I tried looking at those jison files, but I wasn't able to understand much or get edge cases. If you point them out I would be more than happy to verify and test them.

Yokozuna59 commented 1 year ago

If your pie implementation is ready, can you raise a PR like #3432, so we can test the size increase?

The Langium PR haven't been merged yet, so the title and accessibilities would give wrong result:

- awesome title
+ title awesome title

There is an alternative way to implement it; thanks to @msujew for pointing it out, I'll implement it then create a PR.

sidharthv96 commented 1 year ago

@Yokozuna59 you can focus on langium for now. Don't worry about the verification part. I was sharing that to get input from the community.

Yokozuna59 commented 1 year ago

@sidharthv96 I have updated the above issue and this is the results for the current test cases:

Most common cases are handled, should I create a PR? Or figure out how to solve those issues? Some of them appears aren't from my side.

Yokozuna59 commented 1 year ago

Never mind, all of the above issues were solved in mermaid-parser@0.2.0.

I guess Pie Chart is now completely supported (without directives or yaml).

NicolasCwy commented 8 months ago

@sidharthv96 I would like to convert the Git Graph parser to use Langium, can I tag along this issue or should I create a new one? Perhaps we can add a checklist of the remaining parsers and make this an epic of sorts

sidharthv96 commented 8 months ago

@NicolasCwy you can start with a draft PR directly, no need to create an issue. :)

The flow what we've seen till now is usually

Stage 1

Convert DB to TypeScript
Optimise the DB functions to read from AST (if required)
Tweak any unit tests to fit the new DB.

Stage 2

Add langium parser in the parser package
Add a parser file in mermaid that interfaces with the parser package and populates the DB
Do not modify any visual tests.

Stage 1 and 2 can be separate PRs, so we can get stuff in easily.

Few reference PRs

4799
4839
4751
5247

mermaid-js / mermaid

feat(parser): reimplement parser in `langium` instead of `jison` #4401

4

873

2523

3261

3694

3100

2566

2208

1810

3530

Stage 1

Stage 2

4799

4839

4751

5247