[Discussion] Advancement

Hello Tom,

Sorry for barging in, but I didn’t really find another way to get news on that front (at least org-mode ML didn’t bring any results, and to be honest I’d rather keep it discreet).

Last time we interacted, I remember that you proposed to try to make a converter from your parser to a tree-sitter grammar, and it looked like the best way to go, since it would take me literal months (that I don’t have) to try to follow the changes in your parser.

Current state of the parser

I see that the scanning/lexing part (that’s how I see the raw parser.rkt file) is a lot leaner that last time I tried to translate it to TS. Do you think it’s going to be enough to grasp all the elements we’d like before doing actual parsing ? I’d be happy if it were the case, I really don’t know. (I’m currently reading the beginning of Crafting Interpreters hoping to have "aha" moments and have better understanding of what I’m trying)

Current state of the tests

Simple : I saw that you added tests in the repo. What kind of "expected results" format do you think would be the best to formalize/centralize the spec-as-tests you’re currently building ?

Feasability of tree-sitter-org based on laundry

Even if I’m not the one ending up doing a TS parser (looking at what TS support for Markdown is, I actually have shivers thinking about the work to be done), I think this approach of

generating the scanner/lexer from laundry
manually translate parsing logic into TS API (i.e. inside bool tree_sitter_org_external_scanner_scan(void *, TSLexer*, const bool*))
generate a corpus (the test format for TS) from a central/common test cases.

is currently the safest way to go. But if you have any comments on that I’d be glad to hear them.

At least if that’s the case, my next step would "only" be to translate your tests into a full sized TS test/corpus, which is both less daunting to me, and probably the best usage of my time I can do to try to get org-mode support in TS.

Have a nice day

A quick summary of my findings since the last time we discussed this (more to come later)

The grammar is written on the master branch is a nightmare to maintain, and is fundamentally ambiguous for pretty much the exact reason mentioned in the https://tree-sitter.github.io/tree-sitter/creating-parsers#writing-the-grammar section of the TS notes.
The top level grammar has been simplified and most of the complexity has moved into the lexer in order to remove ambiguity from the grammar. I think that in theory some of the ambiguity could be managed with a peg parser, but tree sitter does not support peg grammars.
Nested lexers and parsers for each top level element have been added. In some cases it is possible produce a single stream of tokens using nested lexers as part of an external scanner. Org will definitely need an external scanner. However there are a number of edge cases where a grammar has to interpose between two lexing steps. In some cases someone might find a clever way around the issue, but in others it is just bad news and a pain in the butt (to the point where I am considering proposing changes to the org syntax to avoid certain pain points).
There are two lexers implemented now, one for expansion and one for coloring and I'm trying to reuse as much machinery between the two as possible, but the coloring lexer has slightly different restrictions. The experience here tells me that the external scanner is going to be critical for a TS implementation.
I think that the work I have been doing in lex-abbrev.rkt can help because it establishes many of the regex patterns that are needed.
At some point I'm going to write an experience report on this which should make it easier to go about creating the TS lexer and grammar.
I've only just gotten to the point in the grammar where I am starting to find the deep edge cases, ambiguities, irregularities, etc in the interactions between different parts of the grammar (particularly for org objects in paragraphs).
Overall I think this is good news for writing a tree sitter grammar because we can move most of the complexity into an external scanner.
The high level grammar that is based on tokenizing whole lines is something that could go into a tree sitter right now, however if you were to do that there is a good chance that the architecture would end up being hard to extend to include the reparse of the nested elements. Given this and the fact that there are a bunch of gnarly issues that need to be worked out due to inconsistencies within the elisp implementation, I think it probably makes sense to hold off on TS work for the time being, because I expect it will be significantly easier in the future after we have work through the known issues.

With regard to the format for the expected output for tests. I use laundry's intermediate representation in partially expanded s-expression form as the target right now. However, it is still evolving so some names could change, though it is slowly approaching stability. I think that talking about org parsing semantics in terms of the s-expression form that they should parse to is a reasonably good idea idea. It also happens to be a format that is easy for this implementation to test against. An alternative would be to test via a round-trip. But that is something that tree sitter is unlikely to be able to do easily. I'll have to look into what might be possible for TS.

2. The top level grammar has been simplified and most of the complexity has moved into the lexer in order to remove ambiguity from the grammar.

Cool, that will allow to put more focus/stress on the hard part, which is definitely going to be that external scanner.

However there are a number of edge cases where a grammar has to interpose between two lexing steps. In some cases someone might find a clever way around the issue, but in others it is just bad news and a pain in the butt (to the point where I am considering proposing changes to the org syntax to avoid certain pain points).

Do you have specific examples in mind ? I'm not too familiar with language theory. I agree that an external scanner will be mandatory anyway, and TS allows to somewhat mix external scanner types with the "classic" grammar.js types if I remember correctly, since you always pass a list of "possible" token types to the scanner function. But I am not sure if that's what you meant.

5. I think that the work I have been doing in [lex-abbrev.rkt](https://github.com/tgbugs/laundry/blob/next/laundry/lex-abbrev.rkt) can help because it establishes many of the regex patterns that are needed.

lex-abbrev is probably going to be most of the base types for the grammar.js file. I say "most" because one challenge I dread is being able to have different TODO states keywords in a file, and you seem to do so as well :)

https://github.com/tgbugs/laundry/blob/5a396bef98d9a3cd9ee929f21cd47612dd6cb1ac/laundry/lex-abbrev.rkt#L105

Currently my "wild" idea would be to support it by

having a #+TODO_KEYWORDS property in the files where you don't want the default
having the TS parser update its rules per file based on that.

The only way I see this solution working is if it's implemented entirely in the external scanner, since it's the only place where we have some state we can manipulate at runtime for TS parsers as far as I understand.

6. At some point I'm going to write an experience report on this which should make it easier to go about creating the TS lexer and grammar.

Definitely, very muchy, can't wait-y looking forward to this

7. I've only just gotten to the point in the grammar where I am starting to find the deep edge cases, ambiguities, irregularities, etc in the interactions between different parts of the grammar (particularly for org objects in paragraphs).

Looking at that cursed.org file was fun :)

8. Overall I think this is good news for writing a tree sitter grammar because we can move most of the complexity into an external scanner.

:+1:

9. The high level grammar that is based on tokenizing whole lines is something that could go into a tree sitter right now, however if you were to do that there is a good chance that the architecture would end up being hard to extend to include the reparse of the nested elements. Given this and the fact that there are a bunch of gnarly issues that need to be worked out due to inconsistencies within the elisp implementation, I think it probably makes sense to hold off on TS work for the time being, because I expect it will be significantly easier in the future after we have work through the known issues.

Makes sense yes. Anyway I am not close to good enough to make it happen now, and I'll soon go back to working, so I'll have less time to give to this effort.

With regard to the format for the expected output for tests. I use laundry's intermediate representation in partially expanded s-expression form as the target right now. However, it is still evolving so some names could change, though it is slowly approaching stability. I think that talking about org parsing semantics in terms of the s-expression form that they should parse to is a reasonably good idea idea. It also happens to be a format that is easy for this implementation to test against. An alternative would be to test via a round-trip. But that is something that tree sitter is unlikely to be able to do easily. I'll have to look into what might be possible for TS.

Round-trip seems too costly to do, because that means I'd also have to work on some ways to write org files from a tree, which another challenge as well. It'd be nice to have that for sure, but the cost seems really high.

As I hinted towards the end of my message, I kickstarted a simple converter from .org to tree-sitter test files, entirely based on org-element tree with a small transformer : https://github.com/gagbo/org-to-tree-sitter-corpus

Currently I target the tree-sitter S-expressions mostly because I didn't look enough into laundry to find the tests, but I'm very open to suggestions to the format anyway. For the time being I'll incrementally raise the coverage of org-element node types, until this can be properly used to have Emacs emit an AST without all the extra information that we don't need for "just a parser"

For the record, I added ERT tests to show the current format of S-expressions that I'm targeting:

https://github.com/gagbo/org-to-tree-sitter-corpus/blob/db09c75c50b9c701fa18adef50e42d6c75aad84a/org-to-tree-sitter-corpus-tests.el#L42-L43

Of significant interest https://github.com/milisims/tree-sitter-org. @milisims for awareness.

This has way more chance to reach maturity than what I'm doing, happy to see it. Hopefully I'll reach at least parity on the types of elements parsed between my EmacsLisp generator and the current corpus, and then try to get incrementally more coverage

I think that the corpus generator is of great interest for all the projects. Being able to interconvert so that everyone has access to the same test cases and expected results should allow us to communicate more clearly about any divergence we see on certain tests.

Hey cool! I don't really know anything about racket, but I'm happy to play in where possible. One thing to note when considering tree-sitter parsing is that the design of TS is largely meant to facilitate permissive grammars, which can be made more specific via queries at runtime.

For example, rather than using a messy external scanner (which was my first attempt), in the grammar the first word in a headline is assigned an anonymous node called "keyword?", meant to imply that it is a candidate for a keyword. Then a query can find it clearly and simply with something like (item "keyword?" @keyword (#eq? @keyword "TODO")). The queries are easily generated at runtime, allowing for user customization. A similar approach is used for allowing what markup leads to bold, italics, etc. Arguably, it might be smart to do the same thing for markup candidates rather than parsing the markup regions specifically, they are causing me some issues..

As for my current implementation, I definitely made a few Decisions to ignore or change a few details separate from the emacs orgmode spec, which I didn't document :(, and those were chosen to simplify the grammar where I could. A simple example is that sections include the headline, and the orgmode syntax's section is closer to the (body) node in my grammar.

Another consideration for generating the parser might be that conflicting tokens can be hard to wrap your head around, in my experience a lot of trial and error was necessary. Taking a peek at conflicting tokens, in particular #4, it was necessary for me to awkwardly hand write some conflicts so that the parser would consider that the currently matched characters could be just text or some other syntax element, and it's not necessarily an error if the text element isn't matched. It could just be text.

I think that the corpus generator is of great interest for all the projects. Being able to interconvert so that everyone has access to the same test cases and expected results should allow us to communicate more clearly about any divergence we see on certain tests.

This would definitely be super cool!

@gagbo FWIW, I also started work on a corpus. As it says on the project, I’m happy to collaborate/share.

Also, I think this idea is worth exploring if I do say myself.

Hi,

Corpus

I also started work on a corpus.

Sure it looks pretty cool:

I don’t see myself trying to build manually all test cases, knowing that I’m not a very proficient (read "trying to use all syntactic constructs ever") org-mode user.
Your dumptree function looks like a subset of what I aim to provide with mine, mostly because I want to parse the metadata along with the element type, in order to output a more precise tree for tree-sitter test.

Basically, from what I gather, running dump-tree on

* Has a tag :tag3:

would give

(org-data (headline))

whereas I want the more precise

(org_data (headline (stars) (title) (tags)))

because if Tree-Sitter is to use such a grammar, we need all the tokens for structural editing.

My goal with ottsc is to be the easiest way to create a set of corpus tests for Tree-Sitter, directly from emacs-lisp implementation, since the spec is a little hard to navigate currently. For example, if there’s an issue coming in kristijanhusak/orgmode.nvim because something is behaving differently in Neovim and in Emacs, we can directly extract the snippet of .org content, run it through this EmacsLisp function, and get an extra test to add to milisims/tree-sitter-org so that it’s easier to understand what Emacs sees vs. what the tree-sitter grammar sees.

Other stuff

Also, I think this idea is worth exploring if I do say myself.

LSP

I’m not a huge fan of the lsp-server for org-mode, just because if you need to run an emacsen as the server, I might as well just edit the org-mode file in Emacs. Having an LSP-server that is not an emacs process is basically the same problem as writing a parser that’s not elisp, so I’m assuming that "LSP server" means "make an Emacs package that follows the server-side spec of LSP, and plug it to org-mode files".

@tecosaur doesn’t think so though and might have a different opinion.

Separating parsing from editing

I don’t think I agree with the idea that "consumers" and "producers" of syntax should be different. It is true that while editing almost all the time the buffer is in an erroneous state regarding the expected grammar. That’s one of the reason tree-sitter got popular and is performant in my opinion: you give tree-sitter the actual grammar, and it’s TS’ role to recover grammar errors and produce an AST taking this into account. It doesn’t mean that the editor should ignore the syntax because the buffer is almost always bad.

I think that the separation between org-readers and org-editors mostly come from the difficulty of manipulating the tree once it has been parsed. But there are example of tools that do both like python parsers, I don’t think that still thinking both operations as separate is really good. I really really like the structural editing approach that tree-sitter can provide, and it’s also part of the appeal of org in Emacs: how can you M-x org-refile a sub-tree (or archive it, or manipulate its deadlines and all of its child nodes’) if you don’t have in your editor a parser that can

read all your org-directory/org-agenda-files files,
parse those, and
extract headlines, their properties, and their locations

I think we all agree that Org-mode not having an easy EBNF grammar, and being mostly specced from org-element.el is not ideal. Tom and others’ efforts and conversations on org-devel mailing list try to make the grammar more strict and specified to get out of this situation. If I remember correctly, there was even a green light from either Bastien Guerry or Nicolas Goaziou to make the syntax stricter and breaking "legacy" org files if it means that the grammar can be written to be easier to understand and easier to write a parser for. But it will take time, and meanwhile I mostly just want to keep exploring org-mode syntax one way or another to see what can be done to simplify its parsing and maybe get an easily embeddable parser so we can stop being vendor-locked to Emacs because of Org.

I'm having a hard time following your explanation. Mostly my point is that there need not be a tight coupling of the the parser implemented in emacs and users of the file format. I'm assuming you want to use the format outside of emacs, because there really is no point otherwise. The various org mode parsers out there in various languages are basically trying to make use of the format outside of emacs. All this speaks to the need for either a looser coupling or for emacs itself to use a more portable spec with an EBNF. A looser coupling means we could have the best of both worlds while we wait for the emacs implementation to use a more well defined spec.

More simply, we don't need emacs to enforce a stricter grammar. We can let a linter enforce the grammar. For people who need stuff to conform to a stricter grammar, they would have to use the linter. This can happen at the same time that the emacs folks do whatever they want with org-element.el. You don't have to use a linter, but you can choose to use a linter. This not about "making emacs follow a server side spec".

More specifically, outside of emacs they can use any of these tools which parse the format (which have been written with a strict grammar) but when they work inside of emacs, they can use a linter to make sure it's compatible with the tools used outside of emacs.

I'm not sure what you see producing this corpus. You say "directly from emacs-lisp implementation", but I don't know what that means. The org mode code doesn't generate files, it only parses them. Then you talk about "extracting" from an issue in a Lua implementation of org-mode. My guess is that you mean that someone has produced an org file to show a difference between that implementation and the Lua implementation. The important point is that someone must write that file.

In any event, if you're not interested in collaborating that's fine. It sounds like you're not, though it's not clear to me that you've understood the suggestions.

I guess I was unclear once again, I can never explain things in github issues it seems. Anyway this is getting off-topic as far as I understand so it doesn’t really matter now.

The point of producing corpus from org-element is to get a non-regression testing suite for anything that wants to parse org; I don’t want to try writing a spec/grammar entirely new and external like neorg if I’m not discussing this on emacs-orgmode mailing list beforehand. There are going to be weird stuff happening when relying on org-element for sure, but I was hoping to show these weird cases upstream (in the mailing list) and ignoring these edge cases in other tools until it’s solved in org-mode proper.

I see. That description says nothing about avoiding writing tests manually. I don’t think there is any way to avoid writing tests manually.

If you take a look you’ll see my project provides a collection of samples which could be used for this purpose. If you look at the description, it describes a very similar goal to the one you describe.

The suggestion to provide a “linter” for org-mode is secondary but I think fits in with the overall aim as I described above. Waiting for org-mode itself to be built on a EBNF may take a very long time. A linter would assure compatibility in the meantime for users who choose to use it.

tgbugs / laundry