Is 'latex-dom' still being developed?

imagingbook commented 1 year ago

@millefoglie Hello, I forked and played with your code some time back (~2 yrs) and found that it has been refactored quite a bit. I liked the clean structure of the code but also found quite a few deficiencies. My intention is to use it for a LaTeX linter, a project I started but did not have time to finish and would like to resume now.

What is the state of this project, is it still being developed?

millefoglie commented 1 year ago

Hi! Thank you for having interest in my work.

I have kind of postponed working on it, because I also had other things to do. And as a whole it was more of an experiment. Most likely, there are better ways to write parsers. However, I still have some plans to get back to it to finish what I had on my todo list, mainly adding node builders for manipulating the DOM tree, and then publish it under some license.

Is there anything in particular that you find missing?

imagingbook commented 1 year ago

Hello back, thanks for the response! The situation is quite similar here: I stopped my (private) project 2.5 yrs ago and got back to it again now, only to find out that I had forgotten most of what I did. In the course of recovering my status I looked over your project once more, to find out if it makes sense to switch over to your parser. Today I quickly set up a few of my tests to check the general behavior. It is far from clear what to expect from a good LaTeX parser since everybody knows that a perfect parser is practically impossible to build. Eventually I could adapt to whatever DOM the parser delivers, as long as it contains enough information. E.g., in my application it is very important to know whether some item is in math mode or not, also proper handling of comments and verbatim items.

If you are interested you can find my trials in this branch: https://github.com/imagingbook/latex-dom/tree/wilbur

As you asked, a few things I noticed:

There is a problem at the end of the input string or file, causing the last token to be dropped.
White space handling after comments and newlines does not quite comply with TeX rules.
For nodes on the same level, the first sibling always has a 'null' parent, the others are OK.
Commands are properly recognized (mostly) but it seems their names are lost (easy to fix).
Non-balanced brackets are not always wrong (e.g., the interval $[0,1)$).
I tried to parse two small-sized but valid LaTeX files without success.
It would be valuable to keep the text position with each node, to make errors easier to locate.
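To illustrate the first and last points: a common cause of a dropped final token is a scanner that only emits a token when it sees the following delimiter, so the token pending at end of input is never flushed. The sketch below is a generic whitespace scanner, not latex-dom's actual tokenizer (I do not know where the bug sits there). The names `PosToken` and `WordScanner` are made up for this example. It flushes the pending token at end of input and stores source offsets with every token.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical token type for illustration; not part of latex-dom. */
final class PosToken {
    final String text;
    final int start; // offset of the first character, inclusive
    final int end;   // offset just past the last character, exclusive

    PosToken(String text, int start, int end) {
        this.text = text;
        this.start = start;
        this.end = end;
    }
}

final class WordScanner {
    /** Splits the input on whitespace and keeps source offsets for every token. */
    static List<PosToken> scan(String input) {
        List<PosToken> tokens = new ArrayList<>();
        int tokenStart = -1; // -1 means "currently not inside a token"
        for (int i = 0; i < input.length(); i++) {
            boolean ws = Character.isWhitespace(input.charAt(i));
            if (!ws && tokenStart < 0) {
                tokenStart = i; // a new token begins here
            } else if (ws && tokenStart >= 0) {
                tokens.add(new PosToken(input.substring(tokenStart, i), tokenStart, i));
                tokenStart = -1;
            }
        }
        // Flush the token that is still pending when the input ends.
        // Omitting this step silently drops the last token.
        if (tokenStart >= 0) {
            tokens.add(new PosToken(input.substring(tokenStart), tokenStart, input.length()));
        }
        return tokens;
    }
}
```

With offsets stored like this, a linter can map any token (or a node built from it) back to a line and column by counting newlines up to `start`.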

My own parser is based on the PEG technique and I wrote it from scratch. It works in most situations, nevertheless I cannot handle all special cases that occur in practice. If you are interested I'd be happy to share (it is not published yet).

All the best, Wilhelm

millefoglie commented 1 year ago

Hi,

To be honest I'd also try something like PEG instead of doing it all manually. And maybe I'll migrate to it eventually. But as I don't have much experience with parsers, and it's just an experiment now, I don't mind the naive approach. Plus, I have some doubts on how easy it is to write a grammar for LaTeX.

Anyways, regarding your points. I think I faced or fixed something similar to that lost last token. But I might be wrong.

The brackets, as I remember, aren't matched, or are only matched while reading a command definition, e.g. \cmd{..}{..}[..]{..}. It's hard to tell whether [...] belongs to a command or is just text. And nothing forbids having a single character like [, (, ), ] that doesn't enclose anything. So, in $[0,1)$ the brackets/parentheses shouldn't be treated in any special way for now.
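One possible way to keep an unmatched [ from being treated specially is to open a bracket group only tentatively and dissolve it again if the enclosing scope ends before a ] shows up. The following is a minimal sketch of that idea under simplifying assumptions: it works on single characters instead of real tokens, and `SketchNode` and `BracketUnwrap` are hypothetical names, not latex-dom classes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical node type for illustration; not the latex-dom node model. */
final class SketchNode {
    final String kind; // "group", "bracket", or "char"
    final String text; // only used by "char" nodes
    final List<SketchNode> children = new ArrayList<>();

    SketchNode(String kind, String text) {
        this.kind = kind;
        this.text = text;
    }
}

final class BracketUnwrap {
    /** Parses the content of one scope, e.g. the text between two $ signs. */
    static SketchNode parseScope(String content) {
        SketchNode root = new SketchNode("group", "");
        Deque<SketchNode> stack = new ArrayDeque<>();
        stack.push(root);
        for (char c : content.toCharArray()) {
            if (c == '[') {
                // Open a bracket group tentatively.
                SketchNode bracket = new SketchNode("bracket", "");
                stack.peek().children.add(bracket);
                stack.push(bracket);
            } else if (c == ']' && stack.size() > 1) {
                stack.pop(); // the bracket group was closed properly
            } else {
                stack.peek().children.add(new SketchNode("char", String.valueOf(c)));
            }
        }
        // The scope ends: dissolve every bracket group that is still open,
        // innermost first, replacing it by a plain "[" plus its children.
        while (stack.size() > 1) {
            SketchNode open = stack.pop();
            SketchNode parent = stack.peek();
            int idx = parent.children.indexOf(open);
            parent.children.remove(idx);
            parent.children.add(idx, new SketchNode("char", "["));
            parent.children.addAll(idx + 1, open.children);
        }
        return root;
    }

    /** Renders the tree back to text; useful for round-trip checks. */
    static String render(SketchNode n) {
        if (n.kind.equals("char")) {
            return n.text;
        }
        StringBuilder sb = new StringBuilder();
        for (SketchNode child : n.children) {
            sb.append(render(child));
        }
        return n.kind.equals("bracket") ? "[" + sb + "]" : sb.toString();
    }
}
```

For the content `x \in [0,1)` this yields a flat list of character nodes with no bracket node, while `a[b]c` keeps its bracket group. Deciding whether a properly closed [...] is an optional command argument is a separate question that this sketch does not address.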

For the rest, I don't really have an answer now, but it would help if you could share a sample file where things get broken.

imagingbook commented 1 year ago

I perfectly understand and did not really expect anything to be fixed. It was important for me to be clear about the state of the project. LaTeX parsing is tricky terrain and I finally decided to focus on my own implementation (once again).

Nevertheless, if you want to look at some of the mentioned test cases you find them here: https://github.com/imagingbook/latex-dom/blob/wilbur/src/main/java/wilbur/StringInputTest.java

millefoglie commented 1 year ago

Hi! Sorry for the late reply. I had a look at some of the strings in your test. Indeed, there are a couple of bugs in my parser.

"a comment%like this\n in the text" - this one should be ok. The way it's written, it actually is a 2-line string, so only like this is a part of a comment. Or I didn't get it.
"the line is broken \\[6pt]before these words" - a valid point, I forgot about in commands
" math text $x = 3^y \in [0,1)$ with unbalanced brackets" - this is a bug, [ is treated as an argument of \in command. I'm not sure what to do with brackets in general, but here the parser should replace a BracketNode with a list of its child nodes when closing the scope.
"This is \verb*!verbatim \foo %&§() text! followed by some more." - this is not supported, and I've never seen this before. However, \begin{verbatim} should work.
"a European comma like 15,23 is OK in text, but is wrong in math mode!" - what's wrong with the comma?
"Using \"<French\"> quo\-tation marks is less-frequent." - quotation marks should be fairly easy to add, though again I've never seen this before, only << and >>. It would help if you could share a link to some document where I could find all these special cases. And hyphens shouldn't be handled, I guess. Or at least I don't know how to handle them right now.

imagingbook commented 1 year ago

Hello again! Yes, LaTeX does a lot of strange things. My comments were not meant as bug reports, just examples of things I am struggling with myself. I just halted my own parser project because I was getting deeper and deeper into this mess.

"a comment%like this\n in the text": the issue here is that the trailing comment in the first line also consumes the leading white space in the following line. I.e., there is no space between ... comment and in ...
The inline \verb ... macro occurs very frequently (I also use it a lot). But it is easy to parse because there is no nesting and it may not wrap over lines.
European commas: if you write $15,23$ in math mode, LaTeX inserts a space after the comma (unlike in $15.23$ ). Not sure if this is relevant to parsing.
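The first two points can be sketched in a few lines. This is only a rough illustration under simplifying assumptions, not code from either parser, and `TexLexingSketch` is a made-up name. `stripComments` drops everything from an unescaped % through the end of the line plus the leading blanks of the next line. `readVerb` reads an inline \verb with an arbitrary non-letter delimiter and rejects it if it spans a line break. Note that `stripComments` ignores paragraph breaks after a comment line and knows nothing about verbatim content, so a real lexer would have to recognize \verb before looking for comments.

```java
final class TexLexingSketch {
    /** Removes comments, including the newline and the next line's leading blanks. */
    static String stripComments(String src) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '\\' && i + 1 < src.length()) {
                // Copy an escaped character such as \% unchanged.
                out.append(c).append(src.charAt(i + 1));
                i += 2;
            } else if (c == '%') {
                // Skip to the end of the line ...
                while (i < src.length() && src.charAt(i) != '\n') {
                    i++;
                }
                if (i < src.length()) {
                    i++; // ... skip the newline itself ...
                }
                // ... and the leading blanks of the following line.
                while (i < src.length() && (src.charAt(i) == ' ' || src.charAt(i) == '\t')) {
                    i++;
                }
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    /**
     * Returns the body of a \verb or \verb* construct starting at pos,
     * or null if there is none or it is not closed on the same line.
     */
    static String readVerb(String src, int pos) {
        if (!src.startsWith("\\verb", pos)) {
            return null;
        }
        int i = pos + 5; // length of "\verb"
        if (i < src.length() && src.charAt(i) == '*') {
            i++; // starred form
        }
        if (i >= src.length()) {
            return null;
        }
        char delim = src.charAt(i);
        if (Character.isLetter(delim)) {
            return null; // e.g. \verbatim is a different command name
        }
        int close = src.indexOf(delim, i + 1);
        if (close < 0) {
            return null;
        }
        String body = src.substring(i + 1, close);
        return body.indexOf('\n') >= 0 ? null : body;
    }
}
```

Applied to the test string from above, `stripComments("a comment%like this\n in the text")` gives `a commentin the text`, with no space between the two words.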

Since you asked, I have a private LaTeX document that I use for testing various tricky situations. You can find it here: https://github.com/imagingbook/latex-dom/tree/develop/latex-tests

imagingbook commented 1 year ago

Btw, a major requirement in my application is to find out for each DOM node if it is in "text" or "math" mode. For example, I need to handle constructs like

... this is text mode $math mode \text{in text $more math$ now} x_3^n$ and back to text ...
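A mode stack is one way to handle this kind of nesting. The sketch below is a simplified illustration, not my actual implementation, and `ModeTracker` is a hypothetical name. It assumes that only $...$ enters math mode and only \text{...} returns to text mode. Display math, \( ... \), environments, and other text-switching commands such as \mbox or \textbf are ignored. Each open scope remembers its mode and the character that opened it, so a $ inside \text{...} opens a new math scope instead of closing the outer one.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class ModeTracker {
    /** One open scope: its mode and the character that opened it. */
    private static final class Frame {
        final char mode;   // 'T' for text, 'M' for math
        final char opener; // '$', '{', or '\0' for the document level

        Frame(char mode, char opener) {
            this.mode = mode;
            this.opener = opener;
        }
    }

    /** Returns 'T' or 'M' for every character of src. */
    static char[] modes(String src) {
        char[] result = new char[src.length()];
        Deque<Frame> stack = new ArrayDeque<>();
        stack.push(new Frame('T', '\0'));
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            Frame top = stack.peek();
            if (c == '\\' && i + 1 < src.length() && src.charAt(i + 1) == '$') {
                // An escaped dollar sign does not change the mode.
                result[i] = top.mode;
                result[i + 1] = top.mode;
                i += 2;
            } else if (src.startsWith("\\text{", i)) {
                // The command itself belongs to the surrounding mode.
                for (int k = 0; k < 6; k++) {
                    result[i + k] = top.mode;
                }
                stack.push(new Frame('T', '{'));
                i += 6;
            } else if (c == '$') {
                result[i] = 'M'; // delimiters are counted as math
                if (top.opener == '$') {
                    stack.pop(); // closes the innermost math scope
                } else {
                    stack.push(new Frame('M', '$'));
                }
                i++;
            } else if (c == '{') {
                result[i] = top.mode;
                stack.push(new Frame(top.mode, '{')); // a plain group inherits the mode
                i++;
            } else if (c == '}' && top.opener == '{') {
                stack.pop();
                result[i] = stack.peek().mode;
                i++;
            } else {
                result[i] = top.mode;
                i++;
            }
        }
        return result;
    }
}
```

For the example above, the characters of `math mode`, `more math` and `x_3^n` come out as 'M', while `in text`, `now` and the surrounding prose come out as 'T'. A DOM builder could store the mode of the current top frame in each node as it is created.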
