millefoglie / latex-dom

LaTeX DOM Parser

Is 'latex-dom' still being developed? #3

Open imagingbook opened 1 year ago

imagingbook commented 1 year ago

@millefoglie Hello, I forked and played with your code some time back (~2 yrs) and found that it has been refactored quite a bit since. I liked the clean structure of the code but also found quite a few deficiencies. My intention is to use it for a LaTeX linter, a project I started but had no time to finish and would like to resume now.

What is the state of this project? Is it still being developed?

millefoglie commented 1 year ago

Hi! Thank you for having interest in my work.

I have kind of postponed working on it, because I also had other things to do. And as a whole it was more of an experiment. Most likely, there are better ways to write parsers. However, I still have some plans to get back to it to finish what I had on my todo list, mainly adding node builders for manipulating the DOM tree, and then publish it under some license.

Is there anything in particular that you find missing?

imagingbook commented 1 year ago

Hello back, thanks for the response! The situation is quite similar here: I stopped my (private) project 2.5 yrs ago and got back to it again now, only to find out that I had forgotten most of what I did. While getting back up to speed I looked over your project once more, to find out if it makes sense to switch over to your parser. Today I quickly set up a few of my tests to check the general behavior. It is far from clear what to expect from a good LaTeX parser, since everybody knows that a perfect parser is practically impossible to build. Eventually I could adapt to whatever DOM the parser delivers, as long as it contains enough information. E.g., in my application it is very important to know whether some item is in math mode or not, and also that comments and verbatim items are handled properly.

If you are interested you can find my trials in this branch: https://github.com/imagingbook/latex-dom/tree/wilbur

As you asked, a few things I noticed:

My own parser is based on the PEG technique and I wrote it from scratch. It works in most situations, but it cannot handle all the special cases that occur in practice. If you are interested I'd be happy to share (it is not published yet).

All the best, Wilhelm

millefoglie commented 1 year ago

Hi,

To be honest I'd also try something like PEG instead of doing it all manually. And maybe I'll migrate to it eventually. But as I don't have much experience with parsers, and it's just an experiment now, I don't mind the naive approach. Plus, I have some doubts on how easy it is to write a grammar for LaTeX.

Anyways, regarding your points. I think I faced or fixed something similar to that lost last token. But I might be wrong.

The brackets, as I remember, aren't matched, or are only matched while reading a command definition, e.g. \cmd{..}{..}[..]{..}. It's hard to tell if [...] belongs to a command or is just text. And nothing forbids having just a single character like [, (, ), ] that doesn't enclose anything. So, in $[0,1)$ the parser shouldn't treat brackets/parentheses in any special way for now.
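
A minimal sketch of that idea, for illustration only: brackets are matched solely while reading the arguments that directly follow a command name, and an opener without a matching closer is left as ordinary text. This is not the latex-dom implementation. The names `read_command` and `_find_close` are hypothetical, and the sketch assumes no whitespace between arguments and ignores comments and verbatim content.

```python
import re

_NAME = re.compile(r"\\([A-Za-z]+)")   # control word such as \cmd
_CLOSER = {"{": "}", "[": "]"}         # delimiters considered as arguments


def _find_close(src, start):
    """Index of the closer matching the opener at src[start], or None."""
    opener, closer = src[start], _CLOSER[src[start]]
    depth, i = 0, start
    while i < len(src):
        ch = src[i]
        if ch == "\\":          # skip escaped characters such as \{ or \]
            i += 2
            continue
        if ch == opener:
            depth += 1
        elif ch == closer:
            depth -= 1
            if depth == 0:
                return i
        i += 1
    return None


def read_command(src, pos=0):
    """Read a command at pos; return (name, [(opener, content), ...], end)."""
    m = _NAME.match(src, pos)
    if m is None:
        raise ValueError(f"no command at position {pos}")
    name, i, args = m.group(1), m.end(), []
    # Greedily take groups that directly follow the command name.
    while i < len(src) and src[i] in _CLOSER:
        end = _find_close(src, i)
        if end is None:         # unmatched opener: treat it as plain text
            break
        args.append((src[i], src[i + 1:end]))
        i = end + 1
    return name, args, i
```

With this, `read_command(r"\cmd{a}{b}[c]{d} rest")` collects four arguments, while `read_command(r"\in[0,1)")` collects none because `[` has no matching `]`. The ambiguity mentioned above remains: for `\in[0,1]` the sketch would wrongly take `[0,1]` as an optional argument, which only knowledge of the command's signature could resolve.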

For the rest, I don't really have an answer now, but it would help if you could share a sample file where things get broken.

imagingbook commented 1 year ago

I perfectly understand and did not really expect anything to be fixed. It was important for me to be clear about the state of the project. LaTeX parsing is tricky terrain, and I finally decided to focus on my own implementation (once again).

Nevertheless, if you want to look at some of the mentioned test cases, you can find them here: https://github.com/imagingbook/latex-dom/blob/wilbur/src/main/java/wilbur/StringInputTest.java

millefoglie commented 1 year ago

Hi! Sorry for the late reply. I had a look at some of the strings in your test. Indeed, there are a couple of bugs in my parser.

imagingbook commented 1 year ago

Hello again! Yes, LaTeX does a lot of strange things. My comments were not meant as bug reports, just examples of things I am struggling with myself. I just halted my own parser project because I was getting deeper and deeper into this mess.

Since you asked, I have a private LaTeX document that I use for testing various tricky situations. You can find it here: https://github.com/imagingbook/latex-dom/tree/develop/latex-tests

imagingbook commented 1 year ago

Btw, a major requirement in my application is to find out, for each DOM node, whether it is in "text" or "math" mode. For example, I need to handle constructs like

... this is text mode $math mode \text{in text $more math$ now} x_3^n$ and back to text ...
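
To make the requirement concrete, here is a minimal sketch of mode tracking with a stack, written in Python for brevity. It is not taken from either parser. The function name `annotate_modes` is hypothetical, and the sketch assumes that only `$...$` and `\text{...}` switch modes; it ignores `$$`, `\(...\)`, environments, comments, and verbatim content.

```python
def annotate_modes(src: str) -> list[tuple[str, str]]:
    """Split src into (mode, chunk) pairs, where mode is 'text' or 'math'.

    The delimiters `$`, `\\text{` and its closing brace are dropped;
    ordinary `{...}` groups are kept and inherit the enclosing mode.
    """
    stack = [("text", "top")]          # frames of (mode, opener kind)
    segments: list[list[str]] = []

    def emit(chunk: str) -> None:
        mode = stack[-1][0]
        if segments and segments[-1][0] == mode:
            segments[-1][1] += chunk   # extend the current run
        else:
            segments.append([mode, chunk])

    i = 0
    while i < len(src):
        ch = src[i]
        if src.startswith("\\text{", i):
            stack.append(("text", "text-brace"))   # back to text mode
            i += len("\\text{")
        elif ch == "\\" and i + 1 < len(src):
            emit(src[i:i + 2])         # escaped char (e.g. \$) or command start
            i += 2
        elif ch == "$":
            if stack[-1] == ("math", "$"):
                stack.pop()            # closing $
            else:
                stack.append(("math", "$"))        # opening $
            i += 1
        elif ch == "{":
            stack.append((stack[-1][0], "brace"))  # group inherits the mode
            emit(ch)
            i += 1
        elif ch == "}" and stack[-1][1] in ("brace", "text-brace"):
            if stack.pop()[1] == "brace":
                emit(ch)               # keep braces of ordinary groups
            i += 1
        else:
            emit(ch)
            i += 1
    return [(mode, chunk) for mode, chunk in segments]
```

Applied to the example above, the modes come out as text, math, text, math, text, math, text, with `more math` correctly reported as math nested inside `\text{...}`. A DOM-based parser would attach the same stack state to each node as it is created rather than returning flat segments.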