Open clason opened 4 years ago
This seems perfectly reasonable. In my opinion, ChkTeX is a bit annoying, too. In particular #304 and #186 still linger around. Things like detecting mismatched braces and mismatched environments are pretty easy to implement, since we already do that when parsing. At the moment, I am working on making the parsers for LaTeX and BibTeX incremental so TexLab does not need to reparse on every keystroke and some general parsing improvements. I am almost done with LaTeX parsing and BibTeX should be pretty easy. At the moment, the incremental LaTeX parser is able to reparse large with more documents in less than 1 ms on my machine. This should solve most performance problems.
That's great; I was hoping that these would be rather easy as byproducts of the parsing. Would it make sense to just push any parsing errors straight to diagnostics? Otherwise it's probably best to start simple and add errors as they are suggested.
Incremental parsing sounds like an excellent improvement; I'm quite excited about that. On that note, have you looked at tree-sitter for parsing? There's no (good) grammar for it yet, but it's built for fast, incremental, error-recovering parsing.
Would it make sense to just push any parsing errors straight to diagnostics?
Yes, the BibTeX parser already does something like that.
On that note, have you looked at tree-sitter for parsing? There's no (good) grammar for it yet, but it's built for fast, incremental, error-recovering parsing.
I have looked into tree-sitter. Actually, I have two grammars for LaTeX and BibTeX lying around. For BibTeX, tree-sitter is working fine and is producing good results. However, for LaTeX, it has performance problems. I think, the problem is, that LaTeX parse trees are very shallow and not deeply nested like most languages. In particular, parsing plain text seems to be the problem. With no modification to the grammar, tree-sitter is slower at parsing incrementally than our old parser, which is not incremental. The performance can be improved by storing the text as a linked list, which is convenient for the parser because it can reuse more nodes but it makes it harder to work with. In addition to that, I encountered lots of problems with the borrow checker when trying to make the semantic analysis incremental too.
At the moment, I am using a hand-written incremental packrat parser based on the following paper:
Patrick Dubroy and Alessandro Warth. 2017. Incremental packrat parsing. In Proceedings of the 10th ACM SIGPLAN International Conference on Software Language Engineering (SLE 2017). Association for Computing Machinery, New York, NY, USA, 14–25. DOI:https://doi.org/10.1145/3136014.3136022
The hand-written (and not that much optimized. There is stil a paper left which optimizes the invalidation process of the memoization table) parser is about 4 times as fast as the optimized tree-sitter parser and way easier to use. For the syntax tree design, I mostly followed the way Roslyn handles parse trees (green and red trees): Boths trees are immutable. The green tree is constructed bottom-up and can be reused in the next parse, while the red tree is constructed top-down on-demand and is thrown away every parse.
Thank you, that's very interesting to hear. The packrat parser sounds fascinating, really cool!
Would you be willing to share the tree-sitter parsers? They might not be competitive for texlab, but neovim has started leveraging tree-sitter for an increasing variety of things (syntax highlighting, folding, indentation, structural text editing), and it can't be slower than the current regex parsing :) If it was made public, I'm sure there are quite a few people interested in contributing.
Would love this, chktex can be useful and very annoying at the same time.
Would you be willing to share the tree-sitter parsers?
I think https://github.com/Aerijo/tree-sitter-bibtex/blob/master/grammar.js already does a good job for BibTeX. It just needs to be modified to handle Windows line endings as well but this change should be straightforward. Regarding the LaTeX grammar, I can say that our grammar is very much suited to the use in TexLab. In particular, the grammar now needs to be updated to implement parsing of key value pairs as well in order to implement #324. Once that one is done, I can think I can update the tree-sitter grammar and share it. Maybe some one has a use for it.
I think https://github.com/Aerijo/tree-sitter-bibtex/blob/master/grammar.js already does a good job for BibTeX. It just needs to be modified to handle Windows line endings as well but this change should be straightforward.
Thank you, I'll take a look at it! (EDIT Seems it's not compatible with the latest version of tree-sitter. Bummer.)
Regarding the LaTeX grammar, I can say that our grammar is very much suited to the use in TexLab. In particular, the grammar now needs to be updated to implement parsing of key value pairs as well in order to implement #324. Once that one is done, I can think I can update the tree-sitter grammar and share it. Maybe some one has a use for it.
I think there's definite interest in the https://github.com/nvim-treesitter/nvim-treesitter community! As you wrote, the grammar will probably have to be adapted to that use, but that's no problem -- it'd very helpful to have a working parser to start from. (I looked into writing one from scratch, and quickly got lost...) As such, I wouldn't worry about putting in any more work into the grammar if you're not using it yourself. It's also not urgent by any means, so whenever you get around to it would be fine.
@clason Do you think, there are more types of (safe) errors that we can report after parsing/analysis that are not annoying for the user? Unfortunately, LaTeX is not as restrictive as BibTeX for example, where we can tell for sure that something is an error.
If you can catch undefined commands or environments, that might be useful?
I thought about this one too, but it can be very hard in some cases. Especially if you delve into TeX territory. I am currently updating the latex-completion-data
script with a new (more accurate) method of determining the defined commands and environments. Maybe this can be helpful. I think that this should definitely be a warning in the beginning due to the number of false positives that can occur but I'll keep it in mind.
Yeah, I assumed that this was hard to get (right). Honestly, mismatched or missing braces and environments pretty much already cover 99% of syntax errors I make, so coming up with more is hard ;)
Maybe math commands outside math mode? But you're right, people can redefine them at will, and there's no macro definition so crazy that someone isn't using it...
The only thing that might be safe is (multiple) subscript/superscript?
The only thing that might be safe is (multiple) subscript/superscript?
This should definitely be doable and pretty easy to implement with the current parser.
Another thing I would like is to control whether compiler errors and (especially!) warnings are forwarded to diagnostics. (I do not need to see all the overfull boxes
... unless I want to verify that the server is running ;))
Another thing I would like is to control whether compiler errors and (especially!) warnings are forwarded to diagnostics. (I do not need to see all the overfull boxes... unless I want to verify that the server is running ;))
I agree with you that the overfull box warnings are mostly useless as some of them simply cannot/should not be fixed. Do you think we need a more sophisticated filtering mechanism or is a simple flag for errors and warnings enough at the moment?
Being able to disable warnings (and optionally errors) is already useful. But I'm sure there will be requests for increased granularity eventually...
VimTeX allows providing a list of regular expressions which are matched against diagnostics to filter out, so you could just list { 'LaTeX Warning', 'Marginpar on page', 'Overfull', 'Underfull' }
and what else bugs you.
I think both would be of independent use (some just want to turn off warnings or even all output; others will want more fine-grained control). Start with the easiest, I'd say, and wait until someone requests more ;)
Another thing I would like is to control whether compiler errors and (especially!) warnings are forwarded to diagnostics. (I do not need to see all the overfull boxes... unless I want to verify that the server is running ;))
@clason There is a PR (#655), which provides simple support for filtering diagnostics based on regular expressions. What do you think?
I don't know your plans for texlab 3.0 (and I hesitate to put more on your plate), but would it make sense to add basic(!) static linting to
diagnostics
for TeX files, similar to what you have for Bib files?I don't mean anything like
ChkTeX
(which is too opinionated for my taste anyway), just some basic syntax errors like\frac{e^{\pi}{2}
)\begin{equation*}...\end{equation}
)(I usually find the TeX compiler output less than helpful, either due to opaque messages or to wrong locations especially with beamer, and these things would be nice to catch before compiling a 400 page manuscript and failing...)