tecosaur / LaTeX-Utilities

An add-on to LaTeX Workshop that provides some features that go beyond the bare essentials
MIT License
495 stars 30 forks source link

Linter for grammar #39

Closed PierreMarchand20 closed 2 years ago

PierreMarchand20 commented 5 years ago

Feature Request

Is your feature request related to a problem? Please describe.

Since the issue has been closed in https://github.com/James-Yu/LaTeX-Workshop/issues/881#issuecomment-533659579, I open an issue here in case you are interested in adding more linters.

Describe the solution you'd like

As I said in the previous link, we could add TeXtidote which uses LanguageTools in latex documents. It just strips out latex from your document, calls LanguageTools which then checks spell and grammar for a lot of languages. TeXtidote then computes where are the locations of the errors/warning in your document.

There was an issue (solved now) to have an easy-to-parse output (see here), so I guess it would not be too difficult to use.

It main be expensive to run it at every save for example, but having a command to call it would be nice.

tecosaur commented 5 years ago

Could be interesting. Probably more of a medium-term than short-term thing though.

tecosaur commented 5 years ago

@PierreMarchand20 something else I'd be interested in in part of speech syntax highlighting. Something like python's textblob may be useful. Another tool I'd be interested in adding support for would be vale. Would you be interested in working together to add more text/writing tools to this extension?

PierreMarchand20 commented 5 years ago

I have never worked on a vscode extension and typescript, but that could be interesting.

I looked at Vale but it seems that there are not kind to add support to LaTeX (https://github.com/errata-ai/vale/issues/54).

TextBlob seems interesting, but it only supports three languages (English, and via extensions, French and German) and it seems more focused on sentences analysis.

For the record, LanguageTool works for several languages (https://languagetool.org/languages), and it has also an extension in vscode, but it does not support latex.

In any case, I think we can always redo what Textidote does for calling LanguageTool on latex documents, which is replacing/removing every latex environments using regex and then call any linter we want.

tecosaur commented 5 years ago

If you know javascript and are willing to look at docs you'll be good :) if not: no worries. Long term, I think the best approach would be to implement the latex to plaintext, then translate positions back so that our choice of tools isn't limited. I wonder if we can just steal find inspiration from TeXtidote's implementation.

Then we can leave the installation of individual tools to the end-user, and just supply 'presets' i.e. the TeXtidote thing + output parsing to warnings etc. Perhaps have a setting like this

"latex-utils.externalToolIntegrations.enabled": ["vale", "TextBlob", "LanguageTool"]

What are your thoughts?

PierreMarchand20 commented 5 years ago

I do not know javascript, but I have already done some small contributions to one or two extensions for vscode and it did not seem too complicated ^^ I will try to help.

Yes, I also think it is the best approach! This way we can use any linter with the plaintext output a priori. The only thing we need to do for each linter is to parse their output, which should be not too difficult if it is well correctly done.

tecosaur commented 5 years ago

Sounds great! Here's my current plan:

PierreMarchand20 commented 5 years ago

Perfect !

tecosaur commented 5 years ago

Done! See https://github.com/tecosaur/LaTeX-Utilities/tree/dev-linting

I added vale as an example of how I think it could work.

For input cleaning I think we could start by looking at https://github.com/sylvainhalle/textidote/blob/master/Source/Core/src/ca/uqac/lif/textidote/cleaning/latex/LatexCleaner.java

Addendum: I'm a bit busy over the next few weeks, but I'll try to put in time where I can

PierreMarchand20 commented 5 years ago

Perfect ! I am also quite busy starting a new job but I will also try to take a look ^^ I think it could be really interesting for all the LaTeX writers around here ahah

tecosaur commented 5 years ago

I did a bit more with vale to see what sort of things we can hope for, and it's looking good! image I think the framework is generic enough that we can hope to do something similar with a bunch of tools, once we get the LaTeX to plaintext thing working (and then also a bunch of other issues).

PierreMarchand20 commented 5 years ago

Ok, I downloaded vale and cloned the repo. I started to play with it a little bit and I have some questions (remember, I have never worked on an vscode extension or with typescript).

PierreMarchand20 commented 5 years ago

I have not understood what is done in textidote, but here’s how I think we can implement that:

With these two objects, I think we can do the change of coordinates for the diagnostics given by the linter.

I will try to implement that next week.

PierreMarchand20 commented 5 years ago

But I must say that vale is almost working straightforwardly for my latex documents...

I am not a fan of the output “writing-good” though ^^ I have to try other styles

tecosaur commented 5 years ago

Great to see you've started looking at this! I look forward to seeing more of your work.

Answering your questions

In main.ts we have this block

vscode.workspace.onDidSaveTextDocument((e: vscode.TextDocument) => {
    if (e.uri.fsPath === extension.completionWatcher.snippetFile.user) {
        extension.completionWatcher.loadSnippets(true)
    } else {
        extension.wordCounter.setStatus()
        if (utils.hasTexId(e.languageId)) {
            extension.diagnoser.lintDocument(e) // <--- this line here
        }
    }
}),

Mimicing texidote

This seems like the sort of thing which benefits a lot by chosing a good data structure for it. I'll have a think about it and let you know if I have any ideas different to yours.

My own question

How do you find the interface IDiagnosticSource? (as implemented in vale.ts). If you've got any feedback it would be best to make use of it now rather than later :)

tecosaur commented 5 years ago

( btw, I don't think we should need a vale path, because vale <arguments> should work, but for some reason it didn't work on my machine )

PierreMarchand20 commented 5 years ago

Thank you for your answers! Are there any reasons why you need to check e.uri.fsPath === extension.completionWatcher.snippetFile.user for calling the linter ?

I will wait for your feedback on the data structure before starting to implement it then ^^

IDiagnosticSource looks nice, the only thing is that I would add somewhere is a way for the user to give its own args to each linter.

PierreMarchand20 commented 5 years ago

A remark I would make though is that you are always aiming to linters that we have via command line. But for example, Vale or LanguageTool both have a vscode extension, so may be it would be possible to use them ? It is just an idea.

PierreMarchand20 commented 5 years ago

I noticed that Vale has a lot of style to help with styles, but I am not sure it really helps with grammar. LanguageTool also has a java executable standalone here and it is available in a lot of languages, I suggest we add this one after Vale ^^

PierreMarchand20 commented 5 years ago

( btw, I don't think we should need a vale path, because vale <arguments> should work, but for some reason it didn't work on my machine )

I will remove it and let vale instead then. I noticed I had to reload vscode the first time to update my PATH variable that must contain the path to 'vale'. Otherwise, try to output 'process.env'

PierreMarchand20 commented 5 years ago

I started to play around with LanguageTool, the JSON format is defined here, I will try to define the associated linter.

tecosaur commented 5 years ago

Let's see

What's e.uri.fsPath === extension.completionWatcher.snippetFile.user doing?

More command line args

Vale or LanguageTool both have a vscode extension

Data structure

LanguageTool after Vale

PierreMarchand20 commented 5 years ago

Ok for e.uri.fsPath === extension.completionWatcher.snippetFile.user, what does it have to do with linters ? Could we put the lines

if (utils.hasTexId(e.languageId)) {
     extension.diagnoser.lintDocument(e) // <--- this line here
}

outside of this check ?

Could you show me where you found the documentation for the vscode diagnostic ? So that I can do the link between the vscode diagnostics and those of LanguageTool.

PierreMarchand20 commented 5 years ago

I have a first version of LanguageTool.ts, could you look at it ? The only issue was that they do not use position as (line,character), but only give an offset in the whole document. And of course, there are a lot of false positive due to latex :-)

LanguageTool offers a lot of outputs, I just used a small part for now... And I am yet to do the code-actions.

PierreMarchand20 commented 5 years ago

I just added the code actions, I learned a lot of typescript in a few hours xD

PierreMarchand20 commented 5 years ago

Last thing and then I stop spamming. The issue is that LanguageTool gives an offset and a length, considering the document as a big string. It is a recurring problem with VSCode so there are functions (positionAt) associated with vscode.TextDocument to go from this representation to using a pair of (line,offset). Problem: we need to call LanguageTool on vscode.TextDocument after we remove all the latex, so I am not sure we will still be able to use positionAt.

tecosaur commented 5 years ago

Great to see the work you've done, thanks for helping out with this!

Tweak if statement in main.ts

tecosaur commented 5 years ago

Furthermore to the NodeError: stdout maxBuffer length exceeded error, I think the way to solve that is by switching from using execFile to spawn.

PierreMarchand20 commented 5 years ago

Great!

I think I also got NodeError: stdout maxBuffer length exceeded, I guess this is because there are too many false positive with the latex environments. Once we will have solved the issue of removing all the latex, it should not appear again.

What async does ? It seems that it should be used with await usually. I agree the zcurrentlyRunning` idea!

tecosaur commented 5 years ago

I guess this is because there are too many false positive with the latex environments

This is just an issue with long outputs. Using spawn we should be able to handle much larger outputs. I'll take a look at that.

What async does?

Afaik this should just help with it being non-blocking

I agree the currentlyRunning idea!

Shouldn't be too hard to implement :smiley:

tecosaur commented 5 years ago

Switched to spawn and implemented currentlyRunning (as currentProcess) in 75143f05fd18c35a666f88a949529deb6b68a3f4

tecosaur commented 5 years ago

Minor update: I tested my big-ish tex file and it all worked. Also had a think about the data structure and arrays of removed content (one for lines, on for characters) seems like a relatively decent approach to me.

I'm getting a really good feeling about this!

tecosaur commented 5 years ago

Hmmm. Another thought, since these tools run on the whole file (and can take a few seconds), it could be nice to make use of a stateful model. I.e. change a few words in a paragraph, just run tools on modified paragraph instead of the full document. Perhaps we could just do this in-between saves with a timeout :thinking:

PierreMarchand20 commented 5 years ago

Perfect!

If we can create another file to parse with the linter, it should be not too difficult. But I was afraid it would be too expensive. And where could we create this file ?

I will try what I suggested then ^^

PierreMarchand20 commented 5 years ago

Hmmm. Another thought, since these tools run on the whole file (and can take a few seconds), it could be nice to make use of a stateful model. I.e. change a few words in a paragraph, just run tools on modified paragraph instead of the full document. Perhaps we could just do this in-between saves with a timeout 🤔

or we can define a command so that the user decides when he wants to call the linters?

PierreMarchand20 commented 5 years ago

It would be really interesting to run it only on modified parts, as you said, it takes time. But may be it would be difficult to do it in a first time directly.

tecosaur commented 5 years ago

The best bet is probably just to create a file in the tmp folder. See the use of tmpdir in tikzpreview.ts.

Too expensive

I think the contrary. We can easily run multiple tools on the same file, and maybe even make incremental changes

define a command so that the user decides when he wants to call the linters It would be really interesting to run it only on modified parts, as you said, it takes time. But maybe it would be difficult to do it in the first time directly.

I think the first time we just run on the whole file, there's not really much else you can do. Then as part of the linter-config we could have a setting like

"latex-utils.diagnoser.tools": {
  "languagetool": {
    "enabled": true,
    "args": [],
    "run": "onSave" | "onSaveAndIncremental" | "incrementally"
  }
}

or similar. Basicly I see there being three main modes:

  1. Lint entire file on save
  2. Lint entire file on save, and incrementally with timeout
  3. After initial whole-file run, run incrementally on save & timeout
PierreMarchand20 commented 5 years ago

Ok, I will look at tmpdir then.

What do you mean by incremental changes ? Anyway, sounds great to me

tecosaur commented 5 years ago

By incremental changes I mean

- this is is a test
+ this is a test

then just rerunning the linters on the changed line.

PierreMarchand20 commented 5 years ago

Ok I see.

What I meant is that may be we should first implement the linter on the whole file, removing the latex and so on. And then, when it will work, we can try to implement the incremental feature, what do you think ?

tecosaur commented 5 years ago

That's what I was thinking :smiley: I just think it's good to keep these things in mind when implementing to we don't unintentionally make our lives harder later on.

On that note, with the LaTeX to plaintext conversion, I think I'd be good to use a very modular/versatile approach for the parsing. I'd like to do some things such as

Given the existence of user-definable commands, I think it would be best to operate on a blacklist model. So I'd imagine the config would be split into two sections, commands and environments. Then for each item have a few properties that set ignoring/passthrough of some/all mandatory/optional arguments This would ideally be able to mix in / be replaced with a user config setting, any single JSON object should work.

From there the best thing to do would probably be just to compile a giant regex expression and run that against the document.

Those are my ideas, what do you think?

PierreMarchand20 commented 5 years ago

That's what I was thinking 😃 I just think it's good to keep these things in mind when implementing to we don't unintentionally make our lives harder later on.

I agree perfect!

On that note, with the LaTeX to plaintext conversion, I think I'd be good to use a very modular/versatile approach for the parsing. I'd like to do some things such as

Ignore arguments of specific commands — e.g. *ref{...} Ignore some environments, and their optional arguments — e.g. tikzpicture, minted Given the existence of user-definable commands, I think it would be best to operate on a blacklist model. So I'd imagine the config would be split into two sections, commands and environments. Then for each item have a few properties that set ignoring/passthrough of some/all mandatory/optional arguments This would ideally be able to mix in / be replaced with a user config setting, any single JSON object should work.

From there the best thing to do would probably be just to compile a giant regex expression and run that against the document.

I am not sure I understood everything.

In my mind I was just thinking of a list of commands and a list of environments as you say in a JSON files (so that a user can add his own). And then, for each command 'foo' we remove \foo[...]{...} and for each environment 'bar', we remove \begin{bar}...\end{bar}.

PierreMarchand20 commented 5 years ago

Then, there are some details of course (document environment, itemize/enumerate environment that we can keep, $...$, etc)

tecosaur commented 5 years ago

I feel as though I'm overthinking this I think (for commands) an array of regex is probably best.

Unfortunately due to nesting parenthesis (e.g. \graphicspath{{...}}) we can't use just regex :(

To clarify on my terminology, I think of latex macros like this:

tecosaur commented 5 years ago

why do you want to only ignore arguments?

By that I mean that we want to ignore the \command bit always, but sometimes we want to pass through an argument (e.g. \emph{...}), sometimes we want to ignore it (e.g. \tikzset{...}), then a similar thing with environments of course.

Not sure what the cleanest way of being able to represent differences is what we want done would be.

PierreMarchand20 commented 5 years ago

That is when we can look at what he has done in textidote ^^

PierreMarchand20 commented 5 years ago

Then, we can define a list of commands where we delete everything, a list of command where we keep the argument. And it is a similar for environments, but it is a little bit more difficult to keep the argument of an environment.

tecosaur commented 5 years ago

Perhaps then we just take the approach where, this may be a little heavy-handed, but it will be good for 99% of the document. Remove all optional arguments of environments, and then split commands into "keep arguments" and "remove arguments". That sounds reasonable to me.

tecosaur commented 5 years ago

@PierreMarchand20 how's it going?

PierreMarchand20 commented 5 years ago

I did not have the time to do it so far, may be this weekend.

PierreMarchand20 commented 5 years ago

Ok I looked a little bit, I think actually this easiest would be to get the text of the file in a string and call regex functions. I do not see the advantage of copy paste the text in an auxiliary file. What do you think ? We can still do every operations on a string and then create an auxiliary file to store the cleaned string if it is really necessary.

With this in mind I wanted to use the function matchall (https://javascript.info/regexp-methods) which seems really perfect for us, but I cannot manage to use it. I added the necessary lib as mentioned here but still, when I compile, it does find the function. Any ideas ?