LaTeX, Markdown, Wiki, RST and other markup support

kostyfisik commented 8 years ago

There is a nice note on integrating LT into a LaTeX workflow (http://wiki.languagetool.org/checking-la-tex-with-languagetool). However, LT itself lacks LaTeX support. For example, a document in any language written in LaTeX will probably contain LaTeX commands such as \documentclass, \usepackage, \begin{document} and so on. Large documents are usually split into separate files, but those still contain quite a few LaTeX commands, such as \chapter, \section, \begin{figure} and so on.

Initial support can be as simple as ignoring all command words, which start with \ (optionally together with the adjoining text in {} and [] brackets, i.e. the command arguments), and treating ~ (a non-breaking space) as an ordinary space for LT. This could be switched on as a rule option or auto-detected (typical LaTeX commands are well known). More advanced support would treat math expressions (including the $ marks) that contain no = as nouns. E.g. "Let $\mathbf{F}_i$ be the force, $m$ is the mass of the body, and $a$ is the body's acceleration. According to Newton's second law $\mathbf{F}_i=ma$ if the mass does not depend on the speed"
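A minimal sketch of the basic idea above, in Python purely for illustration (this is not LT code). The pattern and the name `strip_latex` are hypothetical; nested braces, verbatim environments and escaped characters such as `\~` are not handled.

```python
import re

# Hypothetical pattern: a command name (\word, optionally starred) followed by
# any number of [...] or {...} argument groups. Nested braces are not handled.
LATEX_COMMAND = re.compile(r"\\[A-Za-z]+\*?(?:\[[^\]]*\]|\{[^{}]*\})*")


def strip_latex(text):
    """Drop LaTeX commands with their arguments and turn ~ into a space."""
    text = LATEX_COMMAND.sub("", text)
    return text.replace("~", " ")


sample = r"\section{Results} As shown in~figure \ref{fig:force}, the force grows."
cleaned = strip_latex(sample)
```

Note that dropping the arguments also drops visible text such as section titles, and removing `\ref{...}` leaves a gap before the comma, so a real rule set would likely need per-command decisions.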

If this idea seems valid, please point me to the location of the language-independent rules, and I will try to provide a PR to improve LaTeX support.

danielnaber commented 8 years ago

We've decided that supporting any file format is out of scope for LT. LT only checks plain text, everything else is just too much work, too much support, and too much maintenance. Instead, the editors that people use should integrate LT and provide LT with the plain text.

kostyfisik commented 8 years ago

OK. Is it possible to pass part-of-speech tags to LT in plain text without passing the word itself? E.g. for math expressions without an equals sign it could be a good idea to treat them as generalized nouns (so that some rules can trigger on them).

kostyfisik commented 8 years ago

Math with = is usually read as "smth equals smth", so it should be "noun equals noun". It cannot be replaced directly this way in plain text for every language, because, for example, in Russian the word "noun" has a gender and would trigger false positives.

yakovru commented 8 years ago

Maybe simply improve LyX support?

danielnaber commented 8 years ago

You can pass https://languagetool.org/development/api/org/languagetool/markup/AnnotatedText.html to LT, so it knows what's markup and what's not, that's the closest we have.
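To illustrate what "knowing what's markup and what's not" could look like, here is a rough Python sketch that splits a LaTeX string into alternating text and markup parts. The `MARKUP` pattern and the `annotate` function are hypothetical. The `text` / `markup` / `interpretAs` keys follow the annotation format that, as far as I know, the LT HTTP API accepts in its `data` parameter; treat that as an assumption and check the API documentation before relying on it.

```python
import json
import re
from typing import Dict, List

# Hypothetical, deliberately tiny markup detector: LaTeX commands with their
# arguments, plus the non-breaking space "~".
MARKUP = re.compile(r"\\[A-Za-z]+\*?(?:\[[^\]]*\]|\{[^{}]*\})*|~")


def annotate(source: str) -> List[Dict[str, str]]:
    """Split source into alternating {"text": ...} and {"markup": ...} parts."""
    parts = []
    pos = 0
    for match in MARKUP.finditer(source):
        if match.start() > pos:
            parts.append({"text": source[pos:match.start()]})
        part = {"markup": match.group()}
        if match.group() == "~":
            # Assumption: the checker should see a plain space here.
            part["interpretAs"] = " "
        parts.append(part)
        pos = match.end()
    if pos < len(source):
        parts.append({"text": source[pos:]})
    return parts


annotation = annotate(r"See in~figure \ref{fig:a} for details.")
payload = json.dumps({"annotation": annotation})
```

Concatenating all `text` and `markup` values reproduces the original string, which is what would let error positions be mapped back to the source.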

kostyfisik commented 8 years ago

@danielnaber I see. The "on the editor side" approach seems OK; however, it looks like a lot of duplicated code needs to be written for each editor, even when they handle the same markup (LaTeX, Markdown, or Wiki). E.g. TeXstudio appears to have done this, but the LT plugin for Emacs has not.

Probably generic markup language support can be added to LT using the existing framework. Moreover, it looks like it is already there (with the AnnotatedText feature). The missing part is a grammar.xml-like file for each markup dialect with rules to detect and isolate the markup. These files can easily be maintained by users. Moreover, adding a new markup language to this detect_markup stage should be as easy as providing a new XML file with markup detection rules.

danielnaber commented 8 years ago

An editor specialized for LaTeX should already be capable of parsing LaTeX anyway. I'm not sure if using a grammar.xml-like file is a viable approach: wouldn't it mean trying to parse every file format with regular expressions? This seems quite tricky or even impossible.

kostyfisik commented 8 years ago

This issue is not about file formats (and yes, it is a bad idea to parse any file format with regex), and it is not about parsing only LaTeX. There is a bunch of markup languages (LaTeX, reStructuredText, Markdown, Org-mode, Wiki syntax, etc.), and each of them has a rather limited number of tags to mark sections, fonts, URLs and so on. These tags are put into plain text, are human-readable, and are usually typed by hand. They are really widespread: e.g. Markdown is used for GitHub wikis by default, and Wiki syntax can be found in any public or private knowledge base.

My first approach when I became aware of LT was just to copy text from my favorite editor to the LT website and check it. However, it is always a pain to separate real errors from fake ones that appear due to the presence of markup tags.

On the other hand, it is not rational for the LT core team to support every markup language. However, it looks like (I may be wrong) LT is internally already able to process plain text with markup in a user-friendly way, thanks to the AnnotatedText class. I suppose that external contributors can easily be found to provide XML rules for any viable markup (I will try to provide a LaTeX subset if that is possible). So the only missing part is a way to give LT the information about the exact markup subset.

From my point of view it can be a two-stage process: 1) use grammar-markup.xml to extract markup information from the initial text; 2) run LT as usual on the preprocessed text.

The first step is almost identical to the second one. A basic markup rule can be trivial: if a markup tag is found, remove it. This way it is an ordinary LT run with automatic application of all possible corrections.

An advanced option, useful for LaTeX and Wiki math mode, is to treat a formula as a noun or as the trigram "noun equals noun" (depending on whether there is a = sign or not). This may require some tweaking of LT internals, but it is not needed for basic markup support.
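The formula-as-noun idea can be approximated on the plain-text side with a simple substitution. This is only a sketch: the placeholder `X`, the verb `equals` and the function name `replace_math` are assumptions chosen for illustration, and a suitable placeholder would have to be picked per language. Only inline `$...$` math is handled.

```python
import re

# Inline math delimited by single $ signs (display math and \( \) not handled).
INLINE_MATH = re.compile(r"\$([^$]+)\$")


def replace_math(text, noun="X", verb="equals"):
    """Replace $...$ by a noun, or by "noun verb noun" if it contains '='."""
    def substitute(match):
        if "=" in match.group(1):
            return f"{noun} {verb} {noun}"
        return noun
    return INLINE_MATH.sub(substitute, text)


sentence = r"According to Newton's second law $\mathbf{F}_i=ma$, where $m$ is the mass."
checked = replace_math(sentence)
```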

danielnaber commented 8 years ago

Maybe a sustainable approach would be to make use of syntax highlighting code that already exists for all text-based file formats.

kostyfisik commented 8 years ago

I am not sure, for a number of reasons. First of all, there are too many types of markup; take a look at https://en.wikipedia.org/wiki/Comparison_of_document_markup_languages So it can be hard to find highlighting source code for each of them. Moreover, Markdown for example has a number of dialects (e.g. CommonMark and GitHub Flavored Markdown), the TeX family has LaTeX and ConTeXt, and so on. So, once again, it doesn't look rational for the LT core team to support every markup language.

The second point is that there is no need for LT to provide all the advanced features of editors, which usually use different colors for different types of text. E.g. plain HTML markup can mostly be cleaned up with a single regex: you just need to treat any text in angle brackets as markup. Markdown markup is mostly tied to the beginning of the line; the commonly used markup probably has just a dozen options, so they are easy to remember and even easier to detect with regex.
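As a rough illustration of these two claims, here is a Python sketch with one regex for HTML tags and one for a handful of Markdown line-start markers. Both patterns are assumptions, not a complete description of either language; for instance, the HTML pattern would also swallow prose like "a < b and c > d".

```python
import re

# Anything between angle brackets is treated as markup.
HTML_TAG = re.compile(r"<[^>]+>")

# A few common line-start markers: ATX headings, block quotes, bullets,
# and numbered list items.
MD_LINE_START = re.compile(
    r"^[ \t]*(?:#{1,6}[ \t]+|>[ \t]?|[-*+][ \t]+|\d+[.)][ \t]+)",
    re.MULTILINE,
)


def strip_html(text):
    return HTML_TAG.sub("", text)


def strip_markdown_prefixes(text):
    return MD_LINE_START.sub("", text)


html_clean = strip_html("<p>Hello <b>world</b></p>")
md_clean = strip_markdown_prefixes("# Title\n- item one\n> quoted\n1. first")
```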

The third point is that regular expressions are a well-known standard, so it is easy to contribute this way. Moreover, LT has a lot of power for processing text with regex, so it does not look too hard (I may be wrong here, but it really seems quite logical).

So my feature request should not be too hard or time consuming. On the website this can be a third drop-down list named "Markup" after "English" and "American".

These rule sets can also easily be used for autodetection of markup: just sort them by the number of rules triggered, trying the most popular grammar-markup.xml files first.
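A sketch of such autodetection, assuming each markup dialect comes with a list of regex rules. The `MARKUP_RULES` table below is a hypothetical stand-in for the proposed grammar-markup.xml files, with only a few rules per dialect.

```python
import re

# Hypothetical per-dialect rule sets; a real grammar-markup.xml would hold
# many more rules.
MARKUP_RULES = {
    "latex": [r"\\[A-Za-z]+\{", r"\\begin\{[^}]+\}", r"\$[^$]+\$"],
    "markdown": [r"^#{1,6} ", r"\[[^\]]+\]\([^)]+\)", r"^[-*+] "],
    "html": [r"</?[A-Za-z][^>]*>"],
}


def detect_markup(text):
    """Return (dialect, hits) pairs sorted by the number of rule matches."""
    scores = {
        name: sum(len(re.findall(rule, text, re.MULTILINE)) for rule in rules)
        for name, rules in MARKUP_RULES.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = detect_markup(r"\section{Intro} See \ref{fig:a} and $m$.")
```

The first entry of `ranking` is the best guess; a tie or an all-zero score would mean the text is probably plain.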

kostyfisik commented 7 years ago

After moving to LT 3.6, the integration with TeXstudio broke (the same is probably true for many other plugins and editors that missed the switch to the new API). So native integration with LaTeX (and probably other markup) is still a problem. I tried opening a native LaTeX file in LT standalone and found that it is quite usable now. There is still a problem with the non-breaking space, so e.g. "in~figure" is treated as one token.

Is it possible to make an XML rule (switched off by default) that treats the ~ symbol as whitespace? It could also be a Java rule; the main idea is that it should be easy for the user to switch on and off. This rule could easily be extended with an ignore list of other LaTeX keywords if needed.

martinvonwittich commented 7 years ago

Regarding RST as used by Sphinx: we just compile our markup to XML (sphinx-build -b xml) and then use a custom Perl script that processes this XML to plaintext that we can feed into LT. Sphinx can compile RST to plain text on its own, but that text output is meant for humans and contains unsuitable stuff like ASCII art tables. Parsing the XML ourselves has the huge advantage that we have very fine control on how the text is assembled before it's checked by LT; for example we replace images with their alt text, and we skip all literals so that LT doesn't try to check technical stuff like file names, console commands or console command output.
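The original script is in Perl and is not shown here; the following is only a Python sketch of the same idea. It assumes the XML uses docutils-style element names (`title`, `paragraph`, `literal`, `literal_block`, `image` with an `alt` attribute); the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

SKIPPED = {"literal", "literal_block"}   # technical content LT should not see
BLOCKS = {"title", "paragraph", "image"}  # units emitted as separate blocks


def element_text(elem):
    """Collect checkable text from elem, recursing into its children."""
    if elem.tag in SKIPPED:
        return ""
    if elem.tag == "image":
        return elem.get("alt", "")  # replace the image by its alt text
    pieces = [elem.text or ""]
    for child in elem:
        pieces.append(element_text(child))
        pieces.append(child.tail or "")
    return "".join(pieces)


def collect_blocks(elem, blocks):
    """Walk the tree and append the text of each block-level unit."""
    if elem.tag in SKIPPED:
        return
    if elem.tag in BLOCKS:
        blocks.append(element_text(elem))
        return
    for child in elem:
        collect_blocks(child, blocks)


def xml_to_plaintext(xml_source):
    blocks = []
    collect_blocks(ET.fromstring(xml_source), blocks)
    # Normalize whitespace inside each block and drop empty ones.
    return "\n\n".join(" ".join(b.split()) for b in blocks if b.strip())


sample = """<document><section><title>Setup</title>
<paragraph>Run <literal>make install</literal> and check the
<image alt="status icon" uri="s.png"/> indicator.</paragraph>
<literal_block>$ make install</literal_block></section></document>"""
plain = xml_to_plaintext(sample)
```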

kostyfisik commented 7 years ago

@martinvonwittich How do you push the changes after LT correction back to RST markup source?

martinvonwittich commented 7 years ago

@kostyfisik we're not using LT to correct anything automatically, just for checking, so everything is changed manually in the RST source.

kostyfisik commented 7 years ago

@martinvonwittich I see, thanks!

BTW, TeXstudio fixed its LT integration; moreover, they now have built-in support for "~" as a non-breaking space!

felipesere commented 6 years ago

What if LT took line ranges to be ignored as a command-line argument? I am using LT in vim while editing blog posts and I just need it to ignore code. I am pretty sure I could teach vim to look for {{ < highlight ... > }} and {{ < / highlight >}} and translate that into line numbers that should be ignored by LT.
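The translation from shortcodes to line numbers could look roughly like the Python sketch below. The patterns are assumptions that tolerate extra spaces inside the braces, `ignored_line_ranges` is a hypothetical name, and blocks that open and close on the same line or are never closed are not handled. How the editor would then use the ranges is left open.

```python
import re

# Hugo-style shortcode delimiters, tolerant of extra spaces inside the braces.
OPEN = re.compile(r"\{\{\s*<\s*highlight\b.*?>\s*\}\}")
CLOSE = re.compile(r"\{\{\s*<\s*/\s*highlight\s*>\s*\}\}")


def ignored_line_ranges(text):
    """Return 1-based, inclusive (start, end) line ranges of highlight blocks."""
    ranges = []
    start = None
    for number, line in enumerate(text.splitlines(), start=1):
        if start is None and OPEN.search(line):
            start = number
        elif start is not None and CLOSE.search(line):
            ranges.append((start, number))
            start = None
    return ranges


post = "Intro\n{{< highlight go >}}\nfmt.Println()\n{{< /highlight >}}\nOutro"
ranges = ignored_line_ranges(post)
```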

danielnaber commented 6 years ago

"What if LT took line ranges to be ignored as a command-line argument?"

LT doesn't internally deal with line numbers. Actually, I'm not sure why this issue is still open: we don't have the resources to support file formats, and the feature to help external editors ignore markup is already there (see above). So I'll close this issue. If there's a specific thing LT can do, feel free to open a new issue.

oblitum commented 4 years ago

Seems textidote is the proper wrapper to handle this for LaTeX and Markdown.

languagetool-org / languagetool

LaTeX, Markdown, Wiki, RST and other markup support #445