lervag / vimtex

VimTeX: A modern Vim and neovim filetype plugin for LaTeX files.
MIT License
5.54k stars 390 forks source link

Respect file encoding with vlty gramma check #1829

Closed petRUShka closed 4 years ago

petRUShka commented 4 years ago

Issue Suppose we have tex-file in non utf-8 encoding and want to check grammar. vimtex together with vlty don't respect fileencoding so I get an error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 340: invalid start byte

minimal.vim

  set nocompatible
  let &runtimepath  = '~/.vim/bundle/vimtex,' . &runtimepath
  let &runtimepath .= ',~/.vim/bundle/vimtex/after'
  filetype plugin indent on
  syntax enable

  let g:vimtex_grammar_vlty = {}
  let g:vimtex_grammar_vlty.lt_directory = '/usr/share/java/languagetool/'
  let g:vimtex_grammar_vlty.lt_command = 'languagetool'
  let g:vimtex_grammar_vlty.encoding = 'cp866'

  if has('nvim')
      let g:vimtex_compiler_progname = 'nvr'
  endif

  au Filetype tex
              \ set spell |
              \ set spelllang=ru,en

minimal.tex

    \documentclass{article}
    \usepackage[cp866]{inputenc}
    \usepackage[russian]{babel}

    \begin{document}
    \selectlanguage{russian}

    Случайный текст на русском

    \end{document}

don't forget to convert file to cp866 encoding

iconv -t cp866 test.tex > test_cp866.tex

Commands/Input Setup vim and text_cp866.tex like above. Run vim on test_cp866.tex and execute:

:compiler vlty
:make

Observed Behaviour

:!python3 -m yalafi.shell --lt-directory /usr/share/java/languagetool/ --language ru --disable "WHITESPACE_RULE" --enable "" --disablecategories "" --enablecat
egories "" --documentclass "" --packages "amsthm,babel,enumerate,inputenc"  'test_cp866.tex'  2>&1| tee /tmp/vijaI15/129
=== test.tex
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/lib/python3.8/site-packages/yalafi/shell/__main__.py", line 3, in <module>
    from . import shell
  File "/usr/lib/python3.8/site-packages/yalafi/shell/shell.py", line 351, in <module>
    gentext.generate_text_report(proofreader.run_proofreader, sys.stdout)
  File "/usr/lib/python3.8/site-packages/yalafi/shell/gentext.py", line 89, in generate_text_report
    (tex, plain, charmap, matches) = run_proofreader(file)
  File "/usr/lib/python3.8/site-packages/yalafi/shell/proofreader.py", line 47, in run_proofreader
    tex = f.read()
  File "/usr/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x91 in position 132: invalid start byte   

Expected Behaviour

But yalafi.shell has an option --encoding which works well. So it is sufficient to check on fileencoding vim setting and send its value to yalafi via --encoding option. And\or accept

  let g:vimtex_grammar_vlty.encoding = 'cp866'

Output from VimtexInfo

System info
  OS: Arch Linux
  Vim version: VIM 8.2 (1-1704)
  Has clientserver: true
  Servername: VIM1

vimtex project: vimtex_test
  base: test_cp866.tex
  root: /home/user/vimtex_test
  tex: /home/user/vimtex_test/quintic_sextic_sqrt_periodic_maik.tex
  out: /home/user/vimtex_test/build//test_cp866.pdf
  log: /home/user/vimtex_test/build//test_cp866.log
  aux: /home/user/vimtex_test/build//test_cp866.aux
  fls: /home/user/vimtex_test/build//test_cp866.fls
  main parser: current file verified
  compiler: latexmk
    configuration: 
      continuous: 1
      callback: 1
      build_dir: build/
      latexmk options:
        -verbose
        -file-line-error
        -synctex=1
        -interaction=nonstopmode
      latexmk engine: -pdf
  viewer: Zathura
    xwin id: 98566145
    process: 
      pid: -
      cmd: zathura -x "/usr/bin/vim --servername VIM1 --remote-expr \"vimtex#view#reverse_goto(%{line}, '%{input}')\"" --synctex-forward 1:1:'/home/user/vimtex_test/test_cp866.tex'  '/home/user/vimtex_test/build//test_cp866.pdf' >/dev/null 2>&1 &
  qf: LaTeX logfile
    config: 
      overfull: 1
      general: 1
      packages: 
        biblatex: 1
        babel: 1
        default: 1
        hyperref: 1
        natbib: 1
        scrreprt: 1
        fixltx2e: 1
        titlesec: 1
      default: 1
      underfull: 1
  document class: 
  packages:
    amsbsy
    amsfonts
    amsgen
    amsmath
    amsopn
    amssymb
    amstext
    amsthm
    array
    babel
    bm
    calc
    caption2
    caption3
    enumerate
    epstopdf-base
    eufrak
    extsizes
    fontenc
    graphics
    graphicx
    ifthen
    inputenc
    keyval
    longtable
    mathtext
    natbib
    revsymb
    trig
    url
matze-dd commented 4 years ago

Thanks for reporting this. I'll have a look at it.

matze-dd commented 4 years ago

It seems that you can simply add to vimrc

let g:vimtex_grammar_vlty.shell_options = '--encoding cp866'

With your minimal.tex, I then only get a message from \selectlanguage{russian}, since this macro is still missing in yalafi. So, the filter just includes 'russian', and LanguageTool complains about the sentence starting lower-case. Furthermore, if I add two full stops at the end of the Russian sentence, then this is also correctly reported by LanguageTool.

Unfortunately, I cannot fully test your example, as my terminal is UTF-8 encoded, so that Vim does not correctly display the Russian sentence in the converted file test_cp866.tex.

Remark: If possible, add a country code to spelllang, e.g., ru_RU.

EDIT: If the first solution from above does work, we can decide if we read Vim's variable fileencoding and pass it to yalafi by default.

lervag commented 4 years ago

Thanks for reporting this. I'll have a look at it.

And thanks from me for responding!

My gut feeling is that it might not be so easy to directly translate between Vim settings for the encoding and the desired input for the --encoding option of Yalafi. But you're the expert here, so I think I'll let you be the judge of that. In any case, it does not seem so bad to add the encoding key as a possibility for the vlty config dictionary.

petRUShka commented 4 years ago

@matze-dd with shell_option workaround it is working fine but with some warnings:

Expected text language: Russian
Working on STDIN...
=== test.tex ===
1.) Line 6, column 17, Rule ID: UPPERCASE_SENTENCE_START
Message: Это предложение не начинается с заглавной буквы
Suggestion: Russian
 russian  Случайный текст на русском  
 ^^^^^^^

*** yalafi.shell: warning:
*** could not load module 'yalafi.packages.inputenc'
*** yalafi.shell: warning:
*** could not load module 'yalafi.packages.babel'
*** yalafi.shell: warning:
*** could not load module 'yalafi.packages.inputenc'
*** yalafi.shell: warning:
*** could not load module 'yalafi.packages.babel'

But having this option for whole vim installation is good but still it is a huge inconvenience in general because some files are in UTF-8 encoding and some files are in different ones.

Remark: If possible, add a country code to spelllang, e.g., ru_RU.

You mean set spellang = ru_RU?

matze-dd commented 4 years ago

@petRUShka

@matze-dd with shell_option workaround it is working fine but with some warnings ...

Great. I will add the commonly used packages babel and inputenc to the TODO list of yalafi. After fixing, this will avoid the warnings and the problem with \selectlanguage{russian}. Assume you have a mistake in the Russian sentence, for instance Случайный текст на руусском, and the cursor is on the corresponding message line in the quickfix window. If you hit Enter, does the cursor now jump to the beginning of руусском in the text window? I'm asking that because - at least in my system - there seems to be a problem with cursor location on lines containing non-ASCII characters (for instance, UTF-8 encoded German "umlauts" or left and right double quotes). In the quickfix window message, the correct (virtual) column number of the problem is given, but jumping there sometimes does not work properly.

But having this option for whole vim installation is good but still it is a huge inconvenience in general because some files are in UTF-8 encoding and some files are in different ones.

I understand. We will have to check whether reading the fileencoding variable works. Looking at @lervag's comments above, we might encounter problems. I will try to sort that out.

You mean set spellang = ru_RU?

Yes, at least for English texts, spell check is only done by LanguageTool, if a country code is given. But perhaps, this is not necessary for Russian. In your minimal.tex, you write ... set spelllang=ru,en .... Please be aware that vlty currently only recognises the first language specification from variable spelllang.

A last question. As mentioned above, my Vim does not correctly display the Russian sentence in file test_cp866.tex. Do you know what to do in order to fix that? My default encoding (terminal window and so on) is UTF-8. On :set fileencoding, I get fileencoding=latin1. This probably means that Vim cannot figure out the encoding, and I cannot find a command-line option to specify it. Modifying fileencoding in the opened file does not help, either.

matze-dd commented 4 years ago

@lervag

And thanks from me for responding!

As this is your project, I hope it wasn't impolite.

My gut feeling is that it might not be so easy to directly translate between Vim settings for the encoding and the desired input for the --encoding option of Yalafi.

I'll try to have a look. To get it work, at least fileencoding should be local to the buffer, and the code names should coincide between Vim and Python.

... it does not seem so bad to add the encoding key as a possibility for the vlty config dictionary.

Agreed. But I also understand the wish of @petRUShka.

petRUShka commented 4 years ago

@matze-dd, @lervag thanks for your work!

Encoding

I'm using following vim-code to autodetect encoding of file:

if has("multi_byte")
    if &termencoding == ""
        let &termencoding = &encoding
    endif
    set encoding=utf-8
    setglobal fileencoding=utf-8 bomb
    set fileencodings=utf-8,cp1251,cp866,koi8-r,latin1
    set fileformats=unix,mac,dos
endif

In case of cp866 files the fileencoding is automatically set to fileencoding=cp866.

Positioning

Positioning is bad in Cyrillic case. So in case of English sentence it is at the right place and in case of Russian sentence it seems to be on half way to right place (26 instead of 51 by vim count).

2020-10-15-145302_905x808_scrot

matze-dd commented 4 years ago

@petRUShka Thank you! With the following version, I now can correctly see the Russian sentence, coded in cp866:

...
    set fileencodings=utf-8,cp866,koi8-r,latin1
...

The positioning problem seems to be a general one:

@lervag (this might well be a separate Issue) Did you already see this positioning problem somewhere? I have a similar problem with German 'umlauts', for instance with the (English UTF-8) text

This is ä testx.

Here, the ä is ignored by LanguageTool, and the mistake testx is correctly reported at (virtual) column 11 in the quickfix window. Hitting Enter, however, moves the cursor to column 10 (between ä and testx) in the text window. In errorformat the column number has been passed as virtual column number with %v, but somehow this is not respected when jumping from the quickfix to the text window. The number now is apparently taken as (byte-oriented) column.

In the past, I have already browsed a bit the Web, but could not find any related report. For texts where almost all characters are non-ASCII, as in the ones from @petRUShka, this becomes quite visible. Column numbers and virtual column numbers almost never are the same.

If we cannot find a solution within Vim, I could rather simply add an output format version to yalafi.shell that reports the (byte oriented) column number. We could read it with %c into the errorformat, and then jumping to the error should be correct.

lervag commented 4 years ago

And thanks from me for responding!

As this is your project, I hope it wasn't impolite.

No, I'm just glad to have more people pitch in and help! :)

... it does not seem so bad to add the encoding key as a possibility for the vlty config dictionary.

Agreed. But I also understand the wish of @petRUShka.

Yes, I agree, this is a good point. I don't see immediately how to do this, though, except perhaps to somehow parse the \selectlanguage{...} command or options to the \usepackage{babel} to set the language for a given project.

Here, the ä is ignored by LanguageTool, and the mistake testx is correctly reported at (virtual) column 11 in the quickfix window. Hitting Enter, however, moves the cursor to column 10 (between ä and testx) in the text window. In errorformat the column number has been passed as virtual column number with %v, but somehow this is not respected when jumping from the quickfix to the text window. The number now is apparently taken as (byte-oriented) column.

Ah, I think %v is not what you think. Or, well, at least no fully. If you add a tab, the %v will work as intended. So, %v is not for handling multibyte characters, but for handling characters that are printed in multiple columns. I.e., sort of the opposite of what you want here.

matze-dd commented 4 years ago

Yes, I agree, this is a good point. I don't see immediately how to do this, though, except perhaps to somehow parse the \selectlanguage{...} command or options to the \usepackage{babel} to set the language for a given project.

If I understand @petRUShka right, only the file encoding should be detected automatically. It seems that variable fileencoding does give the right information. For instance, assume you open two files in a split window, one is UTF-8 encoded, and the other one Latin-1. Then saying

:echo &l:fileencoding

gives the correct result for both files, depending on the cursor position.

My proposal would be to add a key encoding to the vlty config dictionary, with default value 'utf-8'. The user can set it to another specification, and to '*' (or 'auto'?) for using the encoding indicated by variable fileencoding. I could propose a PR.

Ah, I think %v is not what you think. Or, well, at least no fully. If you add a tab, the %v will work as intended. So, %v is not for handling multibyte characters, but for handling characters that are printed in multiple columns. I.e., sort of the opposite of what you want here.

Looking at the Vim repository, and for instance here, it appears to me that this has been extended to multi-byte characters. In fact, it does work, if the message scanned by errorformat is a single-line message (this is also what they check in their tests). I'll open a new Issue, as this is a side-issue to the encoding problem.

petRUShka commented 4 years ago

@matze-dd, I see it exactly as you described. But may be it would be better to have auto by default. Why anyone should have utf-8 if he has opened file in different encoding and vim knows about it?

lervag commented 4 years ago

If I understand @petRUShka right, only the file encoding should be detected automatically. It seems that variable fileencoding does give the right information.

Yes, sorry, I was not thinking straight.

My proposal would be to add a key encoding to the vlty config dictionary, with default value 'utf-8'. The user can set it to another specification, and to '*' (or 'auto'?) for using the encoding indicated by variable fileencoding. I could propose a PR.

I think this sounds reasonable. I think the "auto" setting will be useful here, and as @petRUShka I think it might be the sensible default.

I'll be very happy to comment on a PR for this!

matze-dd commented 4 years ago

Yes, sorry, I was not thinking straight.

No problem. This is often my habit, too :) But sometimes, one unexpectedly has a good idea when thinking weirdly.

I think this sounds reasonable. I think the "auto" setting will be useful here, and as @petRUShka I think it might be the sensible default. I'll be very happy to comment on a PR for this!

OK, I'll try it next week, ребята!

lervag commented 4 years ago

Haha, yes, I agree some ideas come from "curved" thinking :)

Sounds good, looking forward to seeing a PR!

lervag commented 4 years ago

@matze-dd has pushed a PR that I just merged. Does this solve your issue, @petRUShka?

petRUShka commented 4 years ago

On my config works great:

:!python3 -m yalafi.shell --lt-command languagetool --encoding cp866 --language ru --disable "WHITESPACE_RULE" --enable "" --disablecategories "" --enablecateg
ories "" --documentclass "" --packages "amsbsy,keyval,fontenc,enumerate,eufrak,calc,inputenc,url,amsopn,babel,bm,amssymb,epstopdf-base,trig,amsmath,revsymb,ams
thm,array,longtable,caption2,natbib,amstext,mathtext,graphics,amsgen,graphicx,ifthen,caption3,extsizes,amsfonts"  'file_cp866.tex'  2>&1
| tee /tmp/vz3DWXB/15   

P.S. Only vlty config:

let g:vimtex_grammar_vlty = {'lt_command': 'languagetool'}
lervag commented 4 years ago

Great, I'm happy to hear it! Thanks to @matze-dd for implementing the feature!