wget closed this 7 years ago
We can find the GFM transliteration rules here: https://github.com/jch/html-pipeline/blob/master/lib/html/pipeline/toc_filter.rb. So a better way to resolve this once and for all is to code an identical implementation in VimScript.
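To make the rules concrete, here is a rough Python sketch of how I read that filter: lowercase the heading, strip everything that is not a word character, hyphen or space, turn spaces into hyphens, and add a numeric suffix for repeated headings. The function name `gfm_anchor` and the `seen` dictionary are hypothetical, and Python's `\w` only approximates Ruby's `\p{Word}` (it does not cover combining marks, for instance), so this should be checked against the Ruby source before being ported to VimScript.

```python
import re

# Assumption: Python's Unicode-aware \w stands in for Ruby's \p{Word};
# the two classes are close but not identical.
PUNCTUATION = re.compile(r"[^\w\- ]")


def gfm_anchor(heading, seen=None):
    """Return a GFM-style anchor for a heading.

    `seen` is an optional dict used to count anchors already produced,
    so that repeated headings get a "-1", "-2", ... suffix.
    """
    slug = heading.lower()            # lowercase the heading text
    slug = PUNCTUATION.sub("", slug)  # drop punctuation
    slug = slug.replace(" ", "-")     # spaces become hyphens
    if seen is not None:
        count = seen.get(slug, 0)
        seen[slug] = count + 1
        if count:
            slug = f"{slug}-{count}"  # disambiguate repeated headings
    return slug


if __name__ == "__main__":
    seen = {}
    for title in ["Hello, World!", "élément français", "Intro", "Intro"]:
        print(gfm_anchor(title, seen))
```

Accented and CJK letters pass through unchanged here, which is the behaviour the restrictive regex was breaking.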
Thanks for your PR, could you add some test cases to help me understand your change?
Sure.
But indeed, the idea is to be able to fix the problem once and for all.
The main idea is to get the character classes for all Unicode characters, like what we have in Ruby and other languages. Since the transliteration rules you mentioned are based on Ruby, it's best to use the same algorithm.
To have character classes in Vim, we need to download the Unicode files and parse them. An existing Vim extension already does this, but only partially. I have been trying to refine it by adding a partial implementation of the character classes, but with the refinement and the check against each glyph, the parsing is really TOO slow to run in VimScript.
So the idea was to generate the Unicode table mapping in another language like Python.
That way, vim-markdown-toc can benefit from it, and even unicode.vim as well. We could even modify unicode.vim and remove the download of the files from Vim (it's not its job).
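As a rough idea of what that offline generation could look like, here is a minimal Python sketch. It uses the standard `unicodedata` module instead of downloading the Unicode files, collects the code points whose general category counts as a "word" character, and prints them as a Vim regex collection. The names `WORD_CATEGORIES`, `word_ranges` and `vim_collection` are hypothetical, and the category list is my assumption of what Ruby's `\p{Word}` roughly covers, not something taken from either plugin.

```python
import unicodedata

# Assumption: letters, marks, decimal digits and connector punctuation
# approximate what Ruby's \p{Word} matches.
WORD_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me", "Nd", "Pc"}


def word_ranges(start=0, stop=0x110000):
    """Return inclusive (first, last) code point ranges of word characters."""
    ranges = []
    run_start = None
    for cp in range(start, stop):
        is_word = unicodedata.category(chr(cp)) in WORD_CATEGORIES
        if is_word and run_start is None:
            run_start = cp                        # a new run begins
        elif not is_word and run_start is not None:
            ranges.append((run_start, cp - 1))    # the run ended just before cp
            run_start = None
    if run_start is not None:
        ranges.append((run_start, stop - 1))      # run reaches the end of the scan
    return ranges


def _vim_escape(cp):
    # Vim collections accept \uXXXX up to 0xffff and \UXXXXXXXX above that.
    return f"\\u{cp:04x}" if cp <= 0xFFFF else f"\\U{cp:08x}"


def vim_collection(ranges):
    """Format the ranges as a Vim regex collection such as [\\u0030-\\u0039]."""
    parts = []
    for first, last in ranges:
        if first == last:
            parts.append(_vim_escape(first))
        else:
            parts.append(f"{_vim_escape(first)}-{_vim_escape(last)}")
    return "[" + "".join(parts) + "]"


if __name__ == "__main__":
    # Only the Latin-1 range is scanned here to keep the output short.
    print(vim_collection(word_ranges(0x00, 0x100)))
```

The generated collection could then be pasted into the plugin as a constant, so Vim never has to parse the Unicode data itself.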
In the meantime, I rebased and created a test as requested.
Your idea is great! What can I do to help, and how is your progress so far?
p.s. I'll merge your PR #30. We can continue discussing your idea in this one.
GitHub generates links with accented characters. Your regex was a bit too restrictive in this regard, breaking my links in the process.
French-speaking users of your plugin will appreciate it.
For sure, other character classes should be added to this regex, but we do not know how the GitHub GFM parser creates its transliteration rules (unless we do? I don't think their code is FOSS though).
For people who do not want to merge this: the already merged character class \u4e00-\u9fbf contains Asian characters, which per RFC 1738 are not standard either, since URLs should only contain ASCII characters (unless URL-encoded).
Regards,