mzlogin / vim-markdown-toc

A vim 7.4+ plugin to generate table of contents for Markdown files.
http://www.vim.org/scripts/script.php?script_id=5460
MIT License

Incorrect headings generated for my `heading-torture-test.md` file #87

Open eliminmax opened 8 months ago

eliminmax commented 8 months ago

Thank you for making such a great plugin.

I created a Markdown file (available here) designed to see how GitHub generates heading IDs in different cases, ranging from common ones (like headings containing non-`[a-z]` letters such as the German ß, Arabic ا, and Chinese 猫) to weird cases with numbers at the end of headings.

Several of the heading IDs generated by this plugin when I run `:GenTocGFM` in that file differ from the ones generated by GitHub.

Most of the issues had to do with headings with numbers at the end, though the Arabic ا was incorrectly deleted, as was a trailing underscore.
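The discrepancies described above (colliding IDs for headings that end in numbers, a dropped Arabic letter, a dropped trailing underscore) can be illustrated with a minimal Python sketch. It is not the plugin's code and not GitHub's exact algorithm: `base_slug` and `unique_slugs` are hypothetical names, the character filter is an approximation built on Python's Unicode-aware `\w`, and the de-duplication assumes a counter-based scheme in which every emitted ID is also registered as taken.

```python
import re


def base_slug(text):
    """Approximate a GitHub-style heading ID (assumed rules, not GitHub's exact regexp)."""
    text = text.lower()
    # Keep Unicode word characters (letters, digits, "_"), hyphens, and spaces;
    # drop everything else, such as punctuation and symbols.
    text = re.sub(r"[^\w\- ]", "", text)
    return text.replace(" ", "-")


def unique_slugs(headings):
    """De-duplicate IDs, registering each emitted ID so later headings cannot collide with it."""
    seen = {}      # ID -> number of suffixes already handed out for it
    result = []
    for heading in headings:
        base = base_slug(heading)
        slug = base
        while slug in seen:
            seen[base] += 1
            slug = f"{base}-{seen[base]}"
        seen[slug] = 0
        result.append(slug)
    return result


if __name__ == "__main__":
    for slug in unique_slugs([
        "Ending Number Trickery",
        "Ending Number Trickery",
        "Ending Number Trickery 1",
    ]):
        print(slug)
```

Under these assumed rules, the third heading becomes `ending-number-trickery-1-1` instead of reusing `ending-number-trickery-1`, and `base_slug("Trailing_underscore_")` keeps its final underscore because `_` counts as a word character.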

Click here to see what this plugin generates for my test file, with notes where it got it wrong.

```markdown
* [test.md](#testmd)
    * [Same Level Same Name](#same-level-same-name)
    * [Same Level Same Name](#same-level-same-name-1)
    * [Different Level Same Name](#different-level-same-name)
        * [Different Level Same Name](#different-level-same-name-1)
    * [Same Name Differing Caps](#same-name-differing-caps)
    * [SAME NAME DIFFERING CAPS](#same-name-differing-caps-1)
    * [same name differing caps](#same-name-differing-caps-2)
    * [Same Name( )different-Non-»letter° chars](#same-name---different-non-letter-chars)
    * [Same Name &^$ different Non letter chars](#same-name--different-non-letter-chars)
    * [Same Name but One Has Code](#same-name-but-one-has-code)
    * [Same Name `but` One `Has Code`](#same-name-but-one-has-code-1)
    * [Ending Number Trickery](#ending-number-trickery)
    * [Ending Number Trickery](#ending-number-trickery-1)
    * [Ending Number Trickery 1](#ending-number-trickery-1)
    * [Ending Number Trickery](#ending-number-trickery-2)
    * [Ending Number Trickery 2](#ending-number-trickery-2)
    * [Other Ending Number Trickery 1](#other-ending-number-trickery-1)
    * [Other Ending Number Trickery](#other-ending-number-trickery)
    * [Other Ending Number Trickery](#other-ending-number-trickery-1)
    * [Final Ending Number Trickery](#final-ending-number-trickery)
    * [Final Ending Number Trickery](#final-ending-number-trickery-1)
    * [Final Ending Number Trickery 1](#final-ending-number-trickery-1)
    * [Final Ending Number Trickery 1 1](#final-ending-number-trickery-1-1)
    * [Final Ending Number Trickery 1 1](#final-ending-number-trickery-1-1-1)
    * [Underscored_heading](#underscored_heading)
    * [Multiple__underscores](#multiple__underscores)
    * [\_Leading_underscore](#_leading_underscore)
    * [Trailing_underscore\_](#trailing_underscore)
    * [Heading with non-`[a-z]` letters like ß, ا, and 猫](#heading-with-non-a-z-letters-like-ß--and-猫)
    * [Heading with a Chinese punctuation mark (specifically '】')](#heading-with-a-chinese-punctuation-mark-specifically-)

# test.md

## Same Level Same Name

## Same Level Same Name

## Different Level Same Name

### Different Level Same Name

## Same Name Differing Caps

## SAME NAME DIFFERING CAPS

## same name differing caps

## Same Name( )different-Non-»letter° chars

## Same Name &^$ different Non letter chars

## Same Name but One Has Code

## Same Name `but` One `Has Code`

## Ending Number Trickery

## Ending Number Trickery

## Ending Number Trickery 1

## Ending Number Trickery

## Ending Number Trickery 2

## Other Ending Number Trickery 1

## Other Ending Number Trickery

## Other Ending Number Trickery

## Final Ending Number Trickery

## Final Ending Number Trickery

## Final Ending Number Trickery 1

## Final Ending Number Trickery 1 1

## Final Ending Number Trickery 1 1

## Underscored_heading

## Multiple__underscores

## \_Leading_underscore

## Trailing_underscore\_

## Heading with non-`[a-z]` letters like ß, ا, and 猫

## Heading with a Chinese punctuation mark (specifically '】')
```

mzlogin commented 8 months ago

Thanks for reporting. I may look at it tomorrow when I get some free time.

And if you can make a PR, feel free to submit it.

mzlogin commented 8 months ago

Your test cases are very useful. I'll try to fix the issues this weekend.

eliminmax commented 8 months ago

Thanks! I was working on an awk script to add the heading IDs to the output of cmark-gfm, and I wanted to make sure I handled them correctly. It turns out the regexp to match all invalid characters is very complex: in the regex dialect that GNU's awk implementation uses, it's nearly 10 thousand characters long. I found a GitHub repository which includes a computer-generated JavaScript regexp to match all invalid characters in heading names. Based on that, I wrote a Python script that generates a series of AWK gsub statements for my script, splitting the regexp into a bunch of smaller patterns. However, it requires the non-standard \uHH escape sequence added in the latest version of GNU awk, so it's not portable across awk versions, let alone vim. In case my script is still helpful, I've uploaded it as a gist here.
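The idea of generating several small gsub statements instead of one huge regexp can be sketched as follows. This is not the gist mentioned above: it swaps in Python's `unicodedata` categories (punctuation and symbols) as a stand-in for the computer-generated JavaScript regexp, the function names and the awk variable `id` are hypothetical, and the emitted `\u` escapes have not been checked against any awk implementation.

```python
import unicodedata


def invalid_ranges(start, stop):
    """Collect contiguous [lo, hi] code point ranges in start..stop-1 whose
    Unicode general category is punctuation (P*) or symbol (S*).
    Assumption: these categories only approximate the set of invalid characters."""
    ranges = []
    for cp in range(start, stop):
        if unicodedata.category(chr(cp))[0] in "PS":
            if ranges and ranges[-1][1] == cp - 1:
                ranges[-1][1] = cp          # extend the current run
            else:
                ranges.append([cp, cp])     # start a new run
    return ranges


def gsub_statements(ranges, per_statement=16, var="id"):
    """Emit one awk gsub statement per group of ranges so that each
    bracket expression stays short."""
    statements = []
    for i in range(0, len(ranges), per_statement):
        parts = []
        for lo, hi in ranges[i:i + per_statement]:
            if lo == hi:
                parts.append(f"\\u{lo:04X}")
            else:
                parts.append(f"\\u{lo:04X}-\\u{hi:04X}")
        char_class = "".join(parts)
        statements.append(f'gsub(/[{char_class}]/, "", {var})')
    return statements


if __name__ == "__main__":
    # Small demo range: the Latin-1 Supplement block.
    for statement in gsub_statements(invalid_ranges(0x80, 0x100), per_statement=4):
        print(statement)
```

Splitting by a fixed number of ranges per statement keeps every pattern small, at the cost of running more substitutions per heading.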

mzlogin commented 7 months ago

Please update the plugin to the newest version and try again; it should be able to handle your cases now. 🤝