markdown-toc support for accented characters broken

redochka commented 5 years ago

markdown-toc is generating :

  * [Considération](#consideration)

instead of

  * [Considération](#considération)

for this markdown:

## Considération

AeonFr commented 5 years ago

This is an inconsistency between how the implementation of the markdown slugify function works between this library and whatever library the user is trying to use.

For the time being, this little accent fix would be a great improvement since it would add compatibility with two very common markdown implementations:

marked, possibly the most used markdown parser.
remark-autolink-headers is a remark-based library that adds links to headers automatically. It also keeps accented characters in the ids. (It depends on remark-slug to generate the slugs, which in its turn uses github-slugger).
markdown-it: If you are using markdown-it, you are probably also using markdown-it-anchor and markdown-it-table-of-contents, so any implementation used by this library shouldn't be a big issue. I tried to test their implementation but couldn't get it to work due to an open issue with the markdown-it-anchor library :(

For reference, this is the sluggin function in marked:

Slugger.prototype.slug = function(value) {
  var slug = value
    .toLowerCase()
    .trim()
    .replace(/[\u2000-\u206F\u2E00-\u2E7F\\'!"#$%&()*+,./:;<=>?@[\]^`{|}~]/g, '')
    .replace(/\s/g, '-');

  if (this.seen.hasOwnProperty(slug)) {
    var originalSlug = slug;
    do {
      this.seen[originalSlug]++;
      slug = originalSlug + '-' + this.seen[originalSlug];
    } while (this.seen.hasOwnProperty(slug));
  }
  this.seen[slug] = 0;

  return slug;
};

I think implementing this function would be a great start.

Feature request: Maybe if in the future some inconsistencies between sluggify implementations arrive, we could delegate the responsibility to create the headings to this library instead of depending on another library, maybe as an optional function.

Edit: It looks that @mootari wrote a script to support that very same feature already in an issue of this same repo, although I'm not sure how to integrate it with the current code (I'm very newby at node)

mootari commented 5 years ago

@AeonFr The snippet you linked to was part of an assemble pipeline. You can find the full context here: https://github.com/mootari/Documentation/blob/aae6df5755fffa4b4f48caed53dfa7b8acf24270/generator/lib/plugins/render-markdown/index.js#L76

Be aware that markdown-toc already has an option to specify a custom slugify callback.

AeonFr commented 5 years ago

Thanks very much, actually it makes perfect sense to be able to use your own slugify function.

I leave my code for further reference:

function slugify(value) {
    return value
        .toLowerCase()
        .trim()
        .replace(/[\u2000-\u206F\u2E00-\u2E7F\\'!"#$%&()*+,./:;<=>?@[\]^`{|}~]/g, '')
        .replace(/\s/g, '-');
}

toc('# Some markdown', { slugify });

You may notice the accented characters are rendered "URL encoded", but Chrome is not making any problem with that and it works nevertheless

Generated by markdown-toc:
li><a href="#c%C3%B3mo-hacer-un-back-up-del-sitio-y-recuperar-una-versi%C3%B3n-antigua-del-sitio">Cómo hacer un back-up del sitio y recuperar una versión antigua del sitio?</a></li>

Generated by marked:
<h3 id="cómo-hacer-un-back-up-del-sitio-y-recuperar-una-versión-antigua-del-sitio">Cómo hacer un back-up del sitio y recuperar una versión antigua del sitio?</h3>

Again, this just works and it's a clean solution since it doesn't depend on the markdown implementation. I still believe it would be best if the library supported accents by default but I'm still glad I got this working.

Thanks

mootari commented 5 years ago

I still believe it would be best if the library supported accents by default but I'm still glad I got this working.

In my experience it is rather unusual for slugs to contain anything but unreserved characters. Let me also point you to https://en.wikipedia.org/wiki/Clean_URL#Slug.

jonschlinkert / markdown-toc

markdown-toc support for accented characters broken #145