showdownjs / showdown

A bidirectional Markdown to HTML to Markdown converter written in Javascript
http://www.showdownjs.com/
MIT License
14.26k stars, 1.56k forks

Encoded emoji in header anchor id (GitHub flavor) #814

Open sznowicki opened 4 years ago

sznowicki commented 4 years ago

I've stumbled upon a behaviour that I first thought was fine (emoji FTW), but then someone pointed out that it doesn't match how GitHub.com does it. So I guess there's either something to fix here, or something to leave as is with a short explanation.

Problem: This MD:

# 🤪Hello ?

Will result in this HTML:

<h1 id="🤪hello-">Hello</h1>

I found it while implementing a feature that copies a new URL with an anchor when the user clicks one of the <h*> elements.

The "new url" code is like this:

const u = new URL('https://example/com');
u.hash = '🤪hello-';
const newUrl = u.toString();
// "https://example/com#%F0%9F%A4%AAhello-"

Which... works fine in all the browsers I tested (Chrome, Safari, FF, Brave). However, as pointed out, GitHub does it differently: every emoji is converted into "-".
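The "works fine" part can also be checked outside a browser. The snippet below is a minimal sketch, assuming a runtime with the WHATWG `URL` global (for example a recent Node.js); the names `headerId` and `decodedId` are introduced here for illustration only. It sets the emoji id as the hash and then decodes the fragment of the resulting URL to confirm that the percent-encoded form maps back to the original id.

```javascript
// Sketch: confirm the percent-encoded fragment decodes back to the header id.
// `headerId` and `decodedId` are illustrative names, not part of the original code.
const headerId = '🤪hello-';

const u = new URL('https://example/com');
u.hash = headerId;
const newUrl = u.toString();

// Parse the copied URL again and decode its fragment (dropping the leading "#").
const decodedId = decodeURIComponent(new URL(newUrl).hash.slice(1));

console.log(newUrl);
console.log(decodedId === headerId);
```

So the encoding itself is lossless; the open question is only whether the id should contain the emoji in the first place.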

Also, after giving it some more thought, I have concerns about readability and about the case where a user needs to write this URL down somewhere using analog tools (eye, hand and paper).

My config:

const converter = new showdown.Converter({
  ghCompatibleHeaderId: true,
  strikethrough: true,
  tables: true,
  tasklists: true,
  ghMentions: true,
  encodeEmails: false,
  metadata: true
});
converter.setFlavor('github');

converter.setOption('simpleLineBreaks', false);

As far as I can tell from the showdown source code, ghCompatibleHeaderId should generate anchor ids for h* elements the same way GitHub does. It doesn't, so I thought it was worth posting here.

Repro: https://github.com/sznowicki/repro-showdown-github-anchors

PS. I am aware that my showdown initial config might be redundant or overwritten by .setFlavor.

PRO-2684 commented 2 years ago

I have the same problem, except that in my case anchors with emojis don't work at all due to some encoding problems. Maybe adding an emojiAllowedInId option or similar would help?

tivie commented 2 years ago

Yeah, emojis in headers might be problematic. Surprisingly, the solution is not trivial though. See this Stack Overflow post.

That's mostly because of the need to support languages that use characters other than the Roman alphabet.

sznowicki commented 2 years ago

I tackled this in the end by creating an extension that replaces them.

It's been in production for quite a while and nobody has complained, so I assume it works the same way as on GitHub.

Code below:

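// Requires the emoji-regex package (npm install emoji-regex).
const emojiRegex = require('emoji-regex');

// Replace every emoji with "-", then trim dashes from both ends of the id.
const replaceEmojisWithDash = (text) => {
  return text
    .replace(emojiRegex(), '-')
    .replace(/^-+|-+$/g, '');
};

// Showdown output extension: rewrites the id attribute of generated <h1>..<h6> tags.
const EmojiHeadersExtension = {
  type: 'output',
  // [^>]* and [^"]* keep the match inside the opening tag and the id value.
  regex: /(<h[1-6][^>]* id=")([^"]*)("[^>]*>.*<\/h[1-6]>)/g,
  replace: (match, prefix, id, suffix) => prefix + replaceEmojisWithDash(id) + suffix,
};

module.exports = EmojiHeadersExtension;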

tivie commented 2 years ago

Thank you. That package, emoji-regex, lists every emoji in the regex. A bit of a brute-force approach, but I don't think there is a more elegant solution for this.

https://github.com/mathiasbynens/emoji-regex/blob/main/index.js
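For comparison, here is a dependency-free sketch of the same idea using Unicode property escapes instead of an enumerated list. It assumes a JavaScript engine that supports `\p{Extended_Pictographic}` and `\p{Emoji_Modifier}` with the `u` flag. It is not equivalent to emoji-regex: for example, it does not cover flag or keycap sequences, and it also matches some pictographic symbols that are not usually thought of as emoji. The names `EMOJI_SEQUENCE`, `replaceEmojisWithDashNoDeps`, `HEADER_WITH_ID` and `stripEmojiFromHeaderIds` are hypothetical and exist only for this example.

```javascript
// Sketch only: approximate emoji matching via Unicode property escapes.
// Assumption: the runtime supports \p{Extended_Pictographic} (needs the `u` flag).
// Matches a pictographic character plus any skin-tone modifiers, variation
// selectors (U+FE0F) and zero-width-joiner (U+200D) continuations.
const EMOJI_SEQUENCE =
  /\p{Extended_Pictographic}(?:\p{Emoji_Modifier}|\uFE0F|\u200D\p{Extended_Pictographic})*/gu;

// Same steps as replaceEmojisWithDash in the extension posted earlier:
// emoji become "-", then dashes at either end of the id are trimmed.
const replaceEmojisWithDashNoDeps = (text) =>
  text.replace(EMOJI_SEQUENCE, '-').replace(/^-+|-+$/g, '');

// Tightened variant of the header regex: stays inside the opening tag and id value.
const HEADER_WITH_ID = /(<h[1-6][^>]* id=")([^"]*)("[^>]*>.*<\/h[1-6]>)/g;

// Applies the id clean-up to already-rendered HTML, as an output extension would.
const stripEmojiFromHeaderIds = (html) =>
  html.replace(HEADER_WITH_ID, (match, prefix, id, suffix) =>
    prefix + replaceEmojisWithDashNoDeps(id) + suffix);

console.log(stripEmojiFromHeaderIds('<h1 id="🤪hello-">Hello</h1>'));
// prints: <h1 id="hello">Hello</h1>
```

Whether this trade-off is acceptable depends on which emoji actually appear in the headers; the enumerated regex remains the safer choice when exact coverage matters.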