(proposal) Markdown-like chat message parsing

KamilaBorowska commented 7 years ago

The current chat message parser doesn't deal well with edge cases, so I'm interested in changing it to be more compatible with other Markdown implementations, specifically the CommonMark specification.

The reasons why I want to do so are:

Making it easier to explain in chat how to use chat formatting. For example, let's say somebody wants to explain how to write code (``code``).

Currently this involves saying something like this:
- User: Start code with ``
- User: End it with ``
when it would be great to just be able to say:
- User: ``code``
Avoiding accidental formatting. For instance, typing ``__proto__`` currently causes an unintended italic "proto" to appear. Yes, this will break formatting within code, but I don't think that's a big issue myself; it's more annoying than helpful.
Compatibility with Discord chat formatting. People may be used to how Discord formats stuff. Discord uses Markdown itself, so it would be great if the same formatting system could be used on Showdown so there would be no need to switch context.
Improving the situation with edge cases. Currently it is hard to predict how edge cases will be parsed.

The Markdown implementation of the chat message parser would support the following Markdown features:

Bold (**text**)
Italics (_text_, *text*)
Code (`Code`, ``Code`` and so on)
Backslash escapes (\`, \* and so on)
Entity references (&amp;)
Autolinks (<http://example.com>, <example@example.com>)
>green text (>text)

There is no support for links, images and raw HTML. This is intentional, as those features would likely be misused. For instance, link text may be misleading compared to its actual location.

Additionally, these non-standard Markdown features would be implemented:

Italics with double underscore (__text__)
~Strikethrough~ (~~text~~)
^Superscript (^^text^^)
_Subscript (\\text\\)
"I feel lucky" links ([[text]])
Autolinks without <> brackets (http://example.com/)

For the specification of how this would work, see http://spec.commonmark.org/0.27/#inlines. Strikethrough and superscript are parsed in a similar way to the * character in that specification. [[ is parsed in a similar way to [ in that specification (the shortcut reference links section, to be exact). Links without <> brackets will be parsed just like the current implementation does; I don't see any issue with it.

Subscript is rather tricky, but it will likely involve DWIM code whose purpose is to determine whether you wanted a backslash escape or not, depending on whether a backslash escape would be needed or not. I still need to figure out this part precisely (as far as the specification goes).

This is a proposal. My intent is to reduce the number of incompatible changes as much as possible, but it's unavoidable that some edge cases will be parsed differently. After all, one of the reasons to do this is to make the parser more consistent.

A new implementation should be quite fast as it would be based on a state machine (similar to how programming languages are parsed), parsing every character just once without backtracking. I don't see anything in CommonMark that would specifically prevent doing this in O(N) time.
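
To illustrate the single-pass idea, here is a minimal sketch. It is not the Pokémon Showdown parser and not a CommonMark implementation: it assumes only backslash escapes, **bold** and ``code`` exist, and the names `formatMessage` and `escapeHTML` are made up for this example. Each character is visited once with one character of lookahead, and an unclosed marker is turned back into literal text by patching its slot in the output instead of rescanning the input.

```typescript
// Hypothetical sketch, not the Pokémon Showdown parser and not full CommonMark.
// Handles only backslash escapes, **bold** and ``code``.

function escapeHTML(s: string): string {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

function formatMessage(input: string): string {
  const out: string[] = [];
  let boldOpener = -1; // index in `out` of an unclosed <b>, or -1
  let codeOpener = -1; // index in `out` of an unclosed <code>, or -1
  let i = 0;
  while (i < input.length) {
    const ch = input[i];
    const doubled = i + 1 < input.length && input[i + 1] === ch;
    if (codeOpener >= 0) {
      // Inside code: nothing is special except the closing ``.
      if (ch === "`" && doubled) {
        out.push("</code>");
        codeOpener = -1;
        i += 2;
      } else {
        out.push(escapeHTML(ch));
        i += 1;
      }
    } else if (ch === "\\" && i + 1 < input.length) {
      // Backslash escape: emit the next character literally.
      out.push(escapeHTML(input[i + 1]));
      i += 2;
    } else if (ch === "`" && doubled) {
      codeOpener = out.length;
      out.push("<code>");
      i += 2;
    } else if (ch === "*" && doubled) {
      if (boldOpener >= 0) {
        out.push("</b>");
        boldOpener = -1;
      } else {
        boldOpener = out.length;
        out.push("<b>");
      }
      i += 2;
    } else {
      out.push(escapeHTML(ch));
      i += 1;
    }
  }
  // Unclosed markers become literal text again, without rescanning the input.
  // Limitation of this sketch: text after an unclosed `` was treated as code,
  // so it stays unformatted.
  if (codeOpener >= 0) out[codeOpener] = "``";
  if (boldOpener >= 0) out[boldOpener] = "**";
  return out.join("");
}
```

With this sketch, `formatMessage("**bold** and ``a**b``")` gives `<b>bold</b> and <code>a**b</code>`, and single symbols such as `a*b*c = d` pass through unchanged.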

Zarel commented 7 years ago

Yes, this is already being worked on in chat.js. We'll port it to the client once it's good enough.

Zarel commented 7 years ago

I mean, the issues. We're not going to take Markdown format for various reasons, mainly that Markdown is not designed for end users, and has a lot of snags when dealing with end users (the biggest one relevant to us being that it makes ascii art impossible).

KamilaBorowska commented 7 years ago

> (the biggest one relevant to us being that it makes ascii art impossible)

Not sure how that differs from what we currently have. At least with Markdown you can sort of escape characters if they end up being metacharacters.

Zarel commented 7 years ago

Okay, that was unclear. Let me try again.

We will fix: issues involving code in URLs and code blocks not being escaped, and issues involving nested formatting being weird

We will not have: single-character formatting markers, like _text_ for italics

Zarel commented 7 years ago

Markdown was designed for programmers. It was designed for everyone to read, but it was designed to be written by programmers, not the general public.

Reddit is the most infamous example of Markdown misuse: line breaks don't appear, comments starting with things like "52." are automatically converted to "1." (because Markdown renumbers lists), and hashtags are converted to headings.

GitHub has GitHub Flavored Markdown, which fixes some of these issues.

But ASCII art is still a problem. ¯\_(ツ)_/¯ still needs two escapes. And you should not expect users to know how to escape text.

Morfent commented 7 years ago

Are multi-line code blocks being considered for this? They'd be helpful for techcode and dev, so Hastebin and its ilk wouldn't need to be used for short chunks of code that aren't one-liners.

Zarel commented 7 years ago

PS does not currently support multi-line messages, and changing that would be difficult, I think.

panpawn commented 7 years ago

PS does accept multi-line commands, though, so we could make something like !code [code here that spans multiple lines]

Edit: it could be collapsed by default, using <summary> and <details> tags
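
A rough sketch of what that output could look like, assuming the command receives the raw multi-line text. The function names `renderCollapsedCode` and `escapeHTML` are hypothetical and not taken from the PS codebase; the HTML relies on `<details>` being collapsed until the reader clicks its `<summary>`.

```typescript
// Hypothetical sketch: wraps a multi-line code message in a collapsed block.

function escapeHTML(s: string): string {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

function renderCollapsedCode(code: string): string {
  const lineCount = code.split("\n").length;
  // <details> without the `open` attribute is collapsed by default.
  return "<details><summary>Code (" + lineCount + " lines)</summary>" +
    "<pre><code>" + escapeHTML(code) + "</code></pre></details>";
}
```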

Zarel commented 7 years ago

I'm okay with !code for a multi-line code block

panpawn commented 7 years ago

Pull request for !code: https://github.com/Zarel/Pokemon-Showdown/pull/3802

Zarel commented 7 years ago

Really, this is why PS leaves single symbols alone - casual users are not going to know how to type the things they want to type. a*b*c = d et al should really not be turned into italics or whatever.

Zarel commented 6 years ago

CommonMark Example 333 is a good example of why I don't want Markdown: http://spec.commonmark.org/0.27/#example-333

5*6*78 renders as <p>5<em>6</em>78</p>
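
Why that example comes out as emphasis can be sketched with a simplified version of the flanking rules from the CommonMark inline spec. The code below is only an illustration under stated assumptions: it looks at a single `*` or `_` (not longer delimiter runs), treats only ASCII punctuation as punctuation, and the function names are made up. It shows that `*` may open and close emphasis in the middle of a word, while `_` may not.

```typescript
// Simplified sketch of CommonMark's flanking rules for a single-character
// delimiter run. Assumptions: ASCII punctuation only, JavaScript \s as
// whitespace, and `pos` points at a lone "*" or "_".

function isSpace(c: string | undefined): boolean {
  // The start and end of the line count as whitespace.
  return c === undefined || /\s/.test(c);
}

function isPunct(c: string | undefined): boolean {
  return c !== undefined && /[!-\/:-@\[-`{-~]/.test(c);
}

function flanking(text: string, pos: number) {
  const before = pos > 0 ? text[pos - 1] : undefined;
  const after = pos + 1 < text.length ? text[pos + 1] : undefined;
  const left = !isSpace(after) &&
    (!isPunct(after) || isSpace(before) || isPunct(before));
  const right = !isSpace(before) &&
    (!isPunct(before) || isSpace(after) || isPunct(after));
  return { left, right, before, after };
}

function canOpenEmphasis(text: string, pos: number): boolean {
  const f = flanking(text, pos);
  if (text[pos] === "*") return f.left;
  // "_" must not open in the middle of a word.
  return f.left && (!f.right || isPunct(f.before));
}

function canCloseEmphasis(text: string, pos: number): boolean {
  const f = flanking(text, pos);
  if (text[pos] === "*") return f.right;
  // "_" must not close in the middle of a word.
  return f.right && (!f.left || isPunct(f.after));
}
```

For `5*6*78`, the first `*` can open and the second can close, so the 6 is emphasized. For `5_6_78`, neither underscore qualifies, and for `a * b` the spaced-out `*` can neither open nor close.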

smogon / pokemon-showdown

(proposal) Markdown-like chat message parsing #3766