wbamberg commented 5 years ago

We expect to use Markdown as an authoring format for GitHub-hosted content.

But there are several different versions of Markdown, with different features and levels of support by tools.

We should:

decide which version to use
decide if there are any extensions we need (e.g. code fencing, tables, front matter, ...)
decide which actual parser our tools ought to use

Acceptance criteria

decision is documented with a rationale
the chosen tool(s) along with any extensions is deployed

ddbeck commented 5 years ago

OK, I wrote a gazillion words to myself to sort out this Markdown question. For more about:

why we need to go to such detail
which specs and implementations I considered
and much more

see this Google doc here, but these are the important bits:

Requirements

The Markdown we choose must:

Follow a published specification, to know what Markdown we’re actually getting
Have two or more implementations of that specification in two or more languages, to get some confidence that the specification actually has some force and has a future
Have one or more implementations available as an npm package
Support code blocks with syntax selection, which is rendered to HTML marked up for syntax highlighting
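
To make the last requirement concrete, here is a minimal sketch of the usual convention: the first word of a fenced block's info string becomes a language- class on the code element, which a syntax highlighter can then target. The helper names are hypothetical and the function assumes a well-formed, backtick-fenced block; it is an illustration, not any particular parser's implementation.

```javascript
// Escape text so it can be placed safely inside HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: renders one fenced code block to HTML.
// Assumes the first line opens the fence and the last line closes it.
function renderFencedCode(block) {
  const lines = block.split("\n");
  // The info string is whatever follows the opening backticks.
  const info = lines[0].replace(/^`{3,}/, "").trim();
  // Only the first word of the info string names the language.
  const language = info.split(/\s+/)[0];
  const code = lines.slice(1, -1).join("\n");
  const classAttr = language ? ` class="language-${escapeHtml(language)}"` : "";
  return `<pre><code${classAttr}>${escapeHtml(code)}\n</code></pre>`;
}

console.log(renderFencedCode("```js\nconst x = 1 < 2;\n```"));
```

A block with no info string is rendered without a class attribute, so a highlighter can skip it.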

Beyond that, certain features are desirable for stumptown. It would be nice if the Markdown we choose has:

Compatibility with, if not strict adherence to, GitHub Flavored Markdown (GFM), to get some benefit from rich diffs in GitHub pull requests and rendered Markdown files
Compatibility with tools such as spellcheck or popular editors’ syntax highlighting
Syntax for tables
Syntax for definition lists
Support for front matter

Subjectively, I’d expect our chosen Markdown specification and implementation to have:

Active maintenance (e.g., published a release in the last year, showed issue tracker activity, etc.)
Good documentation
A comprehensible API (perhaps with an obvious plugin or extension API or hooks for linters)

If our chosen Markdown is missing a desirable feature, we may choose a plugin or extension which supports that feature, provided the plugin is congruent with our expectations for the main implementation, such as being well-documented and maintained and working cooperatively with other ecosystem tools.

Proposal: use remark (via unified)

We should use GitHub Flavored Markdown (GFM) with the unified library (or its included remark library directly).

GFM meets our requirements and brings some benefits on top of CommonMark:

GFM meets our specification requirements. It’s well specified and there are several implementations of the spec and many tools with native support for GFM (e.g., spellcheck).
GFM is already well-documented by and familiar to contributors on GitHub.
GFM and its implementations support features we already recognize as desirable beyond those offered by CommonMark, such as fenced code blocks (with language annotation) and tables.
GFM, as specified, bars certain dangerous raw HTML, such as <script> tags (useful to minimize the risks associated with publishing user-generated content).
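
As a rough illustration of the last point, the GFM spec's "Disallowed Raw HTML" extension neutralizes a fixed set of tags by replacing their leading < with &lt;. The sketch below applies that idea to a plain string. The helper name is hypothetical, and a real implementation filters only raw HTML found while parsing, not arbitrary output, so treat this as an approximation.

```javascript
// Tag names listed by the GFM spec's "Disallowed Raw HTML" (tagfilter) extension.
const DISALLOWED_TAGS = [
  "title", "textarea", "style", "xmp", "iframe",
  "noembed", "noframes", "script", "plaintext",
];

// Matches a "<" that opens (or closes) one of the disallowed tags.
// The tag name must be followed by whitespace, "/", ">", or end of input.
const DISALLOWED_PATTERN = new RegExp(
  "<(?=/?(?:" + DISALLOWED_TAGS.join("|") + ")(?:[\\s/>]|$))",
  "gi"
);

// Hypothetical helper: escape the leading "<" of every disallowed tag.
function filterDisallowedHtml(html) {
  return html.replace(DISALLOWED_PATTERN, "&lt;");
}

console.log(filterDisallowedHtml("<p>ok</p><script>alert(1)</script>"));
// prints: <p>ok</p>&lt;script>alert(1)&lt;/script>
```

Other tags, such as <p> or <strong>, pass through unchanged.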

That said, GFM isn’t a perfect selection. It does have some shortcomings:

GFM offers some features we probably do not want to use (such as strikethroughs and task lists). We may be able to use linters (or modify our parsing) to prohibit their use, but this adds complexity.
GFM doesn’t provide a dedicated syntax for definition lists (though it doesn’t bar us from using raw definition list HTML in our Markdown documents). We may have the option of extending GFM, however (more on that below).
GFM itself doesn’t specify front matter, though GitHub itself does appear to parse and render YAML-formatted front matter just fine.
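
For the first shortcoming above, a lint step could flag the GFM features we don't want. The following is a crude, line-based sketch with hypothetical names; it is not a remark plugin, and a real linter would inspect the parsed syntax tree rather than match raw lines.

```javascript
// Hypothetical patterns for GFM syntax we might choose to prohibit.
const UNWANTED_SYNTAX = [
  { name: "strikethrough", pattern: /~~[^~\s](?:[^~]*[^~\s])?~~/ },
  { name: "task list item", pattern: /^\s*(?:[-+*]|\d+[.)])\s+\[[ xX]\]\s/ },
];

// Returns one { line, name } entry per unwanted construct found.
function findUnwantedSyntax(markdown) {
  const problems = [];
  let inFence = false;
  markdown.split("\n").forEach((line, index) => {
    // Toggle on fence lines so that sample code is never flagged.
    if (/^\s*(```|~~~)/.test(line)) {
      inFence = !inFence;
      return;
    }
    if (inFence) return;
    for (const { name, pattern } of UNWANTED_SYNTAX) {
      if (pattern.test(line)) problems.push({ line: index + 1, name });
    }
  });
  return problems;
}

console.log(findUnwantedSyntax("Some ~~old~~ text\n- [ ] todo\n- plain item"));
```

With that input, the first two lines are reported and the plain list item is not.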

For implementation with stumptown, we should use remark or its parent library, unified.

unified/remark meets most of our requirements and demonstrates some strengths over other possible implementations. Some points in its favor include:

It has built-in support for GFM, which appears to conform well to the spec.
The project appears to be well maintained, with releases recently published and active issue trackers.
Of the Markdown implementations I looked at, it appeared to have the best documentation, particularly with sample code.
The API seems tidy, but impressive, with obvious hooks for inspecting and modifying the parsing and rendering of Markdown documents (e.g., to lint source, natural language, and resulting HTML). There’s even an included abstract syntax tree for Markdown documents(!).

unified/remark does have a few drawbacks, but they seem surmountable. For example:

As a large collection of small packages, finding the specific package that contains the functionality you need in unified can be difficult. If unified provides too much complexity, we could opt to use only the remark portion of the library.
Unlike, for example, Pandoc, unified/remark doesn’t include an extension for definition lists or for parsing front matter. There exist such extensions for unified/remark, but they represent independent dependencies we must evaluate, on top of unified/remark itself and on top of the general question of whether or not to extend GFM.

Ultimately, I couldn’t make a solid recommendation about whether to adopt any specific extensions to GFM or unified/remark, particularly for parsing front matter or definition lists. If we can commit to never exposing Markdown to stumptown consumers, then we can follow a principle I lay out below. If we can’t make that commitment, then we need to strictly adhere to GFM, leaving raw HTML (perhaps with custom tags) as our only option for “extending” GFM.

As a principle, if we adopt any extensions to GFM, then we should test those extensions for cooperation with GFM as specified. In other words, our extensions to GFM should be readable (if not pretty) in GitHub renders; GFM spell check or linters should be able to provide meaningful, if not complete, checks on our source. Neither GFM nor unified/remark appears to be a barrier to this principle, but I haven’t yet had an opportunity to test specific extensions for this.

wbamberg commented 5 years ago

From the Google doc (sorry I started commenting there then realised this place is probably better):

I’m going to assume that we’re choosing a Markdown only for internal use to stumptown and that we’re not going to ask stumptown consumers to parse Markdown.

We've mentioned before that l10n might be a consumer of Markdown. This doesn't seem certain, but how would that affect your recommendation?

wbamberg commented 5 years ago

Thanks for this, @ddbeck , it looks very sensible.

GFM and its implementations support features we already recognize as desirable beyond those offered by CommonMark, such as fenced code blocks (with language annotation) and tables.

These are both important features.

GFM itself doesn’t specify front matter,

I think we will need a way to process front matter (as the stumptown structures are currently defined). Currently we're using gray-matter, apparently (https://github.com/mdn/stumptown-experiment/blob/master/scripts/build-json/compose-examples.js#L9). I haven't tried this, but it looks as if this would be independent of our choice of Markdown parser (it seems like it just gives you the Markdown in content, and you can then parse that as you like).
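
To show why this step can be independent of the Markdown parser, here is a minimal, dependency-free sketch of splitting a leading ----fenced block from the body. The function and property names are hypothetical, and this is not gray-matter's implementation; it only separates the raw front matter text and does not parse the YAML.

```javascript
// Hypothetical helper: splits a leading "---" fenced block from the Markdown body.
// The front matter is returned as raw text; a real tool would parse it as YAML.
function splitFrontMatter(source) {
  const match = /^---\r?\n([\s\S]*?)\r?\n---(?:\r?\n|$)/.exec(source);
  if (!match) {
    // No front matter: the whole source is Markdown content.
    return { frontMatter: null, content: source };
  }
  return { frontMatter: match[1], content: source.slice(match[0].length) };
}

const page = "---\ntitle: Example\n---\n# Heading\n\nBody text.\n";
const { frontMatter, content } = splitFrontMatter(page);
console.log(frontMatter); // prints: title: Example
console.log(content); // the remaining Markdown, starting at "# Heading"
```

The content string can then be handed to whichever Markdown parser we choose.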

If we can commit to never exposing Markdown to stumptown consumers, then we can follow a principle I lay out below. If we can’t make that commitment, then we need to strictly adhere to GFM, leaving raw HTML (perhaps with custom tags) as our only option for “extending” GFM.

I think that even if we couldn't commit to "never exposing Markdown to stumptown consumers", we might still be able to commit to never exposing front matter.

ddbeck commented 5 years ago

Thanks for taking a look at this, @wbamberg! I realize there was a lot to go over.

We've mentioned before that l10n might be a consumer of Markdown. This doesn't seem certain, but how would that affect your recommendation?

If we can obligate translators to handle Markdown in a particular way (unified has a nifty preset API for making it easier for this to happen), then my recommendation still stands. In other words, if localized Markdown is ultimately converted to HTML for general consumption, then we can treat localization as an "internal" use, even if localization consumes a JSON structure that contains Markdown instead of HTML.

On the other hand, if we mean for Markdown to be an option for general consumption alongside or in place of HTML, then my recommendation would be to strictly follow GFM and use raw HTML for any extension use cases (e.g., use plain <dl> tags instead of an extension to Markdown). We might be able to do some tricks with custom elements/Web Components for more complex cases, but for the most part we'd be constrained to plain GFM and HTML.

I think we will need a way to process front matter

Yes, definitely. I sorta skated past that. You're right that we don't need to ever expose it to consumers—an unstated assumption on my part—and I didn't give it much thought beyond that. But to expand on the front matter situation a little:

The bad news is that nobody included front matter in a specification. The good news is that it doesn't seem to matter much, provided we use some conventional-looking front matter. That basically means YAML, blocked like this:

---
some: yaml
goes: here
---

(Or we could use TOML with +++ fencing, but I recognize that TOML is unusual and I'm in the tiny minority that prefers it.)

I was mistaken in my original write up: unified does have a package for parsing front matter, which we could use or we could stick to gray-matter. The semantics of remark's approach is slightly different—the front matter becomes a YAML content node of the document rather than something cleaved from the content—but it doesn't seem any harder to work with, if we want to stick to one ecosystem.

wbamberg commented 5 years ago

OK, thanks for the clarifications @ddbeck . I'm happy with the choice and the process you've used to arrive at it.

To close this issue, looking at the AC above:

decision is documented with a rationale

It would be good to record this choice and the reasoning for it in the stumptown repo rather than a random issue under mdn/sprints (basically copying your doc or a version of it somewhere there), but otherwise I think we can call this done.

the chosen tool(s) along with any extensions is deployed

I guess this is a quite simple change to stumptown-experiment.

ddbeck commented 5 years ago

It would be good to record this choice and the reasoning for it in the stumptown repo rather than a random issue under mdn/sprints (basically copying your doc or a version of it somewhere there), but otherwise I think we can call this done.

OK. I'm going to max out on hours this week. Should we put this officially in the next sprint, to open a PR summarizing the decision?

wbamberg commented 5 years ago

Should we put this officially in the next sprint, to open a PR summarizing the decision?

We talked about this in the planning meeting today. I think it would be good to keep this issue open to track this last bit, and add it to the next sprint, but as a lower priority for you than BCD. If you get time after BCD, then great, otherwise you can do that in a later sprint. I think with the work you've done here we have a solid basis to move ahead, and the remaining stuff is just paperwork really.

Does that make sense to you?

ddbeck commented 5 years ago

Sounds good!

a2sheppy commented 5 years ago

I generally feel that if we are going to use Markdown, we need to be able to avoid having to fall back to HTML as much as is remotely practicable. Any time you have to mix and match them to accomplish your tasks is a potential failure point in the markup that it would be best to avoid.

There are a number of articles about why Markdown is not a great choice for writing documentation, so I won't add to the ranting on that front, other than to say that I agree it is not a good choice: having to write while reading markup at the same time is tedious and awkward, so I hope we find a WYSIWYG editor to offer. But I presume that ship has sailed at this point anyway. :)

Some thoughts I have on this issue:

I am pleased to see that definition lists are on the radar, as they're a crucial component of our content structure.
We need to be able to have bullet points in lists that have multiple child blocks, such as multiple paragraphs, or a combination of paragraphs, sub-lists, tables, etc.
Tables are a special concern of mine. While table support is common in Markdown, we have tables in many places that are complex enough that they cannot be replicated using Markdown. I can't provide an example at the moment; I just know that I recently tried to add a table to a planning document that was formatted similarly to ones we use on MDN and was unable to do so because of things Markdown doesn't handle. One example: support for customizing the background and foreground colors in a row, or a cell, or a column. The bigger problem, however, revolves around tables which aren't strictly A by B cells, but have rows and/or columns with varying numbers of entries, split cells (vertically or horizontally), etc.
Biggest problem with tables in Markdown: You can't nest tables. We need to be able to do this, absolutely.
There are also not enough controls for formatting of images; you can't specify sizes, borders, etc. This is a huge problem given the need to support presentation of things like oversized images intended for high resolution display on a retina screen, or to allow for scaling of SVG diagrams (do those work in Markdown image format? If not, that's another problem).
When we first moved from Markdown to HTML many many years ago, part of the reason was that Markdown was corrupting source code samples. That may have been an issue specific to MediaWiki, and I don't recall the exact problem. It was a frequent issue, however, and left code snippets in a condition where you couldn't copy and paste them into your own code anymore, as so many people do.
Another one we need: the ability to do <kbd>, which is used frequently in developer tools documentation, as well as in documentation that includes user-entered strings, such as docs about <input> and anything that involves working in a console.
The ability to insert anchors at arbitrary points; that is, support for a syntax that replicates <a name="foo">...</a>.
Support for presenting mathematical content (MathML and/or LaTeX). This is used here and there, with increasing frequency in the media and graphics content.
Superscript and subscript text.
Embedding of <video> and <audio>.
Support for creating image or table presentations wrapped in <figure>, with optional <figcaption>. This allows for much more control over presentation, and lets us label our figures, which we really need to do more of. I've been starting to do this in the WebRTC and other media docs.
Support for inserting things like info boxes, asides, and so forth.
A nice-to-have: the ability to give a URL to a file, perhaps on GitHub, and an optional range of line numbers or a function or object name, and have the referenced file (or portion of a file) presented inline -- to allow snagging code sample snippets from maintained code and presenting it as part of documentation.

I know there's more but that is what comes immediately to mind. I hope we find a solution that supports everything we need well.

jpmedley commented 5 years ago

I started reading Eric's comments with the intent of arguing against him because Google has a site where we've successfully used markdown for years.

But he convinced me.

We frequently mix markdown and HTML specifically because of figures and videos like the ones he mentions. After reading the list of additional ways that MDN would have to mix md and HTML, it appears that you will likely only be saving us from typing a handful of tags: hn, p, code, and pre.

(By the way, I think the code problem alluded to was specific to your engine. Google's site has never had this problem that I am aware of.)

a2sheppy commented 5 years ago

@jpmedley Yeah, it's really a matter of how intricate the content is and how much of a focus you put on detailed presentation with embedded examples and whatnot. The more complex the content, the harder it is to shoehorn it into a Markdown world comfortably. We have a lot of figuring out to do if we really are going to migrate back to Markdown after all these years.

wbamberg commented 5 years ago

I don't expect that in this future MDN you will be able to do all the same things you could do in the old one, and I don't particularly think this is a bad thing. To make a very close analogy: you don't have as much freedom for how to represent compat data now as you used to have, and the payoff for this is (1) highly consistent tables (2) the ability to easily change the appearance of tables (3) authors don't have to hand-craft the tables, and can just focus on the content.

So I don't think we should approach this like: "we can do X now: therefore our replacement must also do X". Instead we want to understand better what are the things that an authoring format absolutely must support, and to do this we have chosen to experiment with Markdown. I agree that Markdown is very limited. But choosing an authoring format is going to be an exercise in compromise. There isn't a perfect authoring format: if there were, everyone would be using it.

So when we encounter things in MDN that Markdown can't support, we need to ask questions like:

do we actually need to be doing these things, or are there simpler alternatives that are as good?
if we do need them, are there solid Markdown extensions we can use?
if not, is this a common enough case that dropping into HTML is not going to be OK?

Maybe Markdown will turn out to be too limited. We'll find that out by trying to migrate pages and asking the questions above, and if we do, we'll have to think again. Do you have a suggestion for a format that would be better? (I don't think HTML + CKEditor is better: for any nontrivial edit I usually find myself in the source view anyway, many of our pages contain junk HTML from people pasting things into the editor, and reviewing diffs is really difficult.)

Finally: for any content we structure, the pages are going to be built by software, and this will sometimes help with elements that aren't supported in Markdown. For example, the list of HTML element attributes is rendered as a <dl> from a list of attributes in the structured content. So you still get the <dl> in the rendered page, without Markdown needing to understand it. This kind of thing covers most of the uses of <dl> that we have in MDN (for example, lists of properties of an interface or lists of parameters to a function).
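
A minimal sketch of that idea follows. The shape of the attributes array and the function names are assumptions for illustration, not stumptown's actual schema or renderer, and descriptions are treated as plain text here, whereas in practice they might be Markdown rendered to HTML first.

```javascript
// Escape text so it can be placed safely inside HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical renderer: builds a <dl> from structured attribute entries.
// Each entry is assumed to look like { name, description }.
function renderAttributeList(attributes) {
  const items = attributes.map(
    ({ name, description }) =>
      `  <dt><code>${escapeHtml(name)}</code></dt>\n  <dd>${escapeHtml(description)}</dd>`
  );
  return `<dl>\n${items.join("\n")}\n</dl>`;
}

console.log(
  renderAttributeList([
    { name: "href", description: "The URL that the link points to." },
  ])
);
```

The author only writes the name and description; the <dl> markup exists solely in the build step.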

ddbeck commented 5 years ago

Will has said nearly everything I started to write last night, but I wanted to add a few things:

I agitated for this process of making an explicit choice of Markdown specification and implementation because I think Markdown is flawed. I'm a Markdown hater. I will happily talk to anyone—at embarrassing length—about what's wrong with Markdown¹. I do not think for a second that we will escape all of Markdown's shortcomings (though, in a lot of cases, it's no great harm to this project, as we already have many of those shortcomings and worse, on the wiki right now). But I think we can avoid the worst shortcomings of Markdown by being smart about how we use it, particularly in ways that decouple the way we author content and the way it's presented on the final page.

¹Though it pains me to acknowledge this, Markdown has fewer problems in 2019 than it did five years ago. I have fewer bad things to say about Markdown today and that makes me wistful.

mdn / sprints

Choose a Markdown format for GitHub-hosted content #1505
