MDN as a repo of structured content

dontcallmedom commented 3 years ago

As part of the Yari transition, there have been discussions about how, in the same way that BCD has turned out to be a very useful source of reusable data, many of the pieces of content that make up an MDN page could also be reused in other contexts. More generally, moving some of the prose content into more structured data can help with automating and systematizing how MDN pages represent content.

A lot of this structuring has been happening incrementally, but it would probably be useful to get a clearer picture of what the goals and constraints would be, and possibly to recruit consumers of this structured data, to make sure that as these incremental improvements land, we're not missing opportunities or creating obstacles to that plan.

wbamberg commented 3 years ago

@ddbeck @escattone , you might be interested in this issue.

As you might imagine, I have feelings about this. Sorry for the disorganized nature of this but I hope it is somewhat helpful.

I think ideally, rather than plan a giant project, we could consider this as a long-term vision for MDN, and have shorter projects that deliver value on their own but also get us closer. We can use this overarching goal to see whether individual projects do get us closer, and to guide the way we execute these projects. I hope that makes sense. For example, the work on spec URLs delivers real value now, but we can also see how it gets us closer to structured content (by building more of our pages from data), and we can see how particular choices about how we've done it (like starting to use front matter) fit into an overall vision of where we want to be.

What things should we structure?

We can sketch out some ideas for things we could structure (i.e. pull out of the unstructured prose content and represent as data). We've already done a lot of this work in the past, and there are conversations going on now about possible items. If we want to make parts of MDN content available to other tools and applications, we should also ask potential consumers which specific things they would find useful.

Page types

One thing I think we'll need sooner rather than later is an explicit representation of page types: that is, is this a JS method page, or a CSS property page, or a WebAPI interface page...?

The reason is that some pieces of structured content apply to some but not all MDN pages. Even something as relatively universal as BCD or spec URLs only applies to reference pages, not guide pages, and our inability to represent that is already causing us problems (see https://github.com/mdn/content/issues/4574). But suppose we want to represent the permissions that a Web API needs. This seems like a really obvious thing to structure. But it's only applicable to Web APIs. So it would be great to have a way to represent this. In stumptown we had page types that mapped to "recipes", which told you which things a particular type of page could (or must) contain. If you do that you can lint for these things, and generally rely on particular types of pages having a particular collection of stuff.
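To make the recipe idea concrete, here is a minimal Python sketch of linting a page against a recipe for its page type. It is only an illustration: the page type names, the item names, and the `lint_page` helper are all hypothetical, and this is not stumptown's actual recipe format.

```python
# Hypothetical recipes: which structured items each page type must or may carry.
# None of these names are taken from stumptown or Yari.
RECIPES = {
    "web-api-interface": {
        "required": {"browser-compat", "spec-urls"},
        "optional": {"permissions"},
    },
    "css-property": {
        "required": {"browser-compat", "spec-urls"},
        "optional": set(),
    },
    "guide": {"required": set(), "optional": set()},
}


def lint_page(page_type, keys):
    """Return a list of problems for a page carrying the given metadata keys."""
    recipe = RECIPES.get(page_type)
    if recipe is None:
        return [f"unknown page type: {page_type}"]
    keys = set(keys)
    # Items the recipe demands but the page does not provide.
    problems = [
        f"missing required item: {key}"
        for key in sorted(recipe["required"] - keys)
    ]
    # Items the page provides but the recipe does not allow.
    allowed = recipe["required"] | recipe["optional"]
    problems += [
        f"item not allowed on {page_type}: {key}"
        for key in sorted(keys - allowed)
    ]
    return problems
```

With something like this, a guide page that declares `browser-compat` would be flagged, and so would a CSS property page that omits `spec-urls`.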

There's actually nothing in Yari stopping us from adding page types right now, and making use of them in KS at least: https://github.com/mdn/yari/issues/3350.

Where should we keep data?

There are three things worth thinking about here:

front matter: great as a way to associate a fairly compact collection of data with a page. A big advantage of front matter for an author is that the data lives right next to the prose. I think a big drawback of mdn/data is that the data (say, whether a CSS property is animatable) lives all the way over there while I am writing the page. So updating it is an extra task that gets forgotten, and the data gets out of date. But we don't want hundreds of lines of front matter. So we sometimes want to use the front matter to refer out to another data source (as we do for BCD). And we want to be disciplined and careful about what we choose to include here.
BCD: this is mature and well maintained. I'm a bit anxious that we might end up just stuffing things in here that don't belong here, because it's easy. IMO BCD is properly for data that's specific to particular browser implementations of a thing, and should not document specced features. I think we need a common conception of what belongs in BCD and what doesn't.
mdn/data: this gets its own section, below.
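As a rough sketch of front matter that keeps small facts next to the prose and refers out to a larger data source by key, consider the Python below. The sample page, the `page-type` key, and the `EXTERNAL_DATA` lookup table are assumptions made for illustration, and the parser only handles flat `key: value` lines; a real tool would use a proper YAML parser.

```python
def parse_front_matter(text):
    """Split a page into (metadata dict, prose body).

    Only handles flat `key: value` lines between two `---` markers.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            # Everything after the closing marker is prose.
            return meta, "\n".join(lines[i + 1:])
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}, text  # no closing marker: treat the whole file as prose


# Hypothetical stand-in for an external data source such as BCD.
EXTERNAL_DATA = {"css.properties.margin-top": {"note": "placeholder compat record"}}

PAGE = """---
title: margin-top
page-type: css-property
browser-compat: css.properties.margin-top
---
The margin-top property sets the margin area on the top of an element.
"""

meta, body = parse_front_matter(PAGE)
# The front matter holds only a short key; the bulky data lives elsewhere.
compat = EXTERNAL_DATA.get(meta.get("browser-compat"))
```

The point of the design is that the front matter stays short: compact facts sit beside the prose, while anything large is reached through a key.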

Dealing with mdn/data

The mdn/data repo is a kind of proto-structured data, and although it was an idea ahead of its time, I think it's past time that we moved on from it. Most of the data here is for CSS, and that is in two main parts, I think:

general CSS data, which ends up in tables like https://developer.mozilla.org/en-US/docs/Web/CSS/margin-top#formal_definition via an extremely hairy macro. We have previously discussed whether we should remove some of these items, and have even removed some. Ideally I think we could remove some, and move the rest into front matter.
CSS formal syntax. We have debated for years what to do about this: whether we should remove it entirely, hide it, or try to make it better (https://discourse.mozilla.org/t/pretty-printing-the-css-formal-syntax/24588, https://discourse.mozilla.org/t/pretty-printing-the-css-formal-syntax-part-2/27330). I think ideally we'd keep it in the MDN pages and make it better. One idea I like a lot is to ask the CSS WG to maintain the data itself, because they are really the arbiters of whether it's correct. I don't know if they would go for that.

So maybe by some combination of these things we could retire mdn/data finally, or at least hone it down to mdn/csssyntax.

Structuring prose

We should consider that not only data wants to be free. Some applications want to show "short descriptions" for things like CSS properties, and these are fundamentally prose. Or imagine wanting to extract the "accessibility concerns" for all the HTML elements. In stumptown we thought about also supporting this, so the prose had a guaranteed structure that was kind of addressable.

Intermediate formats

Sort of related to this, we might want to consider an intermediate format for MDN content for consumers, that's more convenient and perhaps more stable than directly accessing the things under content/files. For example, if we want to structure prose we would probably have to do some ugly stuff around slicing up MD files at H2 boundaries, and dealing with variations introduced by translations, and so on. It would be better if we did that on behalf of consumers, and provided them with a clean way to say "give me the short descriptions for CSS properties".
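The "slicing up MD files at H2 boundaries" step could look roughly like the Python sketch below. It is a simplification under stated assumptions: it only recognizes `## ` style headings, it ignores translations entirely, and the `split_at_h2` name and the sample document are made up for illustration.

```python
import re

H2 = re.compile(r"^## +(.+?)\s*$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def split_at_h2(markdown):
    """Map each H2 heading to the prose under it.

    Text before the first H2 is stored under the key "" (often the summary).
    Lines inside fenced code blocks are never treated as headings.
    If two H2 headings share a name, the later one wins.
    """
    sections = {"": []}
    current = ""
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else H2.match(line)
        if match:
            current = match.group(1)
            sections[current] = []
        else:
            sections[current].append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}


DOC = (
    "The margin-top property sets the top margin.\n\n"
    "## Syntax\n\nmargin-top: 10px;\n\n### Values\n\nA length.\n\n"
    "## Accessibility concerns\n\nNone.\n"
)
sections = split_at_h2(DOC)
```

A consumer asking for "the short description" or "the accessibility concerns" could then read `sections[""]` or `sections["Accessibility concerns"]` without touching the raw files, which is the kind of stable access an intermediate format would provide.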

dontcallmedom commented 3 years ago

Replying to a specific low-hanging fruit here:

CSS formal syntax. We have debated for years what to do about this: whether we should remove it entirely, hide it, or try to make it better (https://discourse.mozilla.org/t/pretty-printing-the-css-formal-syntax/24588, https://discourse.mozilla.org/t/pretty-printing-the-css-formal-syntax-part-2/27330). I think ideally we'd keep it in the MDN pages and make it better. One idea I like a lot is to ask the CSS WG to maintain the data itself, because they are really the arbiters of whether it's correct. I don't know if they would go for that.

We actually already have that automatically extracted from CSS specs in our webref project, and even released it as an NPM package (although the package is not being systematically updated yet): https://github.com/w3c/webref/tree/master/ed/css

As for other spec data, we would be more than happy to adjust the extracted information to fit the needs of MDN.

wbamberg commented 3 years ago

I filed https://github.com/openwebdocs/project/issues/44 for this bit :).

Elchi3 commented 2 years ago

I'm going to close this issue. We developed a theme for us to work towards: offering structured content and allowing docs to be used in multiple contexts. See https://github.com/openwebdocs/project/blob/main/steering-committee/themes.md#offer-structured-content-and-data-for-documentation-to-be-used-in-multiple-contexts

We have also worked on quite a few projects towards this theme (integrating webref, generated spec sections, CSS syntax, page types, etc.) and will continue to do so incrementally.

If you have a specific idea that contributes to our theme, feel free to file it as a new OWD project proposal. If you think our theme needs to be updated, then I'd suggest filing a new issue, too.
