We want to lint the WebAPI docs, but at this point we are only concerned with the most important WebAPI docs, which we call "P1 docs". We expect this to consist of about 1100 pages (out of ~5000 pages total under https://developer.mozilla.org/en-US/docs/Web/API).

In this story we will define precisely what is in P1, and thus what's in scope for the current round of linting.

Acceptance criteria

[ ] We have a clearly defined set of P1 WebAPI docs.

This issue might be useful to decide what you want to call a p1 API page: https://github.com/mdn/browser-compat-data/issues/5674

We also have the scoping exercise in https://docs.google.com/document/d/1rHSMMyM4RSFjttXvWLwaqWDQiVzPlAQRqnchhLfp0tg/edit#heading=h.kq7cdsyjkr1c - this is a bit old but still probably mostly relevant.

Principles

I'd like to frame this work with the following principles:

- choosing the set of P1 docs should be governed by two main factors:
  - how much traffic the pages get
  - our subjective view of how important a particular doc set is
- we should prioritize coherent sets of pages. So rather than pick individual pages, we should pick complete WebAPI interfaces, even if that's not optimal from a traffic perspective.
- we should aim to pick not much more than 1000 pages.

What we have in Web/API

There are 5596 pages under Web/API (https://wiki.developer.mozilla.org/en-US/docs/User:wbamberg/all-api-pages?raw&macros).

There are 1095 pages at the top level.

Fundamentally this documentation consists of two interleaved but semantically quite distinct documentation hierarchies.

- some top-level pages are "API overview pages", which give an overview of what Web/API calls an "API": this is an abstract collection of objects, methods, and events that maps roughly to a specification. For example, the Fetch API or the Geolocation API. Under these pages, there are guide pages for that API.
- some top-level pages are "Interface" or "Dictionary" pages, which provide reference documentation for concrete Web programming objects, like Request and Response. Under these pages are reference pages for the members that these objects contain, like Response.status.

API overview pages

We have 98 API overview pages (defined as top-level pages with a space in their name). Under these are (supposed to be) guide pages. In total we have 171 of these guide pages.

We also have the GroupData.json KumaScript macro, which is supposed to be a list of all the APIs we document. This includes the URL for the overview page and the list of interfaces and dictionaries that the API contains.

Ideally, there would be one entry in GroupData for every API overview page in Web/API. In practice there are only 87 objects in GroupData, and only 68 of these appear in the set of actual API overview pages.
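A minimal sketch of the comparison described above, assuming we have the top-level page names and the GroupData group names as plain lists. The sample names and the variable names are illustrative stand-ins, not the real data or the real GroupData.json layout:

```python
# Hypothetical sample of top-level page names under Web/API (not the real list).
top_level_pages = ["Fetch API", "Geolocation API", "Request", "Response", "Window"]

# Hypothetical stand-in for the group names listed in GroupData.
group_data_names = {"Fetch API", "Web Storage API"}


def is_api_overview(name: str) -> bool:
    """Heuristic from the text: an overview page has a space in its name."""
    return " " in name


overview_pages = {p for p in top_level_pages if is_api_overview(p)}
interface_pages = [p for p in top_level_pages if not is_api_overview(p)]

# Groups with a matching overview page, and the mismatches on each side.
matched = overview_pages & group_data_names
overview_without_group = overview_pages - group_data_names
group_without_overview = group_data_names - overview_pages

print(sorted(matched), sorted(overview_without_group), sorted(group_without_overview))
```

The two difference sets are the interesting output: they are the entries that would need fixing before GroupData could be used to define priorities.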

It's tempting to use the higher-level concept of APIs as a way to get a handle on the scale of the Web APIs - we could for example say "Fetch is a P1 API". But the data here contains enough errors and omissions that this is hard to do.

Interface and Dictionary pages

This leaves us trying to define priorities at the level of interfaces (and sometimes dictionaries). There are about 997 such pages directly under Web/API.

Analysing the interfaces

I've taken the hierarchy of pages under Web/API and added traffic data to it, to make this spreadsheet: https://docs.google.com/spreadsheets/d/1UTAQG3pSrdBD2tIXMWRSATkOSSj-dcuXnG7ux5pFtDk/edit#gid=538131955.

This has a row for every top-level page that's not an API overview page. In each row, it lists:

- interface traffic: how much traffic all the pages in this interface get
- interface page count: the number of pages in this interface (the top-level page plus its children)
- traffic-per-page: calculated as (interface traffic / interface page count)
- traffic percentage: calculated as (interface traffic / total traffic) * 100. Note that total traffic here is exclusive of traffic to "API overview" pages and their children.
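The derived columns can be sketched as follows. This is only an illustration of the two formulas; the `InterfaceRow` class and the sample rows are made up and do not come from the spreadsheet:

```python
from dataclasses import dataclass


@dataclass
class InterfaceRow:
    name: str
    traffic: int     # traffic across the top-level page and its children
    page_count: int  # the top-level page plus its children

    @property
    def traffic_per_page(self) -> float:
        return self.traffic / self.page_count


def traffic_percentage(row: InterfaceRow, total_traffic: int) -> float:
    """Share of total interface traffic, excluding API overview pages."""
    return (row.traffic / total_traffic) * 100


# Made-up rows, purely to exercise the formulas.
rows = [
    InterfaceRow("Alpha", 600, 3),
    InterfaceRow("Beta", 300, 10),
    InterfaceRow("Gamma", 100, 2),
]
total = sum(r.traffic for r in rows)

for r in rows:
    print(r.name, r.traffic_per_page, round(traffic_percentage(r, total), 2))
```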

With this sheet we can list the top 10 interfaces by size:

Window                   181
Document                 161
Element                  134
WebGLRenderingContext    116
GlobalEventHandlers       76
CanvasRenderingContext2D  68
WebGL2RenderingContext    68
RTCPeerConnection         67
HTMLMediaElement          66

If you sort by traffic-per-page, you can capture 77.52% of traffic by including the following interfaces, which contain 1113 pages:

EventTarget
FormData
FileList
MediaDevices
ElementCSSInlineStyle
Blob
MutationObserver
HTMLOrForeignElement
DOMString
Event
EventListener
DOMParser
File
XMLHttpRequest
WindowOrWorkerGlobalScope
History
URLSearchParams
Geolocation
USVString
Storage
FileReader
ChildNode
Body
CustomEvent
Element
Response
URL
KeyboardEvent
ParentNode
HTMLElement
Location
HTMLCanvasElement
WebSocket
console
NodeList
Crypto
Window
Document
NavigatorOnLine
WindowEventHandlers
HTMLAudioElement
Node
HTMLCollection
Request
HTMLTextAreaElement
ResizeObserver
Cache
HTMLDivElement
Clipboard
HTMLInputElement
DOMHighResTimeStamp
HTMLOptionElement
NavigatorLanguage
Headers
HTMLFormElement
AbortController
SubtleCrypto
HTMLTableCellElement
StorageEvent
ClipboardEvent
ScrollToOptions
CanvasRenderingContext2D
CanvasGradient
CryptoKey
MediaStreamConstraints
ImageData
MutationRecord
XMLSerializer
Navigator

This seems like a good initial proposal for P1 Web/API docs. It is entirely based on traffic though, so it would be worth scanning the other interfaces to see if we are missing any that we would like to include. Note though that we are already at 1100 pages, so any proposals to add new interfaces should be accompanied by a proposal to remove one :).
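One way to reproduce this kind of selection is sketched below. It assumes the cut-off is a page budget applied to interfaces ranked by traffic-per-page; the exact cut-off rule used for the list above is not stated, so treat the `page_budget` parameter and the sample data as assumptions:

```python
def select_by_traffic_per_page(interfaces, page_budget):
    """Pick whole interfaces, highest traffic-per-page first, within a page budget.

    `interfaces` is a list of (name, traffic, page_count) tuples.
    Returns (selected names, pages selected, percentage of traffic captured).
    """
    ranked = sorted(interfaces, key=lambda i: i[1] / i[2], reverse=True)
    selected, pages, traffic = [], 0, 0
    for name, iface_traffic, page_count in ranked:
        if pages + page_count > page_budget:
            break  # stop at the first interface that would exceed the budget
        selected.append(name)
        pages += page_count
        traffic += iface_traffic
    total_traffic = sum(t for _, t, _ in interfaces)
    return selected, pages, traffic / total_traffic * 100


# Made-up data: (name, traffic, page count).
sample = [("A", 500, 5), ("B", 300, 2), ("C", 150, 10), ("D", 50, 1)]
names, pages, captured = select_by_traffic_per_page(sample, page_budget=8)
print(names, pages, round(captured, 2))
```

Because whole interfaces are added or skipped, the page count can land slightly under the budget rather than exactly on it.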

Thanks Will! This is amazing research and very data-driven. Love it! I agree with the principles you've chosen.

After choosing p1 pages by traffic, the second principle "we should prioritize coherent sets of pages" resonates a lot with me and I think this is a point where the final list of pages might change a bit as we go. I agree that if we do that, we should think about what to remove from the list, too.

An attempt to cluster your list:

// Networking
AbortController
Body
Cache
Headers
Request
Response
WebSocket
XMLHttpRequest
XMLSerializer

// Canvas
CanvasGradient
CanvasRenderingContext2D
ImageData

// DOM
ChildNode
Blob
DOMParser
DOMHighResTimeStamp
DOMString
Element
ElementCSSInlineStyle
Node
NodeList
ParentNode
USVString

// Clipboard
Clipboard
ClipboardEvent

// Crypto
Crypto
CryptoKey
SubtleCrypto

// Events
CustomEvent
Event
EventListener
EventTarget
KeyboardEvent

// Files
File
FileList
FileReader

// HTML
FormData
HTMLAudioElement
HTMLCanvasElement
HTMLCollection
HTMLDivElement
HTMLElement
HTMLFormElement
HTMLInputElement
HTMLOptionElement
HTMLOrForeignElement
HTMLTableCellElement
HTMLTextAreaElement

// URLs
History
Location
URL
URLSearchParams

// Media
MediaDevices
MediaStreamConstraints

// Observers
MutationObserver
MutationRecord
ResizeObserver

// Fundamentals
Document
Navigator
NavigatorLanguage
NavigatorOnLine
Window
WindowEventHandlers
WindowOrWorkerGlobalScope

// Misc
Geolocation
ScrollToOptions
Storage
StorageEvent
console

Now, I guess you can imagine that each of these clusters is a work package, but they are incomplete because, for example, when you work on Canvas, you probably want to fix all of the Canvas API page structures instead of just the most trafficked ones. Maybe there are even dependencies on the lower-traffic pages, because it turns out that some canvas (mixin, dictionary) pages need splitting or merging. So my feeling is that, in that case, it makes sense to look at all interfaces of a cluster (see here for canvas) and make them all fit the correct recipes. Does that make sense? I guess I'm coming from a more holistic approach, thinking that no one wants to dig into the other half of an API cluster again after we fixed the first p1 part. That way we would also have good examples of whole API clusters that follow the newly defined and correct doc structures, which would shape our way forward into more API clusters we want to fix or document.

Well... re clusters. We already have "clusters", of a sort, they are what's defined in GroupData.json. But as said above, the data in GroupData is incomplete and inconsistent. We could fix all that before starting the WebAPI linting. Why though? It would take a bunch of time and it's not clear how it really supports the work of linting our docs. By just looking at the interfaces level, we can start fixing up individual pages without looking at fixing the higher-level organization. If fixing the higher-level organization of Web/API is a goal of this project, then fine, but that's definitely a change in scope (this US is scoped at 1 point, hilariously).

Of course we can also say "the clusters in GroupData are no good, we should invent new ones". But I'd be very careful here. Grouping things into categories is always really tricky: there are like 30% of obvious cases, another 50% of cases where categorization is very subjective, and 20% that you just have to lump into "Miscellaneous", which is terrible because then people have to look in two places to find something. In your example you've taken Canvas, but then actually pointed to GroupData (essentially) for the definition. But what about, say "Fundamentals"? what are all the things we should add to that, to have a "complete" cluster? Are, say, Service Workers fundamentals? And looked at from the point of view of a developer, what sorts of things are considered fundamental? I expect you'll get different answers from different people.

At least GroupData categories are backed by a real thing - the spec that defines the interfaces - rather than just the intuitions of a particular tech writer on a particular day.

I'm not exactly against defining new groups, especially higher-level ones than those in GroupData. But it's hard, and it takes time, and needs to be done as a complete project IMO.

no one wants to dig into the other half of an API cluster again after we fixed the first p1 part

Well...maybe. The thing about doing complete interfaces is that at least each subtree of the hierarchy is done at a time, and that seems like the most important thing.

But. I do think that higher-level abstractions are helpful here. So for example the "Fetch API" abstraction that unites Request, Response, Body, Headers. I think users probably think in terms of "Fetch" not in terms of the individual interfaces. So we could say: from the 70 interfaces in my list, look at the groups they belong to and list any interfaces that are omitted from the 70, and work out how many extra pages that would amount to. Another, related, thing is that in https://docs.google.com/document/d/1rHSMMyM4RSFjttXvWLwaqWDQiVzPlAQRqnchhLfp0tg/edit# we made a list of "P1 APIs": it would be good to get a sense of:

- which of these APIs are omitted completely from my 57 interfaces. Do we want to add them, or has the P1 definition changed since we wrote this?
- which of these APIs are only partially covered by my 57 interfaces, and do we want to add the remaining interfaces, so as to have complete APIs?

I've added a new field to the spreadsheet, which represents the group that the interface belongs to, as defined in GroupData.

Note that 252 top-level pages - a quarter of the total - are not assigned to a group at all. Also, 32 interfaces are listed under two - or more! - different APIs in GroupData. So I have had to pick the most likely option in those cases. We really, really, ought to clean this data.
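A check like the following could find both kinds of problem. It is a sketch only: the group-to-interfaces mapping is a simplified, hypothetical stand-in for GroupData.json, and all names are placeholders:

```python
from collections import defaultdict

# Hypothetical, simplified stand-in for GroupData: group name -> interfaces.
groups = {
    "Group A": ["IfaceOne", "IfaceTwo", "IfaceThree"],
    "Group B": ["IfaceFour", "IfaceOne"],
}

# Hypothetical list of top-level interface pages under Web/API.
top_level_interfaces = ["IfaceOne", "IfaceTwo", "IfaceThree", "IfaceFour", "IfaceFive"]

# Invert the mapping: interface -> every group that lists it.
membership = defaultdict(list)
for group, interfaces in groups.items():
    for interface in interfaces:
        membership[interface].append(group)

unassigned = [i for i in top_level_interfaces if i not in membership]
multi_assigned = {i: g for i, g in membership.items() if len(g) > 1}

print(unassigned)      # interfaces that belong to no group
print(multi_assigned)  # interfaces listed under more than one group
```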

Anyway, we can use this to see how much extra work it would be to lint complete groups. I've added another sheet, "groups". This contains a row for every group that's represented in our 57 interfaces.

Each row contains:

- name of the group
- number of interfaces belonging to that group that are in our 57 interfaces
- number of pages represented by that collection of interfaces
- how many more pages we would have to lint, to complete that group

For example, four of our interfaces, Request, Response, Body, Headers, belong to the "Fetch API" group and comprise 47 pages. And this is the totality of the Fetch API. But we'd need to add another 18 pages to complete the Geolocation API.
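The per-group columns can be derived roughly as below. This is a sketch with placeholder groups and page counts; `group_summary` and its argument shapes are assumptions, not the spreadsheet's actual formulas:

```python
def group_summary(groups, page_counts, selected):
    """Summarise each group that has at least one selected interface.

    groups: group name -> list of interface names
    page_counts: interface name -> number of pages (top-level page plus children)
    selected: set of interface names chosen as high-traffic
    """
    summary = {}
    for group, interfaces in groups.items():
        chosen = [i for i in interfaces if i in selected]
        if not chosen:
            continue  # only groups represented in the selection get a row
        selected_pages = sum(page_counts[i] for i in chosen)
        total_pages = sum(page_counts[i] for i in interfaces)
        summary[group] = {
            "selected_interfaces": len(chosen),
            "selected_pages": selected_pages,
            "pages_to_complete": total_pages - selected_pages,
        }
    return summary


# Placeholder data, not real page counts.
groups = {"Group A": ["One", "Two"], "Group B": ["Three", "Four"], "Group C": ["Five"]}
page_counts = {"One": 10, "Two": 5, "Three": 7, "Four": 20, "Five": 3}
selected = {"One", "Two", "Three"}

summary = group_summary(groups, page_counts, selected)
print(summary)
```

A group with `pages_to_complete` equal to 0 is already fully covered by the selection, as the text reports for the Fetch API.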

In total we can see that to complete all API groups included here would mean linting 977 more pages, or almost double the total. An especially problematic group is "HTML DOM", which would commit us to 395 more pages. And then there are groups like "Service Workers API", where we're proposing to lint just 8 pages but would need to add another 97 to complete the group.

On the other hand, we're at 55/61 pages in the "XMLHttpRequest" group, so it seems very worthwhile to finish this one...

I'm confused about what's in "DOM" and what's in "HTML DOM". In GroupData, Window is in "HTML DOM", not in "DOM". But in the DOM overview page, Window is listed as a "DOM interface". Sigh. Is this kind of categorization really useful to people?

Also a quick look reveals a lot of DOM interfaces that are obsolete.

So although I was reluctant to define new groups across the board, it might be worth thinking about reworking this area.

Thanks for your detailed comments!

Well... re clusters. We already have "clusters", of a sort, they are what's defined in GroupData.json. But as said above, the data in GroupData is incomplete and inconsistent. We could fix all that before starting the WebAPI linting. Why though?

Right, groups exist already. I was thinking more about how to proceed practically, workload-wise. I mean, we could just pick random interfaces and lint them, or we could cluster the things we identified as p1 and work on cluster after cluster. Maybe just linting whatever we come across in the p1 list is fine, though, as you say. And if that's the most practical way forward, let's do that.

I'm not exactly against defining new groups, especially higher-level ones than those in GroupData. But it's hard, and it takes time, and needs to be done as a complete project IMO.

Yeah. I thought of the clusters as practical slices to work from, not exactly aiming to re-invent groupdata. I'm not sure how useful the current groups are to our readers and as you say that is a higher level problem. I think the other areas (JS, for example) didn't suffer from these higher level issues, so maybe my (naive) hope was that we can tackle it somehow, but you are right in identifying this as its own project / user story. It is quite complex.

I've added a new field to the spreadsheet, which represents the group that the interface belongs to, as defined in GroupData.

Thanks, this is useful!

In total we can see that to complete all API groups included here would mean linting 977 more pages, or almost double the total.

Wow, this is too much indeed

And this is the totality of the Fetch API

But we'd need to add another 18 pages to complete the Geolocation API.

problematic group is "HTML DOM", which would commit us to 395 more pages

"Service Workers API", where we're proposing to lint just 8 pages but would need to add another 97 to complete the group.

we're at 55/61 pages in the "XMLHttpRequest" group, so it seems very worthwhile to finish this one.

This is amazingly useful to know for each API group! Thanks for making this analysis, Will! I agree that XHR, for example, is something I would finish in its totality given this data. Can we make the call for each current API group based on these numbers? For me, XHR would be in, in its totality; HTML DOM wouldn't be.

ServiceWorkers is really surprising here. I wonder why things are like that. Is the rest very poor/useless pages? Are we creating many pages (say for dictionaries, enums, mixins, etc.) for an API, but most people just read the main interface and/or method pages? So here I struggle to make a clear call on whether we should do ServiceWorkers in its totality, given my limited sense of the docs.

So although I was reluctant to define new groups across the board, it might be worth thinking about reworking this area.

I agree this needs work. It is worth splitting this out into its own "re-grouping" user story. I don't know how much it rabbit-holes into defining P1 API docs, but it seems that your list above is still useful and our current best bet on what we think the P1s are. There are two options, I guess:

- We could enhance the list by deciding for each API group whether we do things in totality or not. We would also decide with the total number of pages in mind (doubling to linting 2000 pages as P1 isn't practical).
- We could work through the list as is and bother about groups another day.

I wanted to briefly chime in here: I think the analysis is really good. I like the principles you've chosen for the selection. I itch a bit at wholly traffic-driven approaches and I like the way you've handled that.

I'm also pleased to see that these principles satisfied another interest I had, that the pages we selected represented a good range of page types, to ensure we get a mostly complete set of recipes. Based on the selection, this seems likely (even if we lost some pages on the margins; more on that in a second).

I also favor completing groups, since I suspect this will improve the completeness of our recipes by drawing in more long-tail pages. But I'm also reluctant to balloon the number of P1 pages. Maybe we could complete the groups which are already very-well covered (e.g., XMLHttpRequest, which is already at 90%), but not ones that aren't close (e.g., HTML DOM, at 47%).

Or if you really want completeness in one area (say the HTML DOM group) then you might let completeness cut both ways: Media Capture and Streams and Service Workers API's selected pages are less than 10% of their respective groups. If you dropped the entirety of those groups, you'd shed over 200 pages for reallocation elsewhere. Or you could do some combination: shedding groups that are poorly covered (say <10% coverage), completing groups that are well-covered (say >90% coverage), and letting the groups in between be incomplete.
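The combined rule suggested here can be written down as a small function. The 10% and 90% thresholds are the ones floated above; the function name and labels are illustrative, and the page counts in the usage lines are the ones quoted earlier in this thread:

```python
def decide(selected_pages: int, total_pages: int,
           shed_below: float = 0.10, complete_above: float = 0.90) -> str:
    """Classify a group by how much of it the traffic-based selection covers."""
    coverage = selected_pages / total_pages
    if coverage < shed_below:
        return "shed"
    if coverage > complete_above:
        return "complete"
    return "leave partial"


# Page counts quoted in the discussion above.
print(decide(55, 61))        # XMLHttpRequest: 55 of 61 pages
print(decide(363, 363 + 395))  # HTML DOM: 363 selected, 395 more to complete
print(decide(8, 8 + 97))     # Service Workers API: 8 selected, 97 more to complete
```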

Thanks for the comments! Then here's my suggestion. We lint complete groups, with three exceptions, "DOM", "DOM Events", and "HTML DOM". Reasons for excepting these are:

- taken together these three groups contain 1428 pages. We can't do them all unless we lint no other groups, and even then we go 50% over budget. But these groups include some basic interfaces that we must consider P1s. So we can't exclude them.
- the definition of these groups is weird, possibly unhelpful, certainly inaccurate, and contains many obsolete interfaces. So we should fix them up before leaning too heavily on them.

So for these three groups only, we define a subset of the interfaces as P1s, based on traffic. For the other groups, we only lint complete groups. That means we either expand the set of interfaces in a group, so as to complete the group, or we remove the set, so as to omit the group completely.

Also, we will throw in the 5 interfaces (56 pages) that are not currently assigned to any group.

Given those principles I've added yet another sheet "P1 docs" that lists my concrete proposal, and that I'll copy here:

Group            | High-traffic interfaces | High-traffic pages | Low-traffic pages | Total pages | Status
-----------------------------------------------------------------------------------------------------------
(unassigned)     |        5                |         56         |       n/a         |      56     |
Canvas API       |        1                |         69         |        42         |     111     | complete
Clipboard API    |        1                |          5         |         7         |      12     | complete
DOM              |       14                |        400         |       179         |     400     | partial
DOM Events       |        2                |         20         |        71         |      20     | partial
Fetch API        |        4                |         47         |         0         |      47     | complete
File API         |        4                |         43         |         5         |      48     | complete
HTML DOM         |       13                |        363         |       395         |     363     | partial
URL API          |        2                |         32         |        24         |      56     | complete
Web Storage API  |        1                |          7         |         1         |       8     | complete
Websockets API   |        1                |         18         |        10         |      28     | complete
XMLHttpRequest   |        3                |         55         |         6         |      61     | complete
-----------------------------------------------------------------------------------------------------------
Total            |                         |       1115         |                   |    1210     |

This lists P1 docs by group:

"High-traffic interfaces" - the number of interfaces in that group that we identified as high-traffic
"High-traffic pages" - the number of pages in those high-traffic interfaces
"Low-traffic pages" - the number of extra pages we'd have to add to complete the group
"complete/partial" - whether we are actually going to add those low-traffic pages for this group.

As you can see from a comparison with the "groups" sheet, I've removed several groups for which many pages were missing, or that just didn't seem that important.

This gives us 1210 pages as P1.
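The totals can be checked with a few lines of code. The figures are copied from the table above; the only assumption is that the "n/a" low-traffic count for the unassigned interfaces is treated as 0:

```python
# (group, high-traffic pages, low-traffic pages, complete?) copied from the table.
# Low-traffic pages only count towards P1 when the group is marked "complete".
p1_groups = [
    ("(unassigned)", 56, 0, False),  # "n/a" low-traffic pages treated as 0
    ("Canvas API", 69, 42, True),
    ("Clipboard API", 5, 7, True),
    ("DOM", 400, 179, False),
    ("DOM Events", 20, 71, False),
    ("Fetch API", 47, 0, True),
    ("File API", 43, 5, True),
    ("HTML DOM", 363, 395, False),
    ("URL API", 32, 24, True),
    ("Web Storage API", 7, 1, True),
    ("Websockets API", 18, 10, True),
    ("XMLHttpRequest", 55, 6, True),
]

high_traffic_total = sum(high for _, high, _, _ in p1_groups)
p1_total = sum(high + (low if complete else 0) for _, high, low, complete in p1_groups)

print(high_traffic_total, p1_total)  # 1115 1210
```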

I'm really happy with this proposal. I think all the groups listed here are important Web APIs, and deserve to be considered P1.

Other things to come out of this work:

- file an issue to update GroupData so all interfaces belong to a group
- file an issue to update GroupData so no interfaces belong to more than one group
- file issues to remove some obsolete interfaces
- think about improving the DOM/HTML group
- think about defining higher-level categorizations for the Web API docs as a whole (imagine having a https://developer.mozilla.org/en-US/docs/Web/API top-level page split up by: "Graphics", "Media", "Debugging", "Network" or something like that...)

mdn / sprints

U - Define P1 WebAPI docs #3327