w3c / webref

Machine-readable references of terms defined in web browser specifications
https://w3c.github.io/webref/
MIT License
295 stars 73 forks source link

Export WebIDL fragments #3

Closed dontcallmedom closed 6 years ago

dontcallmedom commented 6 years ago

To help with https://github.com/GoogleChromeLabs/webidl-diff/blob/dataCollection/data/FetchSpecRunner-WebSpec-journal.js and imports into web-platform-tests

Ideally: one file per spec, whose name match the spec shortname

mdittmer commented 6 years ago

@dontcallmedom looks like there's great progress here! I tried running the crawler on one spec to get a feel for things, and saw that there is now an ouput_dir/idl directory. Is this ready to be included in updates to reffy-reports?

mdittmer commented 6 years ago

@lukebjerring FYI: Once this issue is resolved, we should be able to pull per-spec Web IDL fragments from reffy-reports in WPT (and anywhere else that needs them).

dontcallmedom commented 6 years ago

the idl fragments are now published at https://github.com/tidoust/reffy-reports/tree/master/whatwg/idl and will get updated on a daily basis; there are probably still some gaps and additional QA needed.

lukebjerring commented 6 years ago

Very cool - we can script a PR for replacing the web-platform-tests/interfaces with these to get a quick snapshot of the differences.

lukebjerring commented 6 years ago

Quick diff of GoogleCloudLabs/webidl-diff export vs tidoust/reffy-reports

EDIT: w3c diff: https://github.com/lukebjerring/web-platform-tests/compare/idl-file-updates~2...lukebjerring:idl-file-updates~1?expand=1

whatwg diff: https://github.com/lukebjerring/web-platform-tests/compare/idl-file-updates~2...lukebjerring:idl-file-updates?expand=1

lukebjerring commented 6 years ago

@foolip - you're one of the best-equipped to summarize and major red flags in the diff. Would you be able to take a look?

foolip commented 6 years ago

Overall this looks great, and clearly the overlap is so big that it doesn't really make sense to maintain two pipelines that both achieve roughly the same outcome, we should pick one and fix the bugs :)

From a quick glance it looks like most of the differences should be whitespace or because of spec changes that have happened in between the scraping of the two different tools. Interesting this that aren't that:

2dcontext.idl, eventsource.idl, webmessaging.idl, webstorage.idl and workers.idl: @dontcallmedom where were these extracted from, and how were they split into these files? Any naive approach would results in a single html.idl file, which is actually what I think we should have.

DOM-Parsing.idl fixed the bug with serializeAsCDATA appearing in the webidl-diff tooling.

appmanifest.idl: @dontcallmedom how was the file name chosen? It should be extracted from https://w3c.github.io/manifest/, so do you look for a TR version and use just the name from that? Same question about csp.idl, generic-sensor.idl and a few others.

cssom.idl: reffy correctly deletes the _camel_cased_attribute bits. @dontcallmedom, how was that achieved?

custom-elements.idl and shadow-dom.idl are IDL file I think we don't want to include in WPT, as the specs have been folded into WHATWG DOM+HTML.

dom.idl: rather many changes, more than seem like they could be explained by the time between scraping. @dontcallmedom, what version of the spec was scraped for this?

selectors-api.idl seems to be from the very dated https://www.w3.org/TR/selectors-api/ and shouldn't be included in WPT.

spec.idl: @dontcallmedom seems like a bug in the file naming logic?

tidoust commented 6 years ago

@foolip @lukebjerring, I believe you're running the diff against the wrong report...

As explained in https://tidoust.github.io/reffy-reports/, there are 3 different reports in the repository:

  1. a W3C-centric report based on Editor's Drafts
  2. a W3C-centric report based on latest published versions in /TR/
  3. a WHATWG-centric report

The report you probably want to use is 3, and that's the one @dontcallmedom suggested in https://github.com/tidoust/reffy-reports/issues/3#issuecomment-378167529 (reffy-reports/whatwg/idl), but you seem to have used 1 (reffy-reports/w3c/idl) in https://github.com/lukebjerring/web-platform-tests/commit/33985ec2c90ab6827c0fbc87e723082e5cdc8870.

This explains most of the diffs you're wondering about:

FWIW, the list of specs common to all reports is defined in specs-common.json. The WHATWG-centric report contains these specs, plus those defined in specs-whatwg.json. The W3C-centric report contains the same common specs, plus those defined in specs-w3c.json.

The starting point is indeed the URL in /TR/, and the shortname is used to name the IDL file, which explains why we end up with appmanifest.idl and generic-sensor.idl. The version number is dropped from the shortname, so csp.idl and not csp3.idl. The getShortname function holds that logic. It can certainly be modified and it needs to be fixed in any case, since the spec.idl file you mention shows it does not really generate the name one would expect on non /TR/ URLs...

For cssom.idl, Reffy extracts the IDL from the IDL Index section when it exists. That section does not have the _camel_cased_attribute bits. I do not know how that was achieved in the spec itself, but that's where the magic is...

selectors-api.idl could indeed be removed from specs-common.json.

foolip commented 6 years ago

@lukebjerring, could you create a new diff with the WHATWG-centric report? There may still be some odd things remaining, but hopefullt that'll resolve most of the issues.

@tidoust, I've sent https://github.com/tidoust/reffy/pull/99, but am also curious about how the list of specs is maintained? I maintain a list for https://foolip.github.io/day-to-day/ using a combination of specref and manual additions, and it would be nice to converge on a single list of "specs that matter" for purposes like these.

tidoust commented 6 years ago

@foolip, the list is maintained manually. We have a small tool that reports on specs we may have missed, but that's all for now.

Also, the list mostly contains specs that define some IDL content for now. For instance, many CSS specs are missing. There is a priori no problem adding them, it's just that Reffy does not really know how to analyse them and report useful info (right now, it will report that these specs may have a problem because it cannot extract any IDL...).

foolip commented 6 years ago

Thanks @tidoust! @tabatkins @jstenback FYI that this is another list of specs that's being maintained by hand. Since the needs are slightly different for each list maybe fixing the duplication "problem" isn't warranted, dunno.

lukebjerring commented 6 years ago

Updated the diff link(s) above. Might have to re-run the up-to-date pipeline to reduce noise, but PTAL

foolip commented 6 years ago

https://github.com/lukebjerring/web-platform-tests/commit/14b4e1051af54874bc36618dc0d48bcb49e7b698 looks good. @lukebjerring, does that also delete any files that aren't in reffy, so we can see if we've found any specs reffy has not?

lukebjerring commented 6 years ago

No; it only copies from whatwg/idl into that dir, so you would have to take the inverse of affected files vs existing files to see what we have that they don't.

foolip commented 6 years ago

With https://github.com/lukebjerring/web-platform-tests/commit/14b4e1051af54874bc36618dc0d48bcb49e7b698 checked out:

$ comm -23 <(find interfaces -name '*.idl' | sort) <(git diff --name-only HEAD~~ | sort)
interfaces/animation-worklet.idl
interfaces/aom.idl
interfaces/background-fetch.idl
interfaces/BackgroundSync.idl
interfaces/battery.idl
interfaces/budget-api.idl
interfaces/compat.idl
interfaces/cookie-store.idl
interfaces/cors-rfc1918.idl
interfaces/css-animations.idl
interfaces/css-conditional.idl
interfaces/css-device-adapt.idl
interfaces/css-fonts.idl
interfaces/css-layout-api.idl
interfaces/css-masking.idl
interfaces/css-transitions.idl
interfaces/dedicated-workers.idl
interfaces/deviceorientation.idl
interfaces/feature-policy.idl
interfaces/filter-effects.idl
interfaces/geolocation-sensor.idl
interfaces/InputDeviceCapabilities.idl
interfaces/keyboard-lock.idl
interfaces/media-capabilities.idl
interfaces/mediacapture-image.idl
interfaces/mediacapture-main.idl
interfaces/mediacapture-output.idl
interfaces/mediacapture-record.idl
interfaces/netinfo.idl
interfaces/picture-in-picture.idl
interfaces/remoteplayback.idl
interfaces/ResizeObserver.idl
interfaces/scroll-animations.idl
interfaces/sensors.idl
interfaces/ServiceWorker.idl
interfaces/shape-detection-api.idl
interfaces/touchevents.idl
interfaces/webappsec-credential-management.idl
interfaces/webappsec-csp-embedded.idl
interfaces/webappsec-csp.idl
interfaces/webappsec-referrer-policy.idl
interfaces/webappsec-secure-contexts.idl
interfaces/webappsec-subresource-integrity.idl
interfaces/web-audio-api.idl
interfaces/web-bluetooth.idl
interfaces/webidl.idl
interfaces/web-midi-api.idl
interfaces/web-nfc.idl
interfaces/webrtc-pc.idl
interfaces/web-share.idl
interfaces/webusb.idl
interfaces/webxr.idl

Seems like there are quite many files missing in reffy, or some other mistake. @tidoust, do you know why interfaces/css-fonts.idl isn't included, for example?

tidoust commented 6 years ago

@foolip, thanks for the diff. I had a look at it. Answer about css-fonts inline...

Specs I'm not clear what to do with:

Specs we have under a different name in Reffy, because IDL filenames are named after the /TR/ shortname:

Then there are specs we have and for which Reffy failed to generate IDL files. Problem is Reffy only generates these files when the IDL is valid, and the IDL parser reports some errors for:

We should probably rather save the IDL file no matter what. I created https://github.com/tidoust/reffy/issues/103 to track this.

I added all other missing specs to Reffy (adding specs developed in the WICG and other CGs in particular). Local tests suggest things work fine, but note that some specs will appear under a different name for the same reason as above:

Also, note some of the specs I added have invalid IDL, so IDL files won't appear for now either:

NB: The new IDL files may not appear immediately because the server on which Reffy runs daily to update this repo does not seem to be willing to pick up the latest version of the code. That's in the hands of @dontcallmedom...

foolip commented 6 years ago

cookie-store: I see https://github.com/WICG/cookie-store but where is the IDL defined?

Ah, far from all of the stuff in interfaces/ is created from the webidl-diff tooling, it's a mix still. This one doesn't have a spec yet, the IDL is used in a tentative test added by @costan in https://github.com/w3c/web-platform-tests/pull/9455

dedicated-workers: am I right to think that this is defined in HTML?

Yep, https://html.spec.whatwg.org/multipage/workers.html#the-workerglobalscope-common-interface and other bits.

deviceorientation: no longer maintained, so not included in Reffy, but I guess we could add it nevertheless.

Yep. https://w3c.github.io/deviceorientation/spec-source-orientation.html is the only thing that defines this at all, and I think it should be tested in WPT. If we don't get it from Reffy, then manually maintained.

@Honry listed a bunch of the name mismatch issues in https://github.com/w3c/web-platform-tests/issues/9937 and it's not clear what to do about it. Is the idea that these mismatches will just remain forever, or should the next TR publication use a new name of the github.io repos be renamed?

Then there are specs we have and for which Reffy failed to generate IDL files. Problem is Reffy only generates these files when the IDL is valid

Sent PR and create issue: https://github.com/w3c/csswg-drafts/pull/2661 https://github.com/w3c/csswg-drafts/issues/2662

css-conditional: partial interface CSSRule: Missing semicolon after interface

Tried to fix in source but it looks OK already? https://github.com/w3c/csswg-drafts/blob/0f376507e8aa827d4dbb65a76ade819cfbb73657/css-conditional-3/Overview.bs#L819

ResizeObserver: callback ResizeObserverCallback: Unterminated callback

Also seems OK already: https://github.com/WICG/ResizeObserver/blob/351d5d40e7f30e26fcefdb31db5d1bc34474c1de/index.bs#L176

web-bluetooth: dictionary BluetoothPermissionData: Required member must not have a default

Sent https://github.com/WebBluetoothCG/web-bluetooth/pull/395.

Thanks for all of the investigation @tidoust! I think we should be getting pretty close to parity with what we already have in interfaces/. In addition, I wonder what process needs to put in place so that Reffy discovers new specs as they appear, without requiring people to know that Reffy exists and needs to be updated. I have a setup in https://github.com/foolip/day-to-day which while fragile does discover new specs without my intervention. Example. Would you be interested in something similar for Reffy, or what's your ideal state of affairs on spec list maintenance?

tidoust commented 6 years ago

Thanks for following-up with PRs to fix the WebIDL issues!

For css-conditional, that's a bug on our side. For some reason, the W3C API does not manage to extract the URL of the Editor's Draft, so Reffy parses the spec published in /TR/ instead, which has the problem. At a minimum, Reffy should warn when that happens (I created https://github.com/tidoust/reffy/issues/104 to track this). I'll also investigate why the API does not know anything about the ED.

For ResizeObserver, that was fixed in index.bs 3 weeks ago but index.html hasn't been re-generated since early 2017: https://github.com/WICG/ResizeObserver/blob/master/index.html

@Honry listed a bunch of the name mismatch issues in w3c/web-platform-tests#9937 and it's not clear what to do about it. Is the idea that these mismatches will just remain forever, or should the next TR publication use a new name of the github.io repos be renamed?

I'm not entirely clear what you mean with your last sentence. There are multiple places where naming rules can change. In Reffy, we could perhaps be persuaded to switch to the github.io repos when that's possible. We used /TR/ shortnames because these are stable once created and there are cases where it's not necessarily clear how to extract a useful name from the GitHub repo (e.g. https://webassembly.github.io/spec/js-api/index.html). If the IDL files are useful for WPT, we're definitely interested to converge on something that works for you in any case.

It seems valuable to align the repo names with the /TR/ shortnames. Once a /TR/ shortname has been chosen, it mostly cannot be changed. But it should still be easy to rename GitHub repos as needed. And we should encourage the reuse of the repo name for the /TR/ shortname when possible.

In addition, I wonder what process needs to put in place so that Reffy discovers new specs as they appear, without requiring people to know that Reffy exists and needs to be updated. I have a setup in https://github.com/foolip/day-to-day which while fragile does discover new specs without my intervention. Example. Would you be interested in something similar for Reffy, or what's your ideal state of affairs on spec list maintenance?

Give us a bit of time to look into it, but I'm both interested by mechanisms to automate the discovery of new specs, as well as by discussions to reduce/align the sources that contain info about specs (Specref, Reffy, your day-to-day repository, we also maintain a list with links to CanIUse and platform status platforms in our roadmap framework, etc.).

foolip commented 6 years ago

It seems valuable to align the repo names with the /TR/ shortnames. Once a /TR/ shortname has been chosen, it mostly cannot be changed. But it should still be easy to rename GitHub repos as needed. And we should encourage the reuse of the repo name for the /TR/ shortname when possible.

Perhaps a good case to try this with is webrtc vs. webrtc-pc. There's now also https://www.w3.org/TR/webrtc-stats/, and I don't think the WG would be too keen on renaming the main spec back to just webrtc.

I wonder how the scraping works. If you start at a /TR/ link, do you discover the latest version only if it's linked from the TR, and do you pick a name based on the original TR link, or a TR link in the final spec from which IDL is extracted?

tidoust commented 6 years ago

Perhaps a good case to try this with is webrtc vs. webrtc-pc. There's now also https://www.w3.org/TR/webrtc-stats/, and I don't think the WG would be too keen on renaming the main spec back to just webrtc.

Right. To clarify, I'm not saying that /TR/ shortnames are better names. I'm saying they're more stable. Process-wise, the WG cannot easily change the shortname of the published spec: it will remain webrtc. Whereas the WG is entirely free to rename the GitHub repository whenever it wants. Whether webrtc-pc is a better repo name than webrtc is a different matter :)

I wonder how the scraping works. If you start at a /TR/ link, do you discover the latest version only if it's linked from the TR

Mostly. From the /TR/ link, Reffy gets the link to the Editor's Draft from the W3C API. The data returned by that API is created/updated when a spec is published by scraping the links it contains. In a few cases, the data is curated or added by hand afterwards. Typically, the link to the Editor's Draft was not extracted a few years ago, which explains why the URL of the ED was not returned for the CSS Conditional Rules spec (this has now been fixed, and Reffy correctly extracts the IDL from that spec). The URL of the ED may also be added to the database used by the W3C API afterwards even when the /TR/ document does not link to it.

and do you pick a name based on the original TR link, or a TR link in the final spec from which IDL is extracted?

The name is based on the original TR link. But would that make any difference?

foolip commented 6 years ago

The name is based on the original TR link. But would that make any difference?

If the ED doesn't link to TR at all is the case I was considering. Maybe this doesn't happen.

tabatkins commented 6 years ago

It does happen before FPWD publication.

tidoust commented 6 years ago

Right, all specs start as Editor's Drafts which don't link to a document published on TR (or worse, which link to a document published on TR that does not exist yet).

Summarizing, there are two open questions here for W3C specs:

  1. Whether to use TR or ED URLs as input to Reffy. Reffy gathers additional info about each spec from the W3C API before it fetches the spec, and the W3C API only knows about the TR shortname, so I need to stick to the TR URL for now. I'm happy to consider another approach later on.
  2. Whether to base the name of the IDL files generated by Reffy on the TR shortname, or on the GitHub repo name. Here, I'm fine using the name of the GitHub repo if that seems better from a WPT perspective. Is that what you'd prefer to have, @foolip?
foolip commented 6 years ago

I suppose that I'd prefer whichever rule matches most of the existing files in wpt/interfaces/ and wpt/* dirnames. Is https://github.com/tidoust/reffy-reports/issues/3#issuecomment-387550183 still the complete list of mismatched names? In that first long list the bits on the right seem to be the names used in WPT more often, and I think those are the TR names? If so, maybe we should try that and see if we get mismatches that cause trouble? (The webrtc one is particularly annoying, but I guess one-off fixes would also be annoying.)

dontcallmedom commented 6 years ago

My understanding had been that root directories in wpt are supposed to match TR names (although the actual rule is bit fuzzier than that).

A possible additional issue with using github repo names is that some groups (notably css) many specs in a single repository.

Which ever approach we settle on needs to take into account:

On the W3C side, we could try to get more oversight on the naming of github repos in the w3c github organization (which right now is mostly a free for all) if that helps keeping better sync across these various projects.

foolip commented 6 years ago

Let's just try your current scheme for naming IDL files.

How often does the pipeline run, and would it be possible to save enough data about where the snippets came from to add boilerplate at the top of IDL files in wpt/interfaces/ like Luke has been doing?

tidoust commented 6 years ago

How often does the pipeline run

Once per day for now.

would it be possible to save enough data about where the snippets came from to add boilerplate at the top of IDL files in wpt/interfaces/ like Luke has been doing?

Now done in Reffy but the auto-update script we have on the server seems to be broken (@dontcallmedom?), so note the boilerplate does not yet appear in the report.

tidoust commented 6 years ago

@foolip, @lukebjerring Note IDL files published in the report now contain the boilerplate. Let me know if you need anything else.

foolip commented 6 years ago

Awesome, thanks @tidoust!

tidoust commented 6 years ago

Closing this issue now. Feel free to raise more specific issues as needed.

foolip commented 6 years ago

Thank you, @tidoust! @dontcallmedom, is the updating run as a daily Travis cron job? The cadence of @dontcallmedom-bot makes it look like that, just checking.