hexops / vecty

Vecty lets you build responsive and dynamic web frontends in Go using WebAssembly, competing with modern web frameworks like React & VueJS.
BSD 3-Clause "New" or "Revised" License

Switch from MDN-doc-scraper to custom open data format #136

Open slimsag opened 6 years ago

slimsag commented 6 years ago

Right now our generated packages (elem, prop, etc) are created by scraping the MDN documentation website and pulling relevant information. At first, this gave us good coverage of the entire DOM API and worked well, but now it is a waste of time and gives us very inaccurate results.

MDN Background

As time went on, I noticed that a significant portion of the pages on the MDN are in very inconsistent formats which makes the docs extremely hard to scrape for information accurately:

To resolve the above issues, I spent upwards of 80 hours contributing to the MDN. I found the best layouts on the most popular MDN pages, and ensured other pages followed that same style consistently in page layout and wording.

Unfortunately, this was mostly in vain. The MDN is a bit like the Wild West: anyone can make changes if they have a GitHub account, without any peer review(!), and anyone can revert changes without any peer review(!).

Although I made roughly 85 pages use a layout consistent with the rest of the MDN and clearly documented my changes as doing this, almost 18 of those pages were reverted by another MDN contributor without any reason mentioned in the history. I tried to reach this contributor via the IRC channel and mailing lists, as he had no public contact information, but after several weeks I still had no way to contact him.

With no way to contact this contributor, I made attempts to change the page layout on a few of those pages again and directly mentioned in the changelog that his revert made the page not follow the consistent style used on other popular MDN pages, and that I was trying to adopt a consistent MDN format. Again, the changes were reverted.

Better approach

The MDN's content license is permissive enough for us to use their documentation in our godocs, and so I think we should use an alternative method of generating our packages from (initially) the MDN documentation.

What this would look like is creating a separate Vecty repository, maybe github.com/vecty/webdoc with some type of file format (YAML, XML, etc) that documents individual web APIs (objects, data types, function signatures, docstrings, etc) for use in our generators.
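As a rough illustration of what one entry in such a format could carry, here is a minimal sketch. Everything in it is an assumption: the issue does not define a schema, so the `Interface` and `Property` types, their field names, and the sample docstrings are hypothetical, and JSON is used only because it ships with the Go standard library (the same shape would work in YAML or XML).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Property describes one property of a web API object.
// Hypothetical schema: none of these field names come from the issue.
type Property struct {
	Name string `json:"name"`
	Type string `json:"type"`
	Doc  string `json:"doc"`
}

// Interface describes one web API object, such as HTMLAnchorElement.
type Interface struct {
	Name       string     `json:"name"`
	Tag        string     `json:"tag,omitempty"` // HTML tag name, if any
	Doc        string     `json:"doc"`
	Properties []Property `json:"properties"`
}

// parseInterface decodes a single JSON-encoded Interface entry.
func parseInterface(data []byte) (Interface, error) {
	var iface Interface
	err := json.Unmarshal(data, &iface)
	return iface, err
}

// sample is an illustrative entry; the docstrings are placeholders.
const sample = `{
  "name": "HTMLAnchorElement",
  "tag": "a",
  "doc": "Illustrative docstring for the anchor element.",
  "properties": [
    {"name": "href", "type": "string", "doc": "Illustrative docstring for href."}
  ]
}`

func main() {
	iface, err := parseInterface([]byte(sample))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s <%s>: %d documented properties\n",
		iface.Name, iface.Tag, len(iface.Properties))
}
```

A generator could then range over decoded entries instead of scraping pages, and a contributor could fix a docstring by editing one entry.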

This is similar in concept to how Glow, a Go OpenGL binding generator I have worked on, operates.

pdf commented 6 years ago

Any thoughts on perhaps using WhatWG documentation instead? Producing a locally maintained custom representation for an evolving specification seems like a pretty serious commitment over time.

slimsag commented 6 years ago

Yes, that is something to look into.

dmitshur commented 6 years ago

That MDN experience sounds awful. It's not acceptable to revert a reasonable change without giving a reason/rationale, much less to do so without providing any contact information or due process. I'll definitely take it into account in the future and be unlikely to contribute myself.

On the topic of alternative sources, is there anything like WebIDL available for these APIs?

When I was thinking about writing a Go wrapper for WebGL 2 API, the best source of its API I found was the specification, expressed as WebIDL. E.g., see webgl2.idl.

pdf commented 6 years ago

On the topic of alternative sources, is there anything like WebIDL available for these APIs?

I had a brief look the other day, and AFAICT since HTML5, IDL definitions are pretty sparse, and don't come at all close to covering the full spec.

dmitshur commented 6 years ago

If not WebIDL, what do browsers use as reference to implement these things?

pdf commented 6 years ago

Luck ;-). I also wondered this, and went looking for test suites; what I found looked like a total shambles of hand-written stuff.

slimsag commented 6 years ago

If we can find something that outlines the API signatures (symbol names and data types), then we can use a more additive approach (i.e. to ensure we have good coverage of the ever-changing API).

dmitshur commented 6 years ago

Relevant news: https://blogs.windows.com/msedgedev/2017/10/18/documenting-web-together-mdn-web-docs/. /cc @slimsag

Hopefully the consolidation results in improvements to quality and consistency of Web docs.

slimsag commented 6 years ago

Since the Blink repository is over 5 GB and cloning it takes quite a while, I've created a subrepo that will host just the *.idl files from the blink repository and created a little Go script to update the repository. Others (/cc @myictv ) may find this useful.

https://github.com/vecty/blink-idl

slimsag commented 6 years ago

@myitcv (spelled the name wrong)

myitcv commented 6 years ago

@slimsag thanks very much for the cc

pdf commented 6 years ago

That's a good find, much better coverage than anything I found.

slimsag commented 6 years ago

Yeah my research basically uncovered that those blink IDL files would provide:

  1. All of the JS type names (HTMLBodyElement, equivalents for SVG, etc).
  2. All of their properties (href for an HTMLAnchorElement, title, etc).
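To make the two points above concrete, here is a rough sketch of pulling type names and attributes out of IDL text. It is regex-based and deliberately naive, not a real WebIDL parser: it ignores mixins, partial interfaces, operations, and anything with nested braces. The sample IDL is a simplified snippet written for this example rather than a copy of a Blink file, and all Go names are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// interfaceRe matches "interface Name [: Parent] { body };".
	interfaceRe = regexp.MustCompile(`(?s)interface\s+(\w+)\s*(?::\s*(\w+)\s*)?\{(.*?)\}\s*;`)
	// attributeRe matches "[ExtAttrs] [readonly] attribute Type name;".
	attributeRe = regexp.MustCompile(`(?m)^\s*(?:\[[^\]]*\]\s*)?(readonly\s+)?attribute\s+(.+?)\s+(\w+)\s*;`)
)

// Attribute is one attribute declared on an interface.
type Attribute struct {
	Name     string
	Type     string
	ReadOnly bool
}

// Interface is one IDL interface with its declared attributes.
type Interface struct {
	Name       string
	Parent     string // empty if the interface does not inherit
	Attributes []Attribute
}

// parseIDL extracts interfaces and their attributes from IDL source text.
func parseIDL(src string) []Interface {
	var out []Interface
	for _, m := range interfaceRe.FindAllStringSubmatch(src, -1) {
		iface := Interface{Name: m[1], Parent: m[2]}
		for _, a := range attributeRe.FindAllStringSubmatch(m[3], -1) {
			iface.Attributes = append(iface.Attributes, Attribute{
				Name:     a[3],
				Type:     a[2],
				ReadOnly: a[1] != "",
			})
		}
		out = append(out, iface)
	}
	return out
}

// sample is a simplified, illustrative IDL snippet.
const sample = `
interface HTMLAnchorElement : HTMLElement {
    attribute DOMString target;
    attribute DOMString download;
    readonly attribute DOMTokenList relList;
};
`

func main() {
	for _, iface := range parseIDL(sample) {
		fmt.Printf("%s (parent %s)\n", iface.Name, iface.Parent)
		for _, a := range iface.Attributes {
			fmt.Printf("  %s %s readonly=%v\n", a.Type, a.Name, a.ReadOnly)
		}
	}
}
```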

But it's not all perfect. We would need:

  1. Documentation for those types and properties (like what the MDN has, but preferably in Go style because right now we do a lot of hacks to reword MDN documentation to match Go style).
  2. Some form of mapping from JS type name (HTMLBodyElement) -> HTML tag name (body).
  3. A way to actually parse those IDL files (a language with inheritance, etc. in itself).
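For item 2, one possible shape is a naming heuristic plus a hand-maintained table of exceptions, sketched below. This is an assumption about how the mapping could work, not something the generators do today: `overrides` and `tagsFor` are hypothetical names, the table is intentionally incomplete, and the heuristic will produce wrong answers for interfaces that are not tied to a single tag (HTMLMediaElement would map to a nonexistent "media" tag).

```go
package main

import (
	"fmt"
	"strings"
)

// overrides lists interfaces whose tag names cannot be derived from
// the interface name. Illustrative and deliberately incomplete.
var overrides = map[string][]string{
	"HTMLAnchorElement":    {"a"},
	"HTMLParagraphElement": {"p"},
	"HTMLImageElement":     {"img"},
	"HTMLHeadingElement":   {"h1", "h2", "h3", "h4", "h5", "h6"},
	"HTMLTableCellElement": {"td", "th"},
}

// tagsFor returns the HTML tag names for a JS interface name, or nil
// if none can be determined. One interface may back several tags.
func tagsFor(iface string) []string {
	if tags, ok := overrides[iface]; ok {
		return tags
	}
	if !strings.HasPrefix(iface, "HTML") || !strings.HasSuffix(iface, "Element") {
		return nil
	}
	name := strings.TrimSuffix(strings.TrimPrefix(iface, "HTML"), "Element")
	if name == "" {
		return nil // HTMLElement is the base interface, not a tag.
	}
	// Naive fallback: HTMLBodyElement -> "body".
	return []string{strings.ToLower(name)}
}

func main() {
	for _, iface := range []string{"HTMLBodyElement", "HTMLAnchorElement", "HTMLHeadingElement", "HTMLElement"} {
		fmt.Printf("%s -> %v\n", iface, tagsFor(iface))
	}
}
```

Returning a slice rather than a single string matters because some interfaces, such as HTMLHeadingElement, are shared by several tags.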

I think the IDL files will be good for validating that we cover the entire (moving) spec going forward. But not good for producing the actual documentation, etc. This will probably be some mixture of a scraper like what we have today for the MDN and manual work -- I'm not sure.
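The validation idea can be sketched as a plain set difference: names found in the IDL files minus names the generated packages already cover. The `missing` helper and the input lists below are hypothetical; in practice the first list would come from the IDL files and the second from the generator's output.

```go
package main

import (
	"fmt"
	"sort"
)

// missing returns the names in spec that are absent from generated,
// in sorted order.
func missing(spec, generated []string) []string {
	have := make(map[string]bool, len(generated))
	for _, name := range generated {
		have[name] = true
	}
	var out []string
	for _, name := range spec {
		if !have[name] {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Illustrative inputs only.
	spec := []string{"HTMLAnchorElement", "HTMLBodyElement", "HTMLDialogElement"}
	generated := []string{"HTMLAnchorElement", "HTMLBodyElement"}
	fmt.Println(missing(spec, generated)) // prints [HTMLDialogElement]
}
```

Run as part of CI, a non-empty result would flag spec additions that the generated packages have not picked up yet.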

Also, WHATWG has a 'developer edition' (targeting web developers), but valuable / concise information there seems sparse (although the documentation for events seems quite good): https://html.spec.whatwg.org/dev/indices.html#index

progrium commented 5 years ago

I don't think we should let recreating documentation get in the way of this. We can link to a relevant page from the godocs. Most are going to be self-explanatory to any web developer. Template systems don't document every HTML element they support. Another IDL we can use is the TypeScript definition, which is pretty compact and parsable.