tree-sitter / tree-sitter-html

HTML grammar for Tree-sitter
MIT License

Documentation #5

Closed Turbo87 closed 5 years ago

Turbo87 commented 5 years ago

After reading the official tree-sitter docs I'm now trying to understand the implementation in this project. Unfortunately, the custom scanner doesn't seem to be documented, and I'm wondering what its purpose is, why it's needed, and whether it would need to be extended to support something like Handlebars.

maxbrunsfeld commented 5 years ago

Yeah, sorry for the lack of documentation around that. As you probably could tell, scanner.cc is a hand-written source file, unlike parser.c which is generated based on the grammar. It's called an "external scanner", and it's used in Tree-sitter parsers where you need a little bit of logic that can't be expressed in the context-free grammar + regular expression format.

In the case of HTML, we use it to implement tag-name matching, as well as HTML's idiosyncratic logic for which tags can be self-closing, etc.
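To illustrate why tag-name matching needs state that a grammar rule alone can't carry, here is a minimal Python sketch. It is not the logic in scanner.cc: the `tags_balanced` function, the `TAG_RE` pattern, and the short `VOID_ELEMENTS` set are all hypothetical, and real HTML has many more rules (implicit end tags, raw-text elements, and so on). It only shows the core idea of remembering open tag names on a stack and comparing each closing name against the top.

```python
import re

# Hypothetical, incomplete subset of HTML void elements (tags with no closing tag).
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}

# Illustrative tag pattern: optional "/", a tag name, any attributes, optional "/".
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9-]*)[^>]*?(/?)>")


def tags_balanced(html: str) -> bool:
    """Return True if every closing tag names the most recently opened tag."""
    stack = []
    for match in TAG_RE.finditer(html):
        closing, name, self_closing = match.group(1), match.group(2).lower(), match.group(3)
        if closing:
            # A closing tag must match the tag on top of the stack.
            if not stack or stack.pop() != name:
                return False
        elif name not in VOID_ELEMENTS and not self_closing:
            # Only tags that expect a closing tag are remembered.
            stack.append(name)
    return not stack


print(tags_balanced("<div><p>hi</p></div>"))  # True
print(tags_balanced("<div><p>hi</div></p>"))  # False
```

Because tag names are open-ended (custom elements, for example), the set of names to match can't be enumerated as grammar rules, which is roughly why this kind of bookkeeping ends up in hand-written code.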

For Handlebars, I don't think you'd need to modify tree-sitter-html in any way. I think you'd want to create a new parser, somewhat like tree-sitter-embedded-template (which parses the templating language used by EJS and ERB: <% and %> tags, etc). The new parser (let's call it tree-sitter-handlebars) would just be responsible for parsing Handlebars tags, not the underlying HTML.

Then, to parse a Handlebars template, you would first parse the file with tree-sitter-handlebars. Then, you would take that syntax tree and find the ranges of all of the content nodes (nodes that represent chunks of text content between the Handlebars tags), and parse those ranges using tree-sitter-html.

Tree-sitter's includedRanges API allows you to parse a set of disjoint ranges in a document. That's how we parse things like EJS and ERB in Atom today. Does that make sense?
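The range-finding step described above can be sketched roughly as follows. This is only an illustration: tree-sitter-handlebars is hypothetical at this point in the thread, so a regular expression (`MUSTACHE_RE`) stands in for it, and the `content_ranges` helper is an invented name. The sketch ignores Handlebars details such as triple braces and comments, and it returns character offsets, whereas Tree-sitter ranges are expressed in bytes and row/column points.

```python
import re

# Stand-in for a real Handlebars parser: matches any {{ ... }} tag.
MUSTACHE_RE = re.compile(r"\{\{.*?\}\}", re.DOTALL)


def content_ranges(source: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of the text between Handlebars tags."""
    ranges = []
    cursor = 0
    for match in MUSTACHE_RE.finditer(source):
        if match.start() > cursor:
            # Text between the previous tag (or file start) and this tag.
            ranges.append((cursor, match.start()))
        cursor = match.end()
    if cursor < len(source):
        # Trailing text after the last tag.
        ranges.append((cursor, len(source)))
    return ranges


template = "<p>{{title}}</p>"
print(content_ranges(template))  # [(0, 3), (12, 16)]
```

The resulting disjoint ranges are the kind of input the included-ranges mechanism expects: the HTML parser would be pointed at only those spans and would skip everything in between.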

Turbo87 commented 5 years ago

> Then, to parse a handlebars template, you would first parse the file with tree-sitter-handlebars. Then, you would take that syntax tree and find the ranges of all of the content nodes (nodes that represent chunks of text content between the handlebars tags), and parse those ranges using tree-sitter-html.
>
> Does that make sense?

I understand your proposal, I'm just not sure yet if it's the best way forward. Handlebars is commonly used in two ways:

  1. a templating system using string interpolation implemented in a variety of languages and used in not just HTML, but all sorts of files
  2. as the templating layer for Ember.js where the Handlebars implementation is HTML-aware and compiles to bytecode instead of doing string interpolation

For 1, a simple tree-sitter-handlebars plugin would probably be sufficient, but for 2, it would be preferable to have a unified AST in the end that supports both HTML and the Handlebars subset used by Ember.js (no partials, etc).

Just to give you an idea of typical Ember.js template code:

<div class="is-car {{if isFast "zoooom" "putt-putt-putt"}}">
  {{car-component car=model}}
</div>

As you can see, it's possible to use Handlebars bindings inside of HTML element attributes. If used like class={{someBinding}}, I would assume the HTML parser would return an error because of the missing attribute value?
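To make the question concrete, here is a small Python sketch of the text that would be left for an HTML parser once the Handlebars spans are skipped. The `html_only` helper and the `MUSTACHE_RE` pattern are hypothetical stand-ins, not part of any Tree-sitter API, and the sketch says nothing about whether the HTML grammar accepts the result; it only shows what remains.

```python
import re

# Stand-in for a real Handlebars parser: matches any {{ ... }} tag.
MUSTACHE_RE = re.compile(r"\{\{.*?\}\}", re.DOTALL)


def html_only(source: str) -> str:
    """Drop every {{ ... }} span, leaving only the surrounding text."""
    return MUSTACHE_RE.sub("", source)


# Binding used as the whole attribute value: the "=" is left with nothing after it.
print(html_only("<div class={{someBinding}}>"))  # <div class=>

# Binding inside a quoted value: the quotes survive, so a value is still present.
print(html_only('<div class="is-car {{if isFast "zoooom" "putt-putt-putt"}}">'))  # <div class="is-car ">
```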

maxbrunsfeld commented 5 years ago

Missing attribute values are ok, so we wouldn’t get an error there.

I see your point about modeling Ember's Handlebars implementation more exactly. It’s definitely doable, but would require duplicating most of the code in this repo. And it still seems like you need a different approach for the more general usage of Handlebars as a template language.

Duplicating some of this logic is not a huge deal, but it might be worth trying the simpler approach first, and seeing if you really need to model it all as one language.

maxbrunsfeld commented 5 years ago

External scanners have been documented: http://tree-sitter.github.io/tree-sitter/creating-parsers#external-scanners.