Closed tek closed 3 years ago
@rewinfrey @maxbrunsfeld @philipturnbull
Is there some way to reset the parser to a previous state? I looked at the C API but didn't find anything suitable.
As I understand tree-sitter's behavior there is no such option for manual controlling the parser's state. The parser makes this automatically and there only one option to correctly save the external scanner state through serialization and restore through deserialization. In the deserialization moment the external scanner gets the buffer with a state to which the parser decided to make the restore.
hm, bummer. I looked at how the C grammar handles #ifdef
s, but it's not apparent to me that there is a similar situation.
I naively imagine functionality in tree-sitter.h
where I can call get_current_state
and receive the state number, then later call shift(old_state)
and it will magically continue parsing :innocent: guess there's more to it than that
@tek regarding this PR contents:
$ git fetch origin refs/pull/29/head
...................................................................
* branch refs/pull/29/head -> FETCH_HEAD
$ git log --stat FETCH_HEAD
commit 5b7947c562b9b2b5d844c04351eacc331a6bbb86
structure grammar into multiple files
grammar.js | 1313 +-
grammar/basic.js | 98 +
grammar/class.js | 99 +
grammar/data.js | 147 +
grammar/decl.js | 81 +
grammar/exp.js | 241 +
grammar/id.js | 126 +
grammar/import.js | 30 +
grammar/misc.js | 12 +
grammar/module.js | 37 +
grammar/pat.js | 95 +
grammar/pattern.js | 32 +
grammar/type.js | 201 +
grammar/util.js | 66 +
src/grammar.json | 7391 +++--
src/node-types.json | 17 +-
src/parser.c | 298845 +++++++++++++++++++---------------------------------
17 files changed, 154611 insertions(+), 154220 deletions(-)
commit 62d7ebf592a9491d4d44273fceb1390b85d148b1 (refs/pull/29/head)
rewrite the grammar
...................................................................
63 files changed, 296573 insertions(+), 484667 deletions(-)
With such amount of changes it's too hard to understand what is an adaptation of the repo to the current tree-sitter version and what is improvements of the grammar. Also splitting the grammar.js to several files in my opinion is controversial change. For better chances to successful approval of changes to this repo by maintainers I'd suggest to split the work into several PRs:
Cleanup - Remove generated files from all the grammar repos in the tree-sitter org
corpus
to test/corpus
, etc.grammar.js
and related files.@ahlinc it's a complete rewrite, I wouldn't know how to split this according to your suggestion.
Why is it controversial to split the files?
Here the block inside of the
#ifdef
ends the current rule, but in the#else
it starts inside of that rule again. I can keep track of the previous state in the scanner, but I don't know how to deal with that for the grammar.
@tek I'm not Haskell expert so if you can give here a minimal correct version of a Haskell program with such preprocessor construction, may be I'll been able to help you with a way how to handle your preprocessor issue.
{-# language CPP #-}
f :: IO Int
f = do
_ <- do
pure ()
#ifdef foo
pure 1
#else
pure ()
pure 1
#endif
@ahlinc this should compile!
2. since there is no overlap with the original tests, how would that look?
Didn't know that, it's hard to understand in the huge PR.
Why is it controversial to split the files?
It's just my opinion, I didn't see such splits in any tree-sitter grammar repos and it's easier for me to navigate in a one file by a word search instead of several files. Also it seems that grammar.js
files looks too tricky for IDEs and I don't know any that would completely help with the navigation in them. Use of Typescript for tree-sitter grammars I didn't tried yet.
@tek thanks for the program snippet, I'll take a look today or this weekend.
I just now split the files after working on the project for a while, because it was quite unergonomic to navigate with your suggested word search, since lots of rules have strings like exp
and pat
in them. I'm not married to the idea, if it creates problems I'll change it back.
Why do you suspect this will be problematic with IDEs?
What about Typescript? Are you suggesting to use it?
@tek thanks for the program snippet, I'll take a look today or this weekend.
awesome, thanks!
At least coc-tsserver
can navigate to identifiers in the project (though since $
is used everywhere, that's very few)
Why do you suspect this will be problematic with IDEs?
For now I use VSCode as the main editor and has good completion in grammar.js
for DSL actions but at the same time it has problems in navigation between rules. At this moment I don't know what is the reason, may be I have some IDE misconfiguration or it doesn't understand completely grammar.js
constructions.
What about Typescript? Are you suggesting to use it?
I think about it as a way to solve my issues with navigation in the grammar.js
files because with Typescript I'll have a way to suggest to IDE about types in a place where it isn't clear for it. As consequence I'll need to call Typscript's tsc
compiler before each run of tree-sitter generate
.
For now I use VSCode as the main editor and has good completion in
grammar.js
for DSL actions but at the same time it has problems in navigation between rules. At this moment I don't know what is the reason, may be I have some IDE misconfiguration or it doesn't understand completelygrammar.js
constructions.
Is this a general problem or does it relate to this project, specifically the file separation?
In the former case, I can't imagine how an IDE would be able to navigate to rules since they are universally referenced as attributes of the $
object, which is unknown in the project due to being part of tree-sitter-cli
's machinery.
I think about it as a way to solve my issues with navigation in the
grammar.js
files because with Typescript I'll have a way to suggest to IDE about types in a place where it isn't clear for it. As consequence I'll need to call Typscript'stsc
compiler before each run oftree-sitter generate
.
So you would give the $
a type? Could you make that refer to top level attributes?
What do I need to configure for tree-sitter test
finding the language for a highlights test? I placed a haskell file in test/highlight
and added this to package.json
:
"tree-sitter": {
"scope": "source.hs",
"file-types": [
"hs"
],
"highlights": [
"queries/highlights.scm"
],
"injection-regex": "^(hs|haskell)$"
}
but it says No language found for path
.
It might have to do with finding the parser repo itself? https://tree-sitter.github.io/tree-sitter/syntax-highlighting#paths
This UX could probably be made smoother.
@maxbrunsfeld I have
"parser-directories": [
"/home/tek/code/tek/js"
],
in my config, and the repo path is /home/tek/code/tek/js/tree-sitter-haskell
. that should be correct, no?
I was able to break mentioned https://github.com/tree-sitter/tree-sitter-c C language parser with a simple C program where a directive splits function call rule. @tek this looks the same type of problem that you are trying to solve in this Haskell parser. Example:
#include <stdio.h>
#define FOO
int main(int argc, char const *argv[])
{
#ifdef FOO
printf
#endif
("Hello world!\n");
return 0;
}
With current tree-sitter abilities it looks for me impossible to implement correct tree-sitter parser in a one tree-sitter's language defined by a one grammar.js
file that would be able to process all possible combinations of rules and their splits by preprocessor rules. Even languages like C/C++ don't process preprocessor directives in the one parser, they splits such processing into two independent phases. The C/C++ preporcessor has one important feature that it decides that preprocessor directive in everything till the end of line and when preprocessor replaces condition directives it replaces them with \n
so it's impossible to make split of the grammar token what simplifies situation a bit, but I can imagine a language where some preprocessor directives may split anything, any token that consist more than a one character.
What I can suggest with current tree-sitter abilities is to:
All above is just my opinion and understanding how tree-sitter works and what limitations it has for now. Also I believe that it's possible to improve tree-sitter so it will be able to parse even such code with preprocessor constructions.
I had the same impression about the C grammar. Thanks for your suggestion with the multi-language feature, I'll try to implement something with that.
Howevery, I'd suggest to not make this part of this PR.
@maxbrunsfeld do you think it would be feasible to add functionality to the scanner API (or grammar) with which a state shift could be transparently enforced in order to reset to a previous preprocessor directive? Any other additional information that might help?
I think we’re always going to have to treat the preprocessor in an approximate way. It’s ok IMO if certain preprocessor patterns cause parse errors. Hopefully the errors will be recoverable and we’ll get useful results the vast majority of the time.
@maxbrunsfeld so if the parser would execute a shift
from an external stimulus, would it corrupt its overall state, or the parse tree?
those named precedences are quite the game changer!
question: what is the general policy when one rule can be substituted for another, but is more restrictive?
Like in this case, the simpletype
rule can be parsed by the btype
rule, but not all parts of the latter are legal in the positions where the former is used. The parse tree is identical, so is it praxis to use the more general rule, to avoid conflicts and be more permissive? or is it better to map the language precisely?
what is the general policy when one rule can be substituted for another, but is more restrictive?
I think that it is ok to parse a superset of the language if it helps you to avoid conflicts, or makes the grammar simpler. Sometimes I look at the generated parser's STATE_COUNT
also, to see which option will produce a smaller binary.
alright, all the TODOs are fixed, the realworld test projects that were included in the original now parse 100%! @maxbrunsfeld Please advise what else is needed for this to be merged.
I implemented the preprocessor directives so that the code in #else
blocks is just consumed by the cpp rule. I'll need to look deeper into how solution proposed by @ahlinc woud work, but not for this PR.
No idea what's going on with the travis config, does someone know how it was supposed to work?
1. **Unabbreviated Naming**
I saw that it was like that in the original, but I found it quite hard to parse, especially in the tests. There is a huge number of similarly named constructs like
qualified_symbolic_type_constructor
, is it really ergonomic to have those in parse trees? I haven't used tree-sitter for anything besides trying out the highlighting in neovim, so I can't relate to how it's commonly used.
I have quite a few inconsistencies in the structural node design, for example for promoted types, which can apply to a number of rules, I used the alias promoted
as a wrapper for the actual node, as in (promoted (type_literal (con_unit)))
– as opposed to promoted_con_unit
. Would that be sensible for others like qualified identifiers, as in (qualified (modid) (varid))
?
2. **Anonymous Operators**
I see, so those aren't useful to have named?
3. **Avoiding Excessive Wrapping** - * We'd often define something like `modid` (which I would suggest renaming to `module_identifier`) as an _alias_ of `conid`, instead of its own parent rule.
Even if the alias has to be used in multiple locations?
* We would try to structure the grammar so that we only create a `qtyconsym` (`qualified_type_constructor`?) if the symbol actually _is_ qualified (e.g. it has a parent module and a `.`).
that's already the case in some rules, I'll fix any that are remaining:
tyconsym: $ => $._tyconsym,
qtyconsym: $ => qualified($, $.tyconsym),
_qtyconsym: $ => choice($.qtyconsym, $.tyconsym),
thanks for the feedback!
I see, so those aren't useful to have named?
Even if they aren't named, they are still accessible in the tree, and you can still match on them with queries. I think it's just simpler to have fewer total names. There's something self-describing about the node's type just being =
.
Even if the alias has to be used in multiple locations?
Yeah, good question. To deal with this, we sometimes use a pattern like this:
inlines $ => [
$._module_identifier,
],
rules: $ => {
// Create an "inlined" rule that expresses the alias
_module_identifier: $ => alias($.constructor_identifier, $.module_identifier),
// Then, everywhere else, we refer to it by ☝️ this rule:
rule_a: $ => seq(
// ...
$._module_identifier,
// ...
),
rule_b: $ => choice(
// ...
$._module_identifier,
// ...
)
}
tests now run fine on github actions!
@maxbrunsfeld regarding the identifiers:
_aexp
, are named exactly like in the reference doc, in order to easily find the corresponding production rules there when debugging. Would you agree that this helpful to keep?
https://www.haskell.org/onlinereport/haskell2010/haskellch10.html _tyconid: $ => alias($.conid, $.type_constructor),
qualified_type_constructor: $ => qualified($, $._tyconid),
_qtycon: $ => choice($.qualified_type_constructor, $._tyconid),
type_operator: $ => $._tyconsym,
qualified_type_operator: $ => qualified($, $.type_operator),
_qtyconsym: $ => choice($.qualified_type_operator, $.type_operator),
import Foo (..)
, the two dots represent "all names", as opposed to import Foo()
, "only instances"; but omitting the dots from the tree would result in identical nodes. I would alias the dots as all_names
, is that acceptable?pattern_list
, pattern_strict
to be more readable than list_pattern
, strict_pattern
, since it allows me to parse the "category" of the nodes first, especially when there are many of them in succession, since it forms a regular pattern. What are your thoughts?This is just my personal opinion, but the long names would only be useful for me when I'm only beginning with the grammar.
Anything past that, they would harm readability - when I start working with something my brain falls into "pattern recognition mode", instead of "I'm genuinely reading and understanding what type_family_instance_declaration
means every time", and the long names hurt this - it takes more time to parse and get to the actually significant of the name.
Furthermore, the generated trees can get quite big and hard to trace, at least for me personally.
I think some abbreviated forms with a "abbreviations guide" intended for newcomers is a good compromise in this area.
(notably I haven't had much experience, so maybe I'm missing some crucial benefit to the long names)
I added some aliases and renames for tyconid
, conid
and top level declarations. Is this pattern more desirable?
@tek I’m a +1 on more human-targeted names for the public interface and using the Report’s grammar names for the hidden fields. Nice one 👍🏻
sounds like a compromise!
Interestingly enough, I can’t get this to build via npm install
on macOS… mysterious errors coming out of the C++ toolchain, even when specifying CXX=g++-9
or -10
. Curiously, tree-sitter test
and tree-sitter parse
work fine.
Really thrilled that this parses code written with -XBlockArguments. Amazing!
Really thrilled that this parses code written with -XBlockArguments. Amazing!
that was a tricky one :sweat_smile:
Interestingly enough, I can’t get this to build via
npm install
on macOS… mysterious errors coming out of the C++ toolchain, even when specifyingCXX=g++-9
or-10
. Curiously,tree-sitter test
andtree-sitter parse
work fine.
I had to move to GA because I couldn't get it to build on Travis. I'll try adding a macos action.
@tek Moving to GA is the right move anyway, since Travis is imposing OSS limits, and GA is better anyway.
I think the problem with npm install
on macOS may be a result of libc++ vs libstdc++, but I’m not sure.
it also happens in GA:
g++-9: error: unrecognized command line option '-stdlib=libc++'
any idea who sets this option?
I think that’s coming from somewhere inside Node or node-gyp. What’s really confusing is that using clang++ --std=c++17
produces, on my machine, errors that come from inside the functional
header itself:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1478:53: error: use of overloaded operator '!' is ambiguous
(operand type 'const function<bool (State)>')
bool __not_null(function<_Fp> const& __f) { return !!__f; }
^~~~
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1772:25: note: in instantiation of function template specialization
'std::__1::__function::__not_null<bool (State)>' requested here
if (__function::__not_null(__f))
^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1795:11: note: in instantiation of function template specialization
'std::__1::__function::__value_func<bool (const State &)>::__value_func<std::__1::function<bool (State)>, std::__1::allocator<std::__1::function<bool (State)> > >' requested here
: __value_func(std::forward<_Fp>(__f), allocator<_Fp>()) {}
^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:2491:50: note: in instantiation of function template specialization
'std::__1::__function::__value_func<bool (const State &)>::__value_func<std::__1::function<bool (State)>, void>' requested here
function<_Rp(_ArgTypes...)>::function(_Fp __f) : __f_(_VSTD::move(__f)) {}
^
../src/scanner.cc:685:69: note: in instantiation of function template specialization 'std::__1::function<bool (const State &)>::function<std::__1::function<bool (State)>, void>' requested here
Parser either(bool c, Parser match, Parser nomatch) { return either(const_<State>(c), match, nomatch); }
^
../src/scanner.cc:304:11: note: candidate function
Condition operator!(const Condition & c) { return [=](auto state) { return !c(state); }; }
^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/functional:1478:53: note: built-in candidate operator!(_Bool)
bool __not_null(function<_Fp> const& __f) { return !!__f; }
^
:dizzy_face:
oh, because I overload operator!
guess it only works for c++14
like that :(
I'll use a function instead
This isn’t in your code though, right? It looks like it’s my headers’ problem
But I can use <functional>
from an example test file without my headers dying… hmm 🤔
I'd say that when I override operator!(function<>)
, it will also be used in the stdlib when using a template from my code
like an orphan instance colliding with the default :smile:
hello :wave: I rewrote the grammar and it's working quite nicely. There's still some stuff to do, but at this point I'm opening this PR to get some advice on one specific problem that seems impossible to me.
The issue is preprocessor macros, as in:
Here the block inside of the
#ifdef
ends the current rule, but in the#else
it starts inside of that rule again. I can keep track of the previous state in the scanner, but I don't know how to deal with that for the grammar.Is there some way to reset the parser to a previous state? I looked at the C API but didn't find anything suitable.