tree-sitter / tree-sitter-haskell

Haskell grammar for tree-sitter.
MIT License

Help with creating a new parser! #49

Closed (FoamScience closed this issue 2 years ago)

FoamScience commented 2 years ago

Hi guys,

First of all, thanks for the amazingly organized and commented code; couldn't find anything like this for a month!

So, I'd like to use your external scanner as a base for my own, but I can't figure out what's needed to compose a simple parser. My parser is straightforward: it just keeps consuming characters and advancing the lexer until it encounters "{", ";", "}" or whitespace. I thought the following would work, but no 😢 :

bool non_identifier_char(const uint32_t c) { return iswspace(c) || eq(';')(c) || eq('{')(c) || eq('}')(c) || eq('$')(c); }
bool non_identifier_chars(State & state) { return non_identifier_char(state::next_char(state)); }

// If identifier symbol is active, fail if not an identifier char
Parser identifier = sym(Sym::identifier)(iff(cond::non_identifier_chars)(fail));
// Do nothing else, just check for identifiers
all = identifier;

Can anyone help? Thanks in advance!

tek commented 2 years ago

hey there, happy to hear that the scanner is useful as a library!

Your parser only consumes one character; to parse repeatedly, you'll have to use something like read_while. Additionally, if you want the scanner to produce a successful result with the characters that have been accepted, you'll need to call the finish combinator.

Your example could be expressed roughly like this:

Parser identifier = sym(Sym::identifier)(
  iff(non_identifier_chars)(fail) + 
  parser::read_while(!non_identifier_char) + 
  parser::finish(Sym::identifier, "some description")
);
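
To make the three steps above concrete (reject on a bad lookahead, read repeatedly, then finish), here is a minimal self-contained sketch. It does not use the scanner library at all: ToyState, ToyResult, ToyParser, then and the three step functions are hypothetical stand-ins invented for illustration, and the real combinators may behave differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>

// Toy stand-ins for illustration only; none of these are the scanner library's real types.
struct ToyState {
  std::string input;
  std::size_t pos;  // index of the next unread character
};

struct ToyResult {
  bool finished;    // true once a step has produced a final answer
  bool success;     // only meaningful when finished is true
  std::size_t end;  // index one past the last accepted character
};

using ToyParser = std::function<ToyResult(ToyState &)>;

ToyResult keep_going() { return ToyResult{false, false, 0}; }

bool non_identifier_char(uint32_t c) {
  return c == ' ' || c == '\t' || c == '\n' || c == '\r' ||
         c == ';' || c == '{' || c == '}' || c == '$';
}

// True at end of input or when the lookahead cannot be part of an identifier.
bool at_non_identifier(const ToyState &s) {
  return s.pos >= s.input.size() ||
         non_identifier_char(static_cast<unsigned char>(s.input[s.pos]));
}

// Sequencing: run `a`; only if it has not finished, continue with `b`.
ToyParser then(ToyParser a, ToyParser b) {
  return [=](ToyState &s) {
    ToyResult r = a(s);
    return r.finished ? r : b(s);
  };
}

// Step 1: give up immediately when the lookahead cannot start an identifier.
ToyParser fail_if_non_identifier() {
  return [](ToyState &s) {
    return at_non_identifier(s) ? ToyResult{true, false, 0} : keep_going();
  };
}

// Step 2: consume characters for as long as they belong to an identifier.
ToyParser read_identifier_chars() {
  return [](ToyState &s) {
    while (!at_non_identifier(s)) ++s.pos;
    return keep_going();
  };
}

// Step 3: report success with everything consumed so far.
ToyParser finish_identifier() {
  return [](ToyState &s) { return ToyResult{true, true, s.pos}; };
}

ToyParser toy_identifier() {
  return then(fail_if_non_identifier(),
              then(read_identifier_chars(), finish_identifier()));
}
```

In this toy model, leaving out step 2 would accept nothing beyond the lookahead check, and leaving out step 3 would never report a token, which mirrors the two points made above.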

FoamScience commented 2 years ago

Thanks for the help! That got me half of the way, but I still get a weird error.

I want to parse:

one line;

as ("one" and "line" are identifiers)

(identifier identifier)

However, with:

function<Result(State &)> read_while_parser(Condition pred) {
  return [=](State & state) {
    while (true) {
      if (state::eof(state)) break;
      uint32_t c = state::next_char(state);
      if (!pred(state)) {
        mark("identifier");
        break;
      }
      state::advance(state);
    }
    return Result(Sym::identifier, false);
  };
}
Parser identifier = sym(Sym::identifier)(
  //iff(cond::non_identifier_chars)(fail) + 
  read_while_parser(cond::identifier_chars) + 
  parser::finish(Sym::identifier, "Identifier")
);
Parser all = identifier;

I get ("l" gets eaten somehow!):

State { syms = "identifier", indents = empty }
finish: Identifier
result: identifier, 3
State { syms = "identifier", indents = empty }
finish: Identifier
result: identifier, 3
State { syms = "identifier", indents = empty }
finish: Identifier
result: identifier, 3
State { syms = "identifier", indents = empty }
finish: Identifier
next: ;
result: identifier, 9
State { syms = "identifier", indents = empty }
finish: Identifier
result: identifier, 3
State { syms = "identifier", indents = empty }
finish: Identifier
next: ;
result: identifier, 9
State { syms = "identifier", indents = empty }
finish: Identifier
next: ;
result: identifier, 9
State { syms = "identifier", indents = empty }
finish: Identifier
result: identifier, 9
State { syms = "identifier", indents = empty }
finish: Identifier
result: identifier, 9
State { syms = "identifier", indents = empty }
finish: Identifier
result: identifier, 9
State { syms = "identifier", indents = empty }
finish: Identifier
result: identifier, 9

(ERROR [0, 0] - [1, 0]
  (identifier [0, 0] - [0, 3])
  (ERROR [0, 3] - [0, 5]
    (identifier [0, 3] - [0, 3])
    (ERROR [0, 4] - [0, 5]))
  (identifier [0, 5] - [0, 9]))

<ERROR>
  <identifier>one</identifier>

  <ERROR>
    <identifier></identifier>

    <ERROR>l</ERROR>
</ERROR>

  <identifier>ine;</identifier>
</ERROR>

which does not make much sense; any ideas?

You can view my code here in case you want to take a look at the grammar file.

Any help is much appreciated; Thanks in advance!

tek commented 2 years ago

it looks to me like it's parsing the empty string as a successful identifier because you commented out the check for the non-identifier character and aren't verifying that you've seen at least one identifier character in your custom parser!
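
As a rough illustration of that missing check, here is a self-contained sketch that only reports an identifier when at least one character was consumed. It is independent of the scanner library: is_identifier_char, identifier_length and scan_identifier are hypothetical names, and the character set is an assumption based on the snippets earlier in this thread.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Assumed character set: anything except whitespace, ';', '{', '}' and '$'.
bool is_identifier_char(char c) {
  return !std::isspace(static_cast<unsigned char>(c)) &&
         c != ';' && c != '{' && c != '}' && c != '$';
}

// Number of identifier characters starting at `pos`; 0 means no identifier here.
std::size_t identifier_length(const std::string &input, std::size_t pos) {
  std::size_t end = pos;
  while (end < input.size() && is_identifier_char(input[end])) ++end;
  return end - pos;
}

// Hypothetical scanner step: refuses to report a zero-length identifier.
bool scan_identifier(const std::string &input, std::size_t &pos) {
  const std::size_t n = identifier_length(input, pos);
  if (n == 0) return false;  // empty match: fail instead of emitting a token
  pos += n;                  // accept the characters that were read
  return true;
}
```

With the input "one line;", this sketch accepts "one", then fails at the space instead of producing an empty identifier, so the whitespace can be handled by something other than the identifier rule.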

FoamScience commented 2 years ago

Oh man, it was right in front of my eyes. Thanks a lot!

tek commented 2 years ago

:sweat_smile: my pleasure!

414owen commented 2 years ago

@FoamScience I'd be interested in your experience with the scanner library, performance-wise. See https://github.com/tree-sitter/tree-sitter-haskell/issues/41

FoamScience commented 2 years ago

Quoting 414owen: "@FoamScience I'd be interested in your experience with the scanner library, performance-wise. See #41"

I'm actually using only one parser from the scanner library, and I never noticed any slowness compared to what I was using before (though I've only tested files up to 4MB in size with my grammar). To me, it seems like chaining parsers (or maybe just one of the parsers the library provides) may be the cause of this issue.