leostera / caramel

:candy: a functional language for building type-safe, scalable, and maintainable applications
Apache License 2.0

parsing erlang terms #51

Open progman1 opened 3 years ago

progman1 commented 3 years ago

I run Erlang.Parse.from_file on https://github.com/erlang/otp/blob/master/lib/wx/api_gen/wxapi.conf

and get the error

failed: In wxapi.conf.copy, at offset 820: syntax error.

probably because the file defines terms to be read by file:consult/1 and is not appropriate to the front door of your parser. but with a different entry point it could parse terms?

leostera commented 3 years ago

Could you show me the file you're trying to parse?

Or an equivalent file that also breaks like this?

That'd help me see if there's anything that I know is currently unsupported by the Menhir parser or if we need to spend some time digging.

Thanks for opening the issue! 🙌🏼

progman1 commented 3 years ago

the link to it is above but here's an excerpt:

%% %CopyrightEnd%

{const_skip, [wxGenericFindReplaceDialog, wxInvalidDateTime, wxLANGUAGE_KHMER]}.
  %% New enums needed for gl contexts not static numbers
  {'wx_GL_COMPAT_PROFILE',   {test_if, "wxCHECK_VERSION(3,1,0)"}},

leostera commented 3 years ago

Oh, sorry, I missed the link.

I think the parser will have trouble parsing that, since it's built to parse an entire Erlang module. I started the tree-sitter-erlang project to address some of these limitations, but I haven't yet integrated it into the erlang library.

You could try using that tree-sitter parser with something like ocaml-tree-sitter to get up and running. Else I'd be happy to either help you integrate the tree-sitter-erlang into the erlang library or rework the Menhir parser as we just landed a new AST here that is waiting to be used.

progman1 commented 3 years ago

I don't fully understand! Terms are part of the erlang language aren't they? What does the newest erl-parsetree.ml have over the old one? I saw that the parser as-is had just the one entry point (very reasonably :). And I imagined that another entry point into the grammar could be added, one directly to a 'Terms' rule. Which may not be true if 'Term' syntax is not part of the erlang language itself....

You have the incremental parser menhir definition - how come you're going after tree-sitter?

FYI, on staring at the format of the wxapi.conf for a while I got the impression it may not be a very regular syntax - a sort of lists of lists of lists affair that's ok for Erlang's dynamic typing approach. Which suggested to me that I maybe shouldn't start hacking a yacc grammar for it! It also suggests to me that it isn't part of the erlang language as such since you already have a menhir grammar for erlang. I can't remember the limitations of LALR/LR grammars unfortunately.

What's your understanding? thanks.

leostera commented 3 years ago

@progman1 let me try to answer your questions :)

Terms are part of the erlang language aren't they?

Yes, they are.

And I imagined that another entry point into the grammar could be added, one directly to a 'Terms' rule.

We could make a new parser that reuses the expression language from the main parser, yes. This is because Menhir allows only one %start entrypoint.

how come you're going after tree-sitter?

The Menhir parser is only directly usable within OCaml code, while the Tree-sitter parser can be used anywhere with tree-sitter bindings. That includes Rust libraries, Neovim, and GitHub Semantic. The Erlang community benefits more widely from this.

The lowest hanging fruit here would be to refactor erl_parser.mly into 2 parsers: erl_expr_parser.mly and erl_mod_parser.mly. Caramel then continues to rely on Erlang.Parser.module_from_file/1, and you get a new Erlang.Parser.terms_from_file/1 that you can use to lift your config file into an Erlang.Ast.literal list.
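
To make the idea of lifting a consult-style file into a list of terms concrete, here is a minimal, self-contained OCaml sketch. It is not the Menhir parser and not the proposed Erlang.Parser.terms_from_file/1: the `term` type, `Parse_error`, and `terms_of_string` are hypothetical names for illustration, and the real Erlang.Ast.literal may look quite different. It reads dot-terminated terms made of atoms, quoted atoms, strings, integers, tuples, and lists, and skips %-comments. Escapes, floats, binaries, and adjacent string literals are left out.

```ocaml
(* Hypothetical term type; the real Erlang.Ast.literal may differ. *)
type term =
  | Atom of string
  | Str of string
  | Int of int
  | Tuple of term list
  | List of term list

(* Carries the byte offset at which parsing failed. *)
exception Parse_error of int

(* Parse a sequence of dot-terminated terms, the shape file:consult/1 reads. *)
let terms_of_string (s : string) : term list =
  let n = String.length s in
  let pos = ref 0 in
  let fail () = raise (Parse_error !pos) in
  let peek () = if !pos < n then Some s.[!pos] else None in
  (* Skip whitespace and %-comments. *)
  let rec skip () =
    match peek () with
    | Some (' ' | '\t' | '\n' | '\r') -> incr pos; skip ()
    | Some '%' ->
      while !pos < n && s.[!pos] <> '\n' do incr pos done;
      skip ()
    | _ -> ()
  in
  let expect c =
    skip ();
    if peek () = Some c then incr pos else fail ()
  in
  (* Consume the longest run of characters satisfying [ok]. *)
  let take_while ok =
    let start = !pos in
    while !pos < n && ok s.[!pos] do incr pos done;
    String.sub s start (!pos - start)
  in
  let is_digit = function '0' .. '9' -> true | _ -> false in
  let is_word = function
    | 'a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '_' | '@' -> true
    | _ -> false
  in
  (* Read up to the closing [quote]; escapes are not handled in this sketch. *)
  let quoted quote =
    incr pos;
    let body = take_while (fun c -> c <> quote) in
    if !pos >= n then fail ();
    incr pos;
    body
  in
  let rec term () =
    skip ();
    match peek () with
    | Some '{' -> incr pos; Tuple (items '}')
    | Some '[' -> incr pos; List (items ']')
    | Some '"' -> Str (quoted '"')
    | Some '\'' -> Atom (quoted '\'')
    | Some ('0' .. '9') -> Int (int_of_string (take_while is_digit))
    | Some ('a' .. 'z') -> Atom (take_while is_word)
    | _ -> fail ()
  (* Comma-separated terms up to [close]; a trailing comma is rejected. *)
  and items close =
    skip ();
    if peek () = Some close then (incr pos; [])
    else
      let rec loop acc =
        let t = term () in
        skip ();
        match peek () with
        | Some ',' -> incr pos; loop (t :: acc)
        | Some c when c = close -> incr pos; List.rev (t :: acc)
        | _ -> fail ()
      in
      loop []
  in
  (* Top level: zero or more terms, each followed by a dot. *)
  let rec all acc =
    skip ();
    if !pos >= n then List.rev acc
    else
      let t = term () in
      expect '.';
      all (t :: acc)
  in
  all []
```

Calling `terms_of_string` on the contents of a config file yields one `term` per dot-terminated entry, and a malformed entry raises `Parse_error` with the offending offset, similar in spirit to the "at offset 820" message reported above.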

The strong path forward is to do some work and integrate tree-sitter-erlang back into this repository, to use that as the term parser first. If that works, it'll be easier to start migrating the main parser to it.

progman1 commented 3 years ago

thanks for clarifying. I will tackle the low-hanging fruit! I have done some messing with menhir and something might be doable about entry points via converting to ocamlyacc grammar first, for an even lower hang!

progman1 commented 3 years ago

I have a parsed file :) happily, menhir does actually accept more than one start symbol. I had to do dangling commas in tuples and lists - maybe that isn't valid expression language after all? (I don't know if 'term' language is any different to expressions) the file also had multi-line strings which I took to mean should be stuck back together (macro stringification?) so a change there too.

if these are actually valid erlang then I'm happy to send up the patch?
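
For readers unfamiliar with multiple start symbols, a rough sketch of the shape in a Menhir grammar follows. This is not the actual erl_parser.mly or the patch: the token names, the rule names, and the `Ast` module with its `term` constructors are all hypothetical. Menhir generates one parsing function per `%start` symbol, so a grammar like this would expose both `terms` and `term_eof` from the generated module.

```ocaml
(* Hypothetical sketch of a Menhir grammar (.mly) with two entry points. *)
(* Ast is an assumed module defining: *)
(*   type term = Atom of string | Str of string *)
(*             | Tuple of term list | List of term list *)

%token <string> ATOM STRING
%token LBRACE RBRACE LBRACKET RBRACKET COMMA DOT EOF

(* One generated parsing function per %start symbol. *)
%start <Ast.term list> terms
%start <Ast.term> term_eof

%%

(* A whole file of dot-terminated terms. *)
terms:
  | ts = list(terminated(term, DOT)) EOF { ts }

(* A single term with nothing after it. *)
term_eof:
  | t = term EOF { t }

(* Shared by both entry points; no trailing commas are accepted here. *)
term:
  | a = ATOM { Ast.Atom a }
  | s = STRING { Ast.Str s }
  | LBRACE ts = separated_list(COMMA, term) RBRACE { Ast.Tuple ts }
  | LBRACKET ts = separated_list(COMMA, term) RBRACKET { Ast.List ts }
```

Both entry points reuse the same `term` rule, which is the point of adding a second start symbol rather than splitting the grammar into two files.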

leostera commented 3 years ago

Well I stand corrected! 🙌🏼 I didn't know that, thanks for showing me. Please send a patch 🎉 we can discuss the changes on the PR.