vanderlee / Comprehend

PHP 5/7 framework for creating complex scanners, lexers, tokenizers and parsers based on BNF (and variant) formal syntax definitions.
MIT License

Questions #5

Open flip111 opened 5 years ago

flip111 commented 5 years ago

Hello, I have some questions about your library:

  1. Could you explain the pros and cons of the two coding styles, object and ruleset?
  2. Could you provide an example of reading in a BNF file and parsing some content?
  3. Is the library compatible with EBNF? I have an EBNF file that uses { some rules }- (note the trailing -) to denote one-or-more, and also ANY - 'literal char', where - is the exception symbol. If the literal char is C, for example, this is equivalent to the regex [^C]. It also uses the syntax 4 * RULE to denote a fixed repetition count. Is there a special token for end of file?
  4. Is it possible to access/traverse the grammar nodes?
  5. Can you put the state of the library (alpha/beta/...) in the readme? Can you also point to the docs directory (even though there is not much content yet)? Generated API docs would be nice, but are not urgent.
  6. Is it possible to dump the content AST after it has been read? Similar to https://hoa-project.net/En/Literature/Hack/Compiler.html#Compilation_process and https://hoa-project.net/En/Literature/Hack/Compiler.html#Abstract_Syntax_Tree
  7. Is it possible to get a token list with offsets? Similar to https://hoa-project.net/En/Literature/Hack/Compiler.html#Namespaces
  8. Is it possible to get traces showing which rules have been considered and whether each one matched? Similar to https://hoa-project.net/En/Literature/Hack/Compiler.html#Traces
  9. Not really a question, but since I'm referring to the hoa\compiler manual a lot: it would be great to be able to generate random inputs based on the grammar. https://hoa-project.net/En/Literature/Hack/Compiler.html#Generation The property random AST -> pretty print -> parser -> AST should hold.
  10. How are the error messages? Do they show which rule fails? Does it show which tokens would be a valid choice in that position?
  11. Is it possible to make custom error messages in places where a particular rule does not match?
  12. Is there any detection for infinite loops? For example, in ( some_rule* )+ the + must match at least once, but the inner rule is allowed to match nothing.
  13. Can you put some benchmarks of different sizes of grammar and input and perhaps compare to some other libraries?
  14. Are tokens literals only? Can they be regular expressions? Sometimes several rules can be folded into one big regular expression (probably a trade-off with some other features). In general this will be a massive speedup, because PCRE is really fast in comparison to logic in PHP itself. It would be interesting to detect these cases and rewrite the rules.
  15. Can you explain what you mean by the difference between a lexer and a tokenizer? In my mind they are the same, but in the library description they are separate things.
  16. Can an example be given of how to generate a lexer, a tokenizer and a parser separately? Or perhaps I misunderstood and they must always be created together?
  17. Are the created lexers/parsers compiled to PHP code, or must the grammar be interpreted without a compilation phase?
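
To make question 12 concrete, here is a minimal sketch in plain PHP (7.1+). It is not Comprehend code: literal() and repeat() are hypothetical helpers invented for this illustration, and the guard shown is just one common way to handle the problem. A repetition stops as soon as an iteration succeeds without consuming any input, so ( a* )+ terminates even though the inner a* may match nothing.

```php
<?php
// Hypothetical helpers for illustration only; none of these names come from Comprehend.

/** Matcher for a literal: returns the new position, or null on failure. */
function literal(string $text): callable
{
    return function (string $in, int $pos) use ($text): ?int {
        return substr($in, $pos, strlen($text)) === $text ? $pos + strlen($text) : null;
    };
}

/** Repeats $inner at least $min times; stops when an iteration consumes nothing. */
function repeat(callable $inner, int $min): callable
{
    return function (string $in, int $pos) use ($inner, $min): ?int {
        $count = 0;
        while (true) {
            $next = $inner($in, $pos);
            if ($next === null) {
                break; // inner rule failed: no more iterations
            }
            $count++;
            if ($next === $pos) {
                // Zero-width match: repeating it would never advance, so stop here.
                break;
            }
            $pos = $next;
        }
        return $count >= $min ? $pos : null;
    };
}

$star = repeat(literal('a'), 0); // a*
$plus = repeat($star, 1);        // ( a* )+

var_dump($plus('aaab', 0)); // int(3)
var_dump($plus('b', 0));    // int(0): matched once, consuming nothing
```

Without the `$next === $pos` check, the outer loop would keep accepting the empty match at the same position forever.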

The library looks to be at an early stage, so I realize most of the things I ask for are likely not available yet, though perhaps some are. It's also possible that some things will never be part of the parser's design. Maybe you can already comment on some of the points even though it's early.

Thanks for putting your creation online :)

vanderlee commented 5 years ago

Hi, sorry for not replying sooner, but I'm swamped with personal business. I'll get to the full replies soon.

For now a few quick ones:

  1. Ruleset is easier to manage, as it automatically resolves recursion. With objects, you need to handle recursion yourself using "Stub" (placeholder) objects. Note that rulesets are largely based on objects too.
  2. This framework does not parse BNF (yet; it's planned), but lets you create your own parsers by transcribing BNF into its own objects.
  3. Yes. You'll get a tree of every token parsed. In order for the tokens to be identified by name you can set one yourself or use a ruleset to name tokens automatically.
  4. Currently beta, mostly due to missing docs. It has very high unit-test coverage and should be very stable, but it is sadly still mostly undocumented.
  5. Currently not. This is actually a significantly more difficult problem than it appears at first. It should be possible to use the token mechanism for this. Tokens were only recently added, so I'm still figuring out how to fully utilize them.
  6. Assuming errors based on tokens is a working solution, appending error messages to those tokens should be trivial.
  7. I'm assuming by "token" you mean a terminal like a character or word. If so then yes; a regex terminal is included and is in fact probably the best choice in most practical situations (though not necessarily the best performing).
  8. Put in the most basic terms, a tokenizer breaks the input into parts, a lexer identifies the parts. In terms of the Comprehend framework (and most parser frameworks), they are the same thing. If you attach "token names" (I know the naming seems confusing) to the parser rules, the tokenizer becomes a lexer.
  9. Not currently. It's something on my wishlist, but has a low priority. You should probably look at YACC/Lexer/Bison/Flex PHP conversion for this.
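
The tokenizer/lexer distinction described in the answers above can be sketched in plain PHP (7.1+). This is an illustration using standard PCRE functions only, not Comprehend's API; the functions tokenize() and lex() and the token names are made up for the example. The tokenizer only splits the input into parts (with byte offsets), while the lexer additionally attaches a name to each part.

```php
<?php
// Illustration only: plain PCRE calls, not Comprehend's API. The token names are made up.

/** Tokenizer: splits the input into parts, returning [text, byte offset] pairs. */
function tokenize(string $input): array
{
    preg_match_all('/\d+|[a-z]+|[+*=]/i', $input, $matches, PREG_OFFSET_CAPTURE);
    return $matches[0];
}

/** Lexer: the same parts, each additionally identified by a name. */
function lex(string $input): array
{
    $names = [
        '/^\d+$/'     => 'number',
        '/^[a-z]+$/i' => 'identifier',
        '/^[+*=]$/'   => 'operator',
    ];
    $tokens = [];
    foreach (tokenize($input) as [$text, $offset]) {
        foreach ($names as $pattern => $name) {
            if (preg_match($pattern, $text) === 1) {
                $tokens[] = ['name' => $name, 'text' => $text, 'offset' => $offset];
                break; // first matching name wins
            }
        }
    }
    return $tokens;
}

print_r(tokenize('x = 42')); // parts only: 'x' at 0, '=' at 2, '42' at 4
print_r(lex('x = 42'));      // same parts, named identifier, operator, number
```

Whitespace is simply skipped here because no alternative in the pattern matches it; a real grammar would decide explicitly how to treat it.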

I'm not familiar with Hoa, as it covers a different use case.

flip111 commented 5 years ago
1 -> 1
2 -> 2
4 -> 3
5 -> 4
? -> 5
10 -> 6
14 -> 7
15 -> 8
? -> 9

I guess this is how the questions map to the answers. I'm not sure which questions answers 5 and 9 belong to.

vanderlee commented 5 years ago

The numbers map to the same numbers you used.

On Sat, 3 Nov 2018 at 21:04, flip111 <notifications@github.com> wrote:

1 -> 1, 2 -> 2, 4 -> 3, 5 -> 4, ? -> 5, 10 -> 6, 14 -> 7, 15 -> 8, ? -> 9

I guess this is how the questions map to the answers, i'm not sure to which question answer 5 and 9 belong to.
