hoaproject / Compiler

The Hoa\Compiler library.
https://hoa-project.net/
453 stars 48 forks source link

%skip token selection #74

Closed SerafimArts closed 5 years ago

SerafimArts commented 7 years ago

In my task I need to know information about some tokens that have been marked as "%skip".

For example, here is the code:

%skip T_COMMENT \/\*.*\*\/
/** Docblock */ 
class /** skipped */ Test /** skipped */
{
    /** skipped */
}

Output:

#Class
    #Name 
        token(T_NAME, Test)

But Im need something like this:

#Class
  #Docblock
      token(T_COMMENT, /** Docblock */)
  #Name
      token(T_NAME, Test)

I can replace the %skip with a %token. But I will have to make it (%token) optional after every active token in the grammar file.

Any ideas how implement this?

--- Want to back this issue? **[Post a bounty on it!](https://www.bountysource.com/issues/48514991-skip-token-selection?utm_campaign=plugin&utm_content=tracker%2F330168&utm_medium=issues&utm_source=github)** We accept bounties via [Bountysource](https://www.bountysource.com/?utm_campaign=plugin&utm_content=tracker%2F330168&utm_medium=issues&utm_source=github).
Hywan commented 7 years ago

Hello :-),

Either the token is skipped, and it will not appear in the token sequence, or the token is kept, and then it appears in the token sequence. Having a skip token in the token sequence is “not possible” (actually, it is possible, but this is not how it works).

Everything happens here: https://github.com/hoaproject/Compiler/blob/c86ccfbce9b9cad17cf84ffdf5c505c695d83d7a/Llk/Lexer.php#L205

and here:

https://github.com/hoaproject/Compiler/blob/c86ccfbce9b9cad17cf84ffdf5c505c695d83d7a/Llk/Lexer.php#L165-L168

It is still possible to introduce a new pragma to control the presence of the skip tokens inside the token sequence, but this is difficult, because we should modify both the lexer (easy part) and the parser (more difficult: We should add conditions everywhere to jumb over skip tokens, but still place them in the AST, not simple at all).

A %skip token is just a shortcut for %token skip. You can use them inside token namespaces.

What's your usecase exactly?

SerafimArts commented 7 years ago

What's your usecase exactly?

I did not want to go into details =( OK

Grammar here: https://github.com/SerafimArts/Railgun/blob/master/resources/grammar.pp (so big)

In real life, I can write files like this:

type User {
    id: ID!
    login: String!
    friends(count: Int = 10): [User]
    createdAt(format: String = RFC3339): String
    updatedAt(format: String = RFC3339): String
}

The GraphQL standard implies that all comments that are before the declaration should be used as a description:

# This is description of user
type User {
    # This is identifier of the user type object
    # Every user contain one unique identifier
    id: ID!
    # This is a name of the user
    login: String!
    # Will returns user friends
    friends(count: Int = 10): [User]
    # Created date of user
    createdAt(format: String = RFC3339): String
    # Last user activity time
    updatedAt(format: String = RFC3339): String
}

Each of these comments must fall into the type introspection. But GraphQL admits and such syntax:

# This is description of user
type User # TODO add avatar and email
{
    # This is identifier of the user type object
    # Every user contain one unique identifier
    id: ID! # TODO Change to UUID
    # This is a name of the user
    login: String!

...etc

In this case, the comments that are not placed where you want (# TODO ...) are simply ignored. I do not know how to implement this without problems

SerafimArts commented 7 years ago

A %skip token is just a shortcut for %token skip. You can use them inside token namespaces.

Little did not understand this thesis =\

Hywan commented 7 years ago

Before I forget, cc @Grummfy (https://github.com/Grummfy/PHPGraphQL/). He has the same idea to integrate hoa/compiler inside webonyx/graphql-php. I am not saying one should replace another, just making a connection here :-).

Grummfy commented 7 years ago

if I remember well I have finally find a way to deal with comments... but unfortunately due to time I can't dig into your code to help you now, so just look at my repo if it can helps ;)

Hywan commented 7 years ago

@SerafimArts OK, I understand your problem. Side note: Really interesting project, congrats! The grammar is a piece of art, I love the coding style.

Naive approach:

It sounds like two different tokens:

  1. %skip “newline . blanks* . (something that is not a newline)+ . blanks? #”, and
  2. %token comment “newline . blanks* . #”

It's pseudo-code-ish because I don't know how you manage your newlines.

Do you understand my approach?

SerafimArts commented 7 years ago

@Hywan Spaces have no meaning. For the first time I cited an example of a PHP code for the reason that it is possible to read Dobblock from Reflection (http://php.net/manual/en/reflectionclass.getdoccomment.php). And its behavior for obtaining documentation information are very similar to those that I have to implement.

If briefly description docblock is the comments that are: 1) Are in front of the tokens: type, interface, union, enum, scalar, input, directive. Like:

# This is docblock type description
type Example {}

# This is docblock interface description
interface Example {}

2) Which are before the description of fields, like:

type Example {
    # This is docblock description of field named "first"
    first: FieldType
   # This is another description of field named "second"
   second: Some
}

3) All other comments should be ignore, like:

# + This is a readable description block
# + This is a readable description block
type Example 
# x This is ignored comment
{
    # + This is a readable description block
    field: Some # x This is ignored comment
# x This is ignored comment
} # x This is ignored comment
Hywan commented 7 years ago

I perfectly understand your issue. I am trying something.

Hywan commented 7 years ago

This example is working:

%token word    \w+
%token comment #[^\n]*
%skip  ignored \h*(#[^\n]*)?\n?

#root:
    line()*

#line:
    ( <comment> #annotated_line )? <word>

Given the following data:

# c1
foo
   bar
# c2
baz
   qux # c3
eol

We obtain this:

$ hoa compiler:pp grammar.pp data -s -v dump
 #  namespace     token name            token value                     offset
-------------------------------------------------------------------------------
 0  default       comment               # c1                                 0
 1  default       word                  foo                                  5
 2  default       word                  bar                                 12
 3  default       comment               # c2                                16
 4  default       word                  baz                                 21
 5  default       word                  qux                                 28
 6  default       word                  eol                                 37
 7  default       EOF                                                       40

>  #root
>  >  #annotated_line
>  >  >  token(comment, # c1)
>  >  >  token(word, foo)
>  >  #line
>  >  >  token(word, bar)
>  >  #annotated_line
>  >  >  token(comment, # c2)
>  >  >  token(word, baz)
>  >  #line
>  >  >  token(word, qux)
>  >  #line
>  >  >  token(word, eol)

If a comment token is explicitly specified, then we have an #annotated_line node in the AST. Else, it fallbacks into the skip token.

There is one caveat though: Comments must start at column 0. I am working on this.

SerafimArts commented 7 years ago

@Hywan In this case, there is an intersection of tokens. I've already tried this.

SerafimArts commented 7 years ago

I updated the circuit, now the tests will pass - you will see an error: https://github.com/SerafimArts/Railgun/commit/c54671653cc51b37d211a6ab33822c51b77ef91f https://travis-ci.org/SerafimArts/Railgun/jobs/267208140

Hywan commented 7 years ago

Gotcha!

- %skip  ignored \h*(#[^\n]*)?\n?
+ %skip  ignored \h*(#[^\n]*)?[\n\h]*

I am trying with this data:

# c1
foo
   bar

   # c2

# c2b

baz
   qux # c3
eol

And this grammar:

%token word    \w+
%token comment #[^\n]*
%skip  ignored \h*(#[^\n]*)?[\n\h]*

#root:
    line()*

#line:
    ( comment() #annotated_line )? <word>

#comment:
    <comment>+

(note the special comment rule just to group many annotations)

And here is the result:

$ hoa compiler:pp grammar.pp data -s -v dump
 #  namespace     token name            token value                     offset
-------------------------------------------------------------------------------
 0  default       comment               # c1                                 0
 1  default       word                  foo                                  5
 2  default       word                  bar                                 12
 3  default       comment               # c2                                20
 4  default       comment               # c2b                               26
 5  default       word                  baz                                 33
 6  default       word                  qux                                 40
 7  default       word                  eol                                 49
 8  default       EOF                                                       52

>  #root
>  >  #annotated_line
>  >  >  #comment
>  >  >  >  token(comment, # c1)
>  >  >  token(word, foo)
>  >  #line
>  >  >  token(word, bar)
>  >  #annotated_line
>  >  >  #comment
>  >  >  >  token(comment, # c2)
>  >  >  >  token(comment, # c2b)
>  >  >  token(word, baz)
>  >  #line
>  >  >  token(word, qux)
>  >  #line
>  >  >  token(word, eol)
SerafimArts commented 7 years ago

I get an error in last ur example: Unexpected token "# c2b" (comment) at line 8 and column 1: # c2b

grammar:

%token foo      foo\b
%token bar      bar\b
%token baz      baz\b
%token qux      qux\b
%token eol      eol\b

%token comment  #[^\n]*

%skip  ignored  \h*(#[^\n]*)?[\n\h]*

#root:
    line()*

#line:
    (<comment> #annotated_line)? word()

#word:
    ::foo:: | ::bar:: | ::baz:: | ::qux:: | ::eol::

data:

# c1
foo
   bar

   # c2

# c2b

baz
   qux # c3
eol
Hywan commented 7 years ago
-    (<comment> #annotated_line)? word()
+    (<comment> #annotated_line)* word()
SerafimArts commented 7 years ago

Oops :D

SerafimArts commented 7 years ago

Such a regex should match any comment that starts on a separate line: ^(?:\h*#[^\n]*)|(?:\n\h*#[^\n]*)+: https://regex101.com/r/bppmDa/11/

I have absolutely no idea how to implement this, so that the tests are finally passed. It's easier to abandon the idea of supporting all the functionality of GraphQL =\ 1

Grammar here: https://github.com/railt/parser/blob/3df7977f7ba968aa623284cd331dcacef49b1667/resources/grammar.pp

I think the problem is that it indicated the validity of the comment in the root and it has a higher priority: https://github.com/railt/parser/blob/3df7977f7ba968aa623284cd331dcacef49b1667/resources/grammar.pp#L163

Document:
    Definition()*

Definition:
    # ...
    | UnionDefinition() 
    | Comment() # <= here
SerafimArts commented 7 years ago

Regex: (?:(?:\A|\n)\h*#[^\n]*)+\n\h*\b didnt works too: https://regex101.com/r/ox2BCl/1

Hywan commented 7 years ago

I would not add Comment inside Definition directly. I would add it where you can have actual annotations. Did you try this?

An alternative is to add <comment> everywhere. But everywhere is not in a lot of places actually: Only before a possible newline.

That's strange because the spec of GraphQL says that # … is a comment and should be ignored, but at the same time, it says it can serve as annotations, but only on specific structures of the language, right? That's really ambiguous… and so, I suggest to add comment everywhere where it is possible. That's not really convinient, but the GraphQL grammar is designed this way…

SerafimArts commented 7 years ago

An alternative is to add everywhere. But everywhere is not in a lot of places actually: Only before a possible newline.

I tried - it's incredibly slows down parsing code like this: https://github.com/railt/parser/blob/3df7977f7ba968aa623284cd331dcacef49b1667/tests/.resources/ast-ab-spec-tests/%2Bfull-comments.graphqls

That's strange because the spec of GraphQL says that # … is a comment and should be ignored, but at the same time, it says it can serve as annotations, but only on specific structures of the language, right? That's really ambiguous… and so, I suggest to add comment everywhere where it is possible. That's not really convinient, but the GraphQL grammar is designed this way…

Like PHP, JS+Flow and other languages ;)

PHP:

/** Test */
class /* asdasd */ Example {} // some
(new \ReflectionClass(Example::class))->getDocComment(); // -> "/** Test */"

JS:

/** @flow */
type A = String | Number; // Ignored comment
let a: Function = (example: A) => {};
SerafimArts commented 7 years ago

And this feature of the GraphQL is explained by the fact that the very idea of such a declaration of descriptions of types and fields was implemented already after the release of the first draft of the standard.

It began with Apollo client (https://github.com/apollographql/GitHunt-API/blob/01f91419f16c0eaf93ca74e64f270df8d9195f1b/api/sql/schema.js#L54), implemened here: https://github.com/graphql/graphql-js. And a proposal appeared after. Discussion here: https://github.com/facebook/graphql/pull/90#issuecomment-257002595 and below.

In fact this possibility is not documented, but it is supported by all popular implementations. Because ¯\_(ツ)_/¯

Hywan commented 7 years ago

I tried - it's incredibly slows down parsing code like this: https://github.com/railt/parser/blob/3df7977f7ba968aa623284cd331dcacef49b1667/tests/.resources/ast-ab-spec-tests/%2Bfull-comments.graphqls

It is slow if you use the tokens we have defined. If you use a simple token like %token comment #[^\n]+\n, it should be fast.

SerafimArts commented 7 years ago

It is slow if you use the tokens we have defined. If you use a simple token like %token comment #[^\n]+\n, it should be fast.

Let's assume. But in any case, such a grammar file will turn out to be unsupported (about 800 code lines currently)

SerafimArts commented 7 years ago

So at the moment the best and simplest option is the usual preprocessing of the source code (for example, converting # to ## in those places where such sections are allowed by the regex (?:(?:\A|\n)\h*#[^\n]*)+\n\h*\b and the definition of the rule %token T_DOCBLOCK ##.*)

Hywan commented 7 years ago

Let's assume. But in any case, such a grammar file will turn out to be unsupported (about 800 code lines currently)

Sorry, but I think I don't understand your point of view. Why most parts of the grammar will be unsupported?


I don't like this situation. It is what happens when the grammar is too weird unfortunately. In PHP, comments are marked as “skip”, except for docblock comment (/** … */), so there is a special kind of comments. Same for Javascript if I remind correctly.

The only alternative I see is: Either insert <comment> everywhere manually (I reckon it will not cover a lot of places), or ignore them for now, and add the support later. If I understand the problem correctly, GraphQL might add an annotation syntax in a near close future, right? So don't bother supporting a strange grammar if it changes in few months. From my own experience, implementing a whole grammar in one shot is difficult. You can postpone this task since it is not required to recognize valid GraphQL messages.

A pre-processor is a good idea, despite the performance impact. However, GraphQL payloads are generally of small sizes, so the impact should be minimal. Therefore, I think it is a good trade-off.

How can I help you now? I don't have all the time I would like to dig into your problem deeply unfortunately. I will do my best though if you have more questions!

SerafimArts commented 7 years ago

Sorry, but I think I don't understand your point of view. Why most parts of the grammar will be unsupported?

Large amount of code is always hard to read. And its increase and littering can lead to problems.

If I understand the problem correctly, GraphQL might add an annotation syntax in a near close future, right?

It is already implemented in the JS library, so it is unlikely =\

How can I help you now? I don't have all the time I would like to dig into your problem deeply unfortunately. I will do my best though if you have more questions!

I can suggest a solution to the problem - this is an opportunity to overload the "skip tokens" in the rules (group nodes), but as you said - it's huge problem:

%skip some ...
%token any ...

example_rule:
    <some>? <any>

So I think I think it's no more, thank you. =)

An opportunity to overload the "skip tokens" in the rules (group nodes)

Or provide an opportunity to declare a token that can be located anywhere.

flip111 commented 5 years ago

@SerafimArts is there anything you like to see changed in the hoa/compiler library? If so could you summarize/formulate this change? Or do you have more questions?

SerafimArts commented 5 years ago

@flip111 What I would like to see violates backward compatibility with existing code.

That's why I made a fork and completely rewrote almost all the code from scratch =)))

flip111 commented 5 years ago

@SerafimArts ok let's close this issue then. By the way i'm interested to see your fork. Do you have a link?

SerafimArts commented 5 years ago

You could find this fork here: https://github.com/hoaproject/Compiler/network

But it's not difficult for me to add links: FS Abstraction Layer: https://github.com/railt/io Lexer: https://github.com/railt/lexer Parser: https://github.com/railt/parser Compiler: https://github.com/railt/compiler

flip111 commented 5 years ago

@SerafimArts maybe you can mention hoa/compiler in the readme of railt/parser for saying thanks

SerafimArts commented 5 years ago

@flip111 Like this? https://github.com/railt/parser/blob/1.4.x/composer.json#L22-L29

Please create a pull request if you want more or some other format for mentioning Hoa. But it should be borne in mind that there is almost no code from the original.

This is the only place where you can find similarities: 1) Original 2) Fork