Inline comments with # getting associated with the wrong source-code element?

I seem to be running into a parser issue, but can't be sure if this is a bug or as-intended. It seems buggy.

Code:

$code = '<?php

$foo = 1;

$foo !== 2; # just because

$bar = 2;
';

$nodes = $parser->parse($code);

var_dump($nodes);

Output:

array(3) {
  [0] =>
  class PhpParser\Node\Expr\Assign#10 (4) {
    public $var =>
    class PhpParser\Node\Expr\Variable#8 (3) {
      public $name =>
      string(3) "foo"
      private $subNodeNames =>
      NULL
      protected $attributes =>
      array(2) {
        'startLine' =>
        int(3)
        'endLine' =>
        int(3)
      }
    }
    public $expr =>
    class PhpParser\Node\Scalar\LNumber#9 (3) {
      public $value =>
      int(1)
      private $subNodeNames =>
      NULL
      protected $attributes =>
      array(2) {
        'startLine' =>
        int(3)
        'endLine' =>
        int(3)
      }
    }
    private $subNodeNames =>
    NULL
    protected $attributes =>
    array(2) {
      'startLine' =>
      int(3)
      'endLine' =>
      int(3)
    }
  }
  [1] =>
  class PhpParser\Node\Expr\BinaryOp\NotIdentical#13 (4) {
    public $left =>
    class PhpParser\Node\Expr\Variable#11 (3) {
      public $name =>
      string(3) "foo"
      private $subNodeNames =>
      NULL
      protected $attributes =>
      array(2) {
        'startLine' =>
        int(5)
        'endLine' =>
        int(5)
      }
    }
    public $right =>
    class PhpParser\Node\Scalar\LNumber#12 (3) {
      public $value =>
      int(2)
      private $subNodeNames =>
      NULL
      protected $attributes =>
      array(2) {
        'startLine' =>
        int(5)
        'endLine' =>
        int(5)
      }
    }
    private $subNodeNames =>
    NULL
    protected $attributes =>
    array(2) {
      'startLine' =>
      int(5)
      'endLine' =>
      int(5)
    }
  }
  [2] =>
  class PhpParser\Node\Expr\Assign#17 (4) {
    public $var =>
    class PhpParser\Node\Expr\Variable#15 (3) {
      public $name =>
      string(3) "bar"
      private $subNodeNames =>
      NULL
      protected $attributes =>
      array(3) {
        'comments' =>
        array(1) {
          [0] =>
          class PhpParser\Comment#14 (2) {
            protected $text =>
            string(16) "# just because\r\n"
            protected $line =>
            int(5)
          }
        }
        'startLine' =>
        int(7)
        'endLine' =>
        int(7)
      }
    }
    public $expr =>
    class PhpParser\Node\Scalar\LNumber#16 (3) {
      public $value =>
      int(2)
      private $subNodeNames =>
      NULL
      protected $attributes =>
      array(2) {
        'startLine' =>
        int(7)
        'endLine' =>
        int(7)
      }
    }
    private $subNodeNames =>
    NULL
    protected $attributes =>
    array(3) {
      'comments' =>
      array(1) {
        [0] =>
        class PhpParser\Comment#14 (2) {
          protected $text =>
          string(16) "# just because\r\n"
          protected $line =>
          int(5)
        }
      }
      'startLine' =>
      int(7)
      'endLine' =>
      int(7)
    }
  }
}

The comment trailing the $foo !== 2; expression (index 1) somehow gets associated with the $bar = 2; assignment (index 2) which seems odd.

Maybe the intention here is for doc-blocks to get correctly associated with the node following the doc-block... Are all comments deliberately associated with the following node whether they're doc-blocks or not?

That seems odd, and appears to be lead to orphaned last comment nodes - example:

$code = '<?php

$foo !== 2; # just because
';

$nodes = $parser->parse($code);

var_dump($nodes);

Output:

array(1) {
  [0] =>
  class PhpParser\Node\Expr\BinaryOp\NotIdentical#10 (4) {
    public $left =>
    class PhpParser\Node\Expr\Variable#8 (3) {
      public $name =>
      string(3) "foo"
      private $subNodeNames =>
      NULL
      protected $attributes =>
      array(2) {
        'startLine' =>
        int(3)
        'endLine' =>
        int(3)
      }
    }
    public $right =>
    class PhpParser\Node\Scalar\LNumber#9 (3) {
      public $value =>
      int(2)
      private $subNodeNames =>
      NULL
      protected $attributes =>
      array(2) {
        'startLine' =>
        int(3)
        'endLine' =>
        int(3)
      }
    }
    private $subNodeNames =>
    NULL
    protected $attributes =>
    array(2) {
      'startLine' =>
      int(3)
      'endLine' =>
      int(3)
    }
  }
}

The comment is lost.

I don't know how you'd get around that, since comments are associated forwards with the next node, if there is no next node. I guess you'd need something like an "end of file" node with which the orphaned trailing comment could be associated?

Even so, that would seem kind of patchy, since what's causing the problem in the first place, is the fact that comments aren't treated as nodes, but as "attributes" of nodes - which is true of /** doc-blocks; these are attributes of the following node - but other types of comments really aren't.

Perhaps this is consistent with the native PHP lexer/parser, and if so, an "end of file" node might be the most reasonable solution (?)

I'm using the release 1.3.0 version.

Oddly, even if I insert a junk null statement at the end of the file, the comment doesn't get associated with that either.

$code = '<?php

$foo !== 2; # just because

;null;
';

$nodes = $parser->parse($code);

var_dump($nodes);

Output:

array(2) {
  [0] =>
  class PhpParser\Node\Expr\BinaryOp\NotIdentical#10 (4) {
    public $left =>
    class PhpParser\Node\Expr\Variable#8 (3) {
      public $name =>
      string(3) "foo"
      private $subNodeNames =>
      NULL
      protected $attributes =>
      array(2) {
        'startLine' =>
        int(3)
        'endLine' =>
        int(3)
      }
    }
    public $right =>
    class PhpParser\Node\Scalar\LNumber#9 (3) {
      public $value =>
      int(2)
      private $subNodeNames =>
      NULL
      protected $attributes =>
      array(2) {
        'startLine' =>
        int(3)
        'endLine' =>
        int(3)
      }
    }
    private $subNodeNames =>
    NULL
    protected $attributes =>
    array(2) {
      'startLine' =>
      int(3)
      'endLine' =>
      int(3)
    }
  }
  [1] =>
  class PhpParser\Node\Expr\ConstFetch#12 (3) {
    public $name =>
    class PhpParser\Node\Name#11 (3) {
      public $parts =>
      array(1) {
        [0] =>
        string(4) "null"
      }
      private $subNodeNames =>
      NULL
      protected $attributes =>
      array(2) {
        'startLine' =>
        int(5)
        'endLine' =>
        int(5)
      }
    }
    private $subNodeNames =>
    NULL
    protected $attributes =>
    array(2) {
      'startLine' =>
      int(5)
      'endLine' =>
      int(5)
    }
  }
}

The comment before the junk null statement is simply dropped.

I really think that attempting to associate comments as attributes of nodes is a mistake - comment nodes probably should have been represented as just comment nodes. Establishing the association of comments as attributes of (e.g.) class/interface declarations, doesn't really seem like a parser concern? With the possible exception of doc-blocks which do proceed a declaration that can have the previous doc-block applied as an attribute... (?)

Regarding the null statement, the problem is the first ; in front of ;null; -- The comments will be associated with the ; statement, which is not represented in the AST. If you write null; instead the comments should turn up there.

Regarding the actual problem: Yes, you're correct in saying that assigning comments as attributes of the next node is a rather inaccurate representation. The thing is that this parser constructs an AST, where the "A" is for "abstract", which means it's not supposed to include comments or whitespace information. Comments are primarily collected for the sake of doc comments, which are often important for analysis (and in the case of annotations are really part of the code).

Comments can turn up virtually anywhere, e.g. Foo /* foo */ \ /* bar */ Bar is a valid namespace name and I wouldn't know how one would represent something like that without switching to a different (CFT) tree structure.

When I'm doing source-to-source transformations that are supposed to preserve comments and whitespace, I do it by enabling file/token offsets in the lexer and directly modify the code/tokens based on the offsets stored in the ASTs. Haven't found a way to reliably to these transformations directly on the AST yet.

Yeah, I see the problem. Hmm, too bad.

I can tell you what I had in mind - I think it's a pretty cool idea, I got it from Nim's test approach, and discovered that this is valid syntax and possible in PHP as well.

test(
    'My test description',
    function () {
        $foo = 1;
        $bar = 2;

        $foo < $bar; # foo is less than bar
        $foo + 1 === $bar; # and so forth
    }
);

The test() function would obtain the source-code of the closure (I got that working) and then process any stand-alone expressions, generating an assertion for each of them, e.g. transforming the function body into something like:

test(
    'My test description',
    function () {
        $foo = 1;
        $bar = 2;

        ok($foo < $bar, '$foo < $bar', 'foo is less than bar');
        ok($foo + 1 === $bar, '$foo + 1 === $bar', 'and so forth');
    }
);

From which the test-runner output would be:

=== My test description ===

* OK: foo is less than bar: $foo < $bar
* OK: and so forth: $foo + 1 === $bar

I love this test format - not having to write assertion statements, but just simply writing stand-alone expressions with comments, would make for a very low-noise test format. You could even eliminate most of the comments by just giving meaningful names to variables, which would make the test (and output) practically self-explanatory.

For that matter, you could even place the tests in external files, eliminating the use of closures and any test-framework API at all :-)

Perhaps I can use the Lexer stand-alone without the Parser to implement this? At least that would be better than token_get_all().

Do you know of any other implementation of a CFT for PHP?

Or would it make sense to introduce an additional layer in your parser? (not asking you to do the work here of course, just asking if you think it's feasible - or would the additional intermediary CFT model cause too much CPU and memory overhead?)

So, here is how I would go about your particular case: In the lexer, enable token end positions. Get the tokens from the current file via $lexer->getTokens(). Now you should get the trailing comment of a node using something similar to this (not tested):

function getTrailingComment(array $tokens, Node $node) /* : ?string */ {
    assert($node->hasAttribute('endTokenPos'));

    $pos = $node->getAttribute('endTokenPos');
    $endLine = $node->getAttribute('endLine');

    for (; $pos < count($tokens); ++$pos) {
        if (!is_array($tokens[$pos])) continue;
        list($type, $content, $line) = $tokens[$pos];
        if ($line > $endLine) break;
        if ($type === T_COMMENT || $type === T_DOC_COMMENT) {
            return $content;
        }
    }

    return null;
}

Perhaps I can use the Lexer stand-alone without the Parser to implement this? At least that would be better than token_get_all().

The lexer is just token_get_all() with a few compatibility hacks for different PHP versions ;)

Do you know of any other implementation of a CFT for PHP?

I think https://github.com/grom358/pharborist might be doing this. I'm not really sure about this, because the project sadly doesn't have documentation.

Or would it make sense to introduce an additional layer in your parser? (not asking you to do the work here of course, just asking if you think it's feasible - or would the additional intermediary CFT model cause too much CPU and memory overhead?)

Having an additional parse tree layer would indeed be problematic for perf. I can only see this as a separate mode of operation. But I've never looked into this much.

Any chance this improved with the 4.0 release?

Perhaps comments will be preserved like white space without modifications?

@mindplay-dk If formatting-preservation is enabled, then comments will be kept without modification, yes. The representation of the comments in the AST has stayed the same though.

nikic / PHP-Parser

Inline comments with # getting associated with the wrong source-code element? #201