Open rcjsuen opened 5 hours ago
I guess this is the same thing as #551...? 🤔
Noticed it while trying to do LSP stuff just like https://github.com/hashicorp/terraform-ls/issues/1052.
The "peeker" component is, when doing a normal parse, configured to filter out all comments (the boolean argument here is includeComments
):
This then interacts with a quirk of the grammar: the two single-line comment styles absorb their trailing newline because that's the marker that indicates the token has ended, but the multi-line comment style ends with */
and so anything that comes after it -- including a newline -- is not part of the comment token. (Notice that in the following only the first to productions actually include EndOfLine
at the end.)
However, because newlines are sometimes significant in parts of the grammar (any time we're parsing the series of items in a body), the peeker has a special case where instead of dropping the single-line tokens entirely it quietly replaces them with TokenNewline
:
With all of that in mind, if we remove the comment tokens in the same way the peeker would before entry into the parser, the six different inputs as observed by the parser are:
"\ntarget{}"
"\ntarget{}"
"\ntarget{}"
"\n\ntarget{}"
"\n\ntarget{}"
"\n\ntarget{}"
This is a somewhat-confusing outcome because it suggests that examples 1, 2, and 3 should be treated the same. However, I suspect that that the trick is in the comment in one of the code snippets I included above:
In examples 2 and 3, the initial TokenNewline
has the byte sequence \n
, but it has a source range covering the entire original comment, starting at byte 0. However, in example 1 the initial TokenNewline
is a real \n
rather than a weird trick, and so its source range is actually accurate and presumably starts at byte 5. Examples 4-6 all start with a "real" newline, regardless of any comments, so their first token also presumably has Byte: 0
.
hcl.InitialPos
is hcl.Pos{Line: 1, Column: 1, Byte: 0}
, and hcl.Range.ContainsPos
is really just asking whether the Byte
of the given position is >= the Byte
of the range's Start
and <
the Byte
of the range's End
. Therefore the first example (whose first parser-visible token starts at Byte: 5
) does not contain hcl.InitialPos
, while the others do.
Assuming all of the above is correct -- I've only read the code and imagined how it would behave with each input, not tried to validate this by executing it -- one possible way to make this work would be to decide that, despite what I wrote in that comment above, the complexity is justified to work to calculate the true location of the \n
byte at the end of a single-line comment. That would then make 1-3 all return false
and then 4-6 all return true
, which is at least consistent, but not really what https://github.com/hashicorp/terraform-ls/issues/1052 seems to want.
I guess the need for https://github.com/hashicorp/terraform-ls/issues/1052 is to ensure that the root body resulting from a hclsyntax.ParseConfig
always contains the given start position, even when the root body starts with a comment.
A different answer then is to change the rules for peeker.PrevRange
in the special case where no tokens have been consumed yet (which is how the parser ultimately ends up deciding the "start" of the topmost body):
The if
statement here is simply avoiding trying to access p.Tokens[-1]
in the special case where no tokens have been consumed yet. The current rule is to just take the position of the first token instead, which is a fine answer unless the first token is a multi-line TokenComment
that the peeker skips, because then that token gets excluded from the root body's range.
Instead then, perhaps the peeker
should have a new field where it explicitly stores the starting position given to hclsyntax.ParseConfig
(or any of the other similar hclsyntax.Parse*
functions), and then peeker.PrevRange
would just return that verbatim if no tokens have been consumed yet. That would allow the parser to keep the promise that the root body always starts at the initial parsing position, regardless of what the first token actually is.
This implies changing the signature of newPeeker
to take the starting position as an argument, but all five of the functions that call it have a start hcl.Pos
argument so that shouldn't be a problem.
I hope that's all useful!
I would expect this to print the same thing six times but the
/**/
seems to behave differently if it's on the first line.