Parsed `hclsyntax.Body` has slightly different range depending on what type of comment is at the top

rcjsuen commented 5 hours ago

I would expect this to print the same thing six times but the /**/ seems to behave differently if it's on the first line.

false
true
true
true
true
true

package main

import (
    "fmt"

    "github.com/hashicorp/hcl/v2"
    "github.com/hashicorp/hcl/v2/hclsyntax"
)

func containsInitialPos(content string) bool {
    file, _ := hclsyntax.ParseConfig([]byte(content), "", hcl.InitialPos)
    body, _ := file.Body.(*hclsyntax.Body)
    return body.Range().ContainsPos(hcl.InitialPos)
}

func main() {
    fmt.Println(containsInitialPos("/**/\ntarget{}"))
    fmt.Println(containsInitialPos("#\ntarget{}"))
    fmt.Println(containsInitialPos("//\ntarget{}"))
    fmt.Println(containsInitialPos("\n/**/\ntarget{}"))
    fmt.Println(containsInitialPos("\n#\ntarget{}"))
    fmt.Println(containsInitialPos("\n//\ntarget{}"))
}

rcjsuen commented 4 hours ago

I guess this is the same thing as #551...? 🤔

Noticed it while trying to do LSP stuff just like https://github.com/hashicorp/terraform-ls/issues/1052.

apparentlymart commented 2 hours ago

The "peeker" component is, when doing a normal parse, configured to filter out all comments (the boolean argument here is includeComments):

https://github.com/hashicorp/hcl/blob/3883feb0e06458153e8a1b56566582b6725330d2/hclsyntax/public.go#L22

This then interacts with a quirk of the grammar: the two single-line comment styles absorb their trailing newline because that's the marker that indicates the token has ended, but the multi-line comment style ends with */ and so anything that comes after it -- including a newline -- is not part of the comment token. (Notice that in the following only the first to productions actually include EndOfLine at the end.)

https://github.com/hashicorp/hcl/blob/3883feb0e06458153e8a1b56566582b6725330d2/hclsyntax/scan_tokens.rl#L68-L80

However, because newlines are sometimes significant in parts of the grammar (any time we're parsing the series of items in a body), the peeker has a special case where instead of dropping the single-line tokens entirely it quietly replaces them with TokenNewline:

https://github.com/hashicorp/hcl/blob/3883feb0e06458153e8a1b56566582b6725330d2/hclsyntax/peeker.go#L85-L102

With all of that in mind, if we remove the comment tokens in the same way the peeker would before entry into the parser, the six different inputs as observed by the parser are:

"\ntarget{}"
"\ntarget{}"
"\ntarget{}"
"\n\ntarget{}"
"\n\ntarget{}"
"\n\ntarget{}"

This is a somewhat-confusing outcome because it suggests that examples 1, 2, and 3 should be treated the same. However, I suspect that that the trick is in the comment in one of the code snippets I included above:

https://github.com/hashicorp/hcl/blob/3883feb0e06458153e8a1b56566582b6725330d2/hclsyntax/peeker.go#L91-L98

In examples 2 and 3, the initial TokenNewline has the byte sequence \n, but it has a source range covering the entire original comment, starting at byte 0. However, in example 1 the initial TokenNewline is a real \n rather than a weird trick, and so its source range is actually accurate and presumably starts at byte 5. Examples 4-6 all start with a "real" newline, regardless of any comments, so their first token also presumably has Byte: 0.

hcl.InitialPos is hcl.Pos{Line: 1, Column: 1, Byte: 0}, and hcl.Range.ContainsPos is really just asking whether the Byte of the given position is >= the Byte of the range's Start and < the Byte of the range's End. Therefore the first example (whose first parser-visible token starts at Byte: 5) does not contain hcl.InitialPos, while the others do.

Assuming all of the above is correct -- I've only read the code and imagined how it would behave with each input, not tried to validate this by executing it -- one possible way to make this work would be to decide that, despite what I wrote in that comment above, the complexity is justified to work to calculate the true location of the \n byte at the end of a single-line comment. That would then make 1-3 all return false and then 4-6 all return true, which is at least consistent, but not really what https://github.com/hashicorp/terraform-ls/issues/1052 seems to want.

I guess the need for https://github.com/hashicorp/terraform-ls/issues/1052 is to ensure that the root body resulting from a hclsyntax.ParseConfig always contains the given start position, even when the root body starts with a comment.

A different answer then is to change the rules for peeker.PrevRange in the special case where no tokens have been consumed yet (which is how the parser ultimately ends up deciding the "start" of the topmost body):

https://github.com/hashicorp/hcl/blob/3883feb0e06458153e8a1b56566582b6725330d2/hclsyntax/peeker.go#L65-L71

The if statement here is simply avoiding trying to access p.Tokens[-1] in the special case where no tokens have been consumed yet. The current rule is to just take the position of the first token instead, which is a fine answer unless the first token is a multi-line TokenComment that the peeker skips, because then that token gets excluded from the root body's range.

Instead then, perhaps the peeker should have a new field where it explicitly stores the starting position given to hclsyntax.ParseConfig (or any of the other similar hclsyntax.Parse* functions), and then peeker.PrevRange would just return that verbatim if no tokens have been consumed yet. That would allow the parser to keep the promise that the root body always starts at the initial parsing position, regardless of what the first token actually is.

This implies changing the signature of newPeeker to take the starting position as an argument, but all five of the functions that call it have a start hcl.Pos argument so that shouldn't be a problem.

I hope that's all useful!

hashicorp / hcl

Parsed `hclsyntax.Body` has slightly different range depending on what type of comment is at the top #702