Specify starting line and column for expression parser ParseExpr

ghost commented 3 years ago

One of the file formats in the application that I am developing uses Starlark for writing expressions. The Starlark expressions are surrounded by source code written in another language.

I want to specify a starting line number and column number when parsing expressions so that errors point back to the original source code.

Here's a suggested function for the syntax package.

// ParseExpr parses a Starlark expression in src. A comma-separated
// list of expressions is parsed as a tuple. Positions in the resulting 
// expression are based on pos.
func ParseExprPosition(pos Position, src []byte, mode Mode) (expr Expr, err error)

The proposed function does not have the usual filename, src pattern as in other functions in the syntax package because it's designed to work on a snippet from a larger file.

My feature request is for a way to specify the starting line and column number for the scanner, not for the specific function shown above.

My fallback plan is to rewrite positions in syntax.Error and starlark.EvalError using a source map that I maintain on the side. I can do that, but it seems that other users of Starlark expressions will have a similar requirement.

alandonovan commented 3 years ago

It's not necessary to change the API; you can do this by decorating the expression text with whitespace and parens. If you want an expression to start at line L, column C, then feed the parser (L-1)*"\n" + "("*(C-1) + expr + ")"*(C-1).

ghost commented 3 years ago

That works. Thank you.

It seems that it will be more common to pass a file snippet to syntax.ParseExpr than an entire file. It would be nice if the package supported the common scenario more directly than the suggested workaround. The workaround for the package limitation was not obvious to me. It may not be obvious to others.

Edit: I noticed from the comment history that your first attempt at the workaround was not correct. If you reject the request to add a new API, then please document the workaround for others.

alandonovan commented 3 years ago

I don't want to expand the API, but I agree this trick is worth recording in a comment at ParseExpr.

ghost commented 3 years ago

The workaround does not work when there are mismatched parentheses in the input.

_, err := syntax.Parse("x", "(", 0)
fmt.Println(err)  // prints x:1:2: got end of file, want primary expression

    // Wrap "(" with ((( ))) for input starting at column 4.
_, err = syntax.Parse("x", "(((()))", 0)
fmt.Println(err)  // prints x:1:8: got end of file, want ')'

I want position x:1:4 for the second snippet, but got x:1:8. Column 8 is outside the range of the original input,.

I am investigating two options:

rewrite positions in errors using a source map.
copy the relevant parts of the syntax package to my own package so I can make the fix.

alandonovan commented 3 years ago

Sorry, I should have elaborated. (In the worst case, wrapping in parens may even make an illegal expression valid, such as 1) + (2.)

One needs to parse twice, once to verify that the bare expression is valid, and once to generate the syntax tree with the correct positions. The errors from the first parse will have the wrong position, and will need to be offset by (L, C). Nonetheless, it's easily done using a helper function that need not belong in the syntax package.

ghost commented 3 years ago

it's easily done using a helper function that need not belong in the syntax package

My request is to add a way to specify the starting line and column for the scanner, not for the addition of a helper function that works around the limitation.

alandonovan commented 3 years ago

My request is to add a way to specify the starting line and column for the scanner, not for the addition of a helper function that works around the limitation.

Understood. But put another way, you're asking for an expansion of the API to solve a problem that can be solved with the existing API. The bar for such requests must be high.

ghost commented 3 years ago

My claim is that most users of syntax.ParseExpr need to map positions reported in errors back to an original source code location. The history in this thread shows that the workaround is tricky. I think my request meets a high bar given the common requirement and the nature of the workaround.

How about encoding the starting line and column in the filename?

If src != nil, ParseFile parses the source from src and the filename is only used when recording position information. The type of the argument for the src parameter must be string, []byte, or io.Reader. If the filename has the suffix :[0-9]+:[0-9]+, then the suffix is removed from the file name and the two numbers are used as the starting line and column for the source. If src == nil, ParseFile parses the file specified by filename.

Replace the : with \t, \n, or \x00 if there's concern that the suffix matches an actual filename.

Another idea is to support a new type in the src parameter:

If src != nil, ParseFile parses the source from src and the filename is only used when recording position information. The type of the argument for the src parameter must be string, []byte, io.Reader or interface { Src() []byte; Pos() Position }. If src == nil, ParseFile parses the file specified by filename.

I used an anonymous interface type to avoid adding more API to the package, but I think it would be better to declare a named type.

Both of these suggestions address the issue in all functions that use filename string, src interface{} arguments. In particular, these suggestions make the starlark.ExprFunc and starlark.Eval helper functions useful to applications like mine.

Edit: There are scenarios where the suggestions in this comment are useful for parsing statements. One example is a template engine that uses Starlark for control flow in addition to evaluating expressions. The Tornado template engine is an example of this idea from the Python world. The Tornado template engine transliterates template source to Python. As they say in their documentation, error-reporting is "interesting" with this approach. The ((())) trick does not work with statements, but I am sure you have a workaround for statements. The point I am trying to make is that embedding Starlark expressions and statements in some other language is useful. It will be great if the package can smooth over the rough edges for these scenarios.

alandonovan commented 3 years ago

most users of syntax.ParseExpr need to map positions reported in errors back to an original source code location. [...] One example is a template engine [...]

That's a good point and a good example.

How about encoding the starting line and column in the filename? [...] Another idea is to support a new type in the src parameter:

Nearly every string is a legal filename (only slash and NUL are disallowed) so I don't want this package to get into the business of interpreting those strings. I like your idea of supporting a new named type for the src parameter; this has the dual benefits of not adding new functions to the API, while also of allowing future extensibility should we ever need to add more options to the parser or scanner. For example:

type Source struct {
    Content []byte // content of file
    Filename string // name of file as it appears in errors and syntax tree
    Line, Column int // line and column of Content[0], typically (1, 1)
}

Ideally the zero value would be a sensible default for all fields of Source, but that would require specifying that Line and Column are each 1 less than the desired figure (in effect, LineDelta and ColumnDelta).

ghost commented 3 years ago

The Source type meets my needs , but I do have a some comments on the design:

The filename is available through the argument and Source. Removing the field from Source eliminates ambiguity over which value takes precedence. On the other hand, it's nice to have all of the source information clumped together in one struct.
I cannot think of a scenario where a (0, 0) is a valid value for (Line, Column). If (0, 0) is indeed invalid, then the scanner init function can convert (0, 0) to (1, 1).
LineDelta is added to all positions, but ColumnDelta is added to positions on the first line only. The name ColumnDelta may be confusing because the name might imply that the delta is applied to all positions.

alandonovan commented 3 years ago

You're quite right that filename creates redundancy and so doesn't belong. (It's too bad, because this Source data type would otherwise have made for a nicer API if we were starting over. There's an argument to be made that all public API functions should accept a struct whose zero value is acceptable, so that new fields can always be added in a compatible way. I wonder what a language would look like that unified the notions of function call arguments and structs.)

Let's assume that users will use this only in the special case (like yours) in which they need control over line/col. In that case, we can expect that they will be always provide the line/col values.

How about:


// A FilePortion describes the content of a portion of a file.
// Provide it as the src argument of ExecFile when the desired
// initial line and column numbers are not (1, 1).
type FilePortion struct {
    Content []byte
    FirstLine, FirstColumn int
}

ghost commented 3 years ago

The proposed FilePortion looks good.

I want to give you a heads up that we have a scenario where FirstColumn can be zero or negative. We add a prefix and suffix to user code snippets to create a valid statement for the parser. To align the user's file positions with the positions in the syntax tree, we will set FirstColumn to userFirstColumn - utf8.RuneCount(prefix). The user entered first column can be less than the length of the prefix.

google / starlark-go

Specify starting line and column for expression parser ParseExpr #346