google / starlark-go

Starlark in Go: the Starlark configuration language, implemented in Go
BSD 3-Clause "New" or "Revised" License
2.26k stars 204 forks source link

FR: Use StringDict-like return value for Thread.Load #463

Open ashi009 opened 1 year ago

ashi009 commented 1 year ago

I'm implementing a stardoc-like program (extracting docs from bzl files). In such a program, we don't really need to know the concrete types of foreign symbols. For now, the Thread.Load returns a StringDict, which means I need to resolve the loading bzl path, and exec that file just to make the Exec work.

I'm proposing that let Thread.Load returns a StringDict-like interface with Get and Keys methods instead. Which will allow users to mock a load.

The Thread.Load is used as follow:

https://github.com/google/starlark-go/blob/4b1e35fe22541876eb7aa2d666416d865d905028/starlark/interp.go#L566-L587

adonovan commented 1 year ago

I'm implementing a stardoc-like program (extracting docs from bzl files). In such a program, we don't really need to know the concrete types of foreign symbols.

That's not true in general. If you intend to produce the documentation for all bindings in a module as a dynamic analysis--i.e. by loading the module, causing the execution of all its top-level statements including possible side effects on the host application, or even nontermination--then the exact definitions of everything loaded do potentially matter. (I think the Bazel documentation tools made a design mistake by taking the dynamic route.)

Why not produce documentation statically, that is, by parsing the source file and looking for the statements that define top-level variables? This route is guaranteed to produce its answer deterministically, efficiently, and robustly.

Loosely related:

ashi009 commented 1 year ago

I'm implementing a stardoc-like program (extracting docs from bzl files). In such a program, we don't really need to know the concrete types of foreign symbols.

That's not true in general. If you intend to produce the documentation for all bindings in a module as a dynamic analysis--i.e. by loading the module, causing the execution of all its top-level statements including possible side effects on the host application, or even nontermination--then the exact definitions of everything loaded do potentially matter. (I think the Bazel documentation tools made a design mistake by taking the dynamic route.)

Correct, but good enough in terms of doc generation.

Why not produce documentation statically, that is, by parsing the source file and looking for the statements that define top-level variables? This route is guaranteed to produce its answer deterministically, efficiently, and robustly.

Although Starlark is designed to replace the way-too-flexible python used in the early versions of Bazel, the core concepts like rule, aspect, and provider remain dynamically defined at runtime.

Static analysis won't work for those dynamically defined constructors (and they are not rare!) That's why Stardoc is considering building the doc gen into bazel (https://github.com/bazelbuild/stardoc/blob/master/docs/future_plans.md). But I don't think that will come any time soon.

Until then, I want to have a zero-config replacement for Stardoc which requires lots of instruments to get even the most basic docs.

adonovan commented 1 year ago

the core concepts like rule, aspect, and provider remain dynamically defined at runtime.

Fair point about the dynamism of the language being a limit on what you can achieve using the static approach. (I mistakenly thought you were making an analogy with .bzl as opposed to meaning it literally.)

Correct, but good enough in terms of doc generation.

I'm glad you linked the Starlark future plans doc, because I think it is evidence that the "dynamic fake" approach is actually not good enough for doc generation, and that really it's better to get the build tool to execute the initialization of the .bzl files using 100% accurate semantics and then dump an easily parseable representation of the dynamic effects (e.g. rules, providers, aspects) from which you can generate documentation. (I argued for this approach while I was working on the Java implementation of Starlark used by Bazel, as Stardoc was the reason for a large number of hacky intrusions into the intepreter that would not have been necessary had Stardoc called "bazel query" as a subprocess.)

ashi009 commented 1 year ago

the core concepts like rule, aspect, and provider remain dynamically defined at runtime.

Fair point about the dynamism of the language being a limit on what you can achieve using the static approach. (I mistakenly thought you were making an analogy with .bzl as opposed to meaning it literally.)

Correct, but good enough in terms of doc generation.

I'm glad you linked the Starlark future plans doc, because I think it is evidence that the "dynamic fake" approach is actually not good enough for doc generation, and that really it's better to get the build tool to execute the initialization of the .bzl files using 100% accurate semantics and then dump an easily parseable representation of the dynamic effects (e.g. rules, providers, aspects) from which you can generate documentation. (I argued for this approach while I was working on the Java implementation of Starlark used by Bazel, as Stardoc was the reason for a large number of hacky intrusions into the intepreter that would not have been necessary had Stardoc called "bazel query" as a subprocess.)

I'd love to have this built into the bazel as a subprocess. But I'm a little skeptical about when it will be ready (almost 2 years overdue already). My somewhat production-ready experiment with the patched starlark-go is about 1000-loc, and already delivering better results compared to stardoc, moreover, it doesn't need bzl_library at all.

From this perspective, I really want to have this patch to be merged (after some revising on naming and doc).

adonovan commented 1 year ago

From this perspective, I really want to have this patch to be merged (after some revising on naming and doc). I'm proposing that let Thread.Load returns a StringDict-like interface with Get and Keys methods instead.

Thread.Load cannot be changed without breaking API compatibility. We could add a "Thread.Load2" hook that takes precedence, but I still don't really understand how StringDictLike is materially different from StringDict. Both data types have operations to get the keys, test key membership, or look up a value, but in order to implement any of those operations you need to have executed the module and found its set of top-level key/value pairs. A load statement will always perform at least one Get operation, and the execution of the statements after the load statement will require the correct value to have been "got". So how does it help?

ashi009 commented 1 year ago

Here is how this trick plays out. Instead of loading top-level variables, I can pretend that they are loaded -- just unknown. In Bazel context, this won't hurt much -- as a rule rarely need a symbol to complete rule definition -- they might need a foreign symbol to complete its implementation.

func (ld *Loader) load(t *starlark.Thread, module string) (starlark.StringDictLike, error) {
    l, err := label.Parse(module)
    if err != nil {
        return nil, err
    }
    fsys, ok := ld.moduleRoots[l.Repo]
    if !ok {
        return unknownModule{l.String()}, nil
    }
...
}

type unknownModule struct{ name string }

func (m unknownModule) Keys() []string { return nil }
func (m unknownModule) Has(name string) bool {
    glog.Warningf("Unresolved symbol %q from %q", name, m.name)
    return true
}
func (m unknownModule) Get(name string) starlark.Value {
    return UnknownSymbol{}
}

I'm open to the idea of introducing Load2, but I'd argue that starlark-go is not hitting v1.0 yet, and perhaps a cleaner API is preferable?

adonovan commented 1 year ago

I still don't understand. If the point of the trick is to allow you to execute the file containing the load statement without knowing anything about the module that it loads, then what will happen when execution encounters a statement that causes Get to be called? It will create an UnknownSymbol value and then execution will go wrong. How is that useful?

If you don't care about the actual values in the module it's easy to fake one by creating a real StringDict whose keys are the names mentioned in the load statement and whose values are all UnknownSymbol.

ashi009 commented 1 year ago

I still don't understand. If the point of the trick is to allow you to execute the file containing the load statement without knowing anything about the module that it loads, then what will happen when execution encounters a statement that causes Get to be called? It will create an UnknownSymbol value and then execution will go wrong. How is that useful?

Frist of all the UnknownSymbol is both callable and acts as a dict and will always return another UnknownSymbol for further invocations.

type UnknownSymbol struct{}

var (
    _ starlark.HasAttrs = UnknownSymbol{}
    _ starlark.Callable = UnknownSymbol{}
)

Second, generating docs only requires executing the part of code that declares rule, provider, and aspect. A ruleset rarely needs symbols from another repo to complete its rule definition. As long as you are not executing the rule's implementation function, nothing will fail.

This statement holds against a dozen of top rulesets.

If you don't care about the actual values in the module it's easy to fake one by creating a real StringDict whose keys are the names mentioned in the load statement and whose values are all UnknownSymbol.

Well, this is not easy to pull off with the existing high-level APIs:

The alternative is to use:

However, this requires no API change, and I'm willing to give it a shot.

adonovan commented 1 year ago

The alternative is to use: https://pkg.go.dev/go.starlark.net/syntax#Parse to parse the source https://pkg.go.dev/go.starlark.net/syntax#Walk to find the load statements and extract those names and create a StringDict https://pkg.go.dev/go.starlark.net/starlark#FileProgram to load the parsed syntax.File However, this requires no API change, and I'm willing to give it a shot.

Right, that's what I had in mind. I don't think you need Walk, BTW, since LoadStmts are all at top level.