google / cel-go

Fast, portable, non-Turing complete expression evaluation with gradual typing (Go)
https://cel.dev
Apache License 2.0
2.29k stars 222 forks source link

Resolve multiple packages with cel.Container #340

Closed sfllaw closed 4 years ago

sfllaw commented 4 years ago

Feature request checklist

Change

Currently, cel.Container can be used to specify a namespace for symbol resolution. For example:

env, _ := cel.NewEnv(
    cel.Container("a"),
    cel.Types(&apb.A{}),
)
ast, _ := env.Compile("A{}")
prg, _ := env.Program(ast)
prg.Eval(map[string]interface{}{})

In order to write programs that reference multiple protobuf packages, we are forced to use fully qualified names for types outside the container:

env, _ := cel.NewEnv(
    cel.Container("a"),
    cel.Types(&apb.A{}, &bpb.B{}),
)
ast, _ := env.Compile("b.B{}")
prg, _ := env.Program(ast)
prg.Eval(map[string]interface{}{})

Example

We should provide the ability to resolve multiple namespaces. There are two options:

  1. Container could take multiple package names:

    func Container(pkgs ...string) EnvOption
  2. Env could be initialized with a custom packages.Packager:

    func Package(pkg packages.Packager) EnvOption

Alternatives considered

It’s possible that both these methods would be useful. Here is a sample implementation:

package cel

func Container(pkgs ...string) EnvOption {
    return func(e *Env) (*Env, error) {
        return e.Package(packages.NewPackage(pkgs...))
    }
}

func Package(pkg packages.Packager) EnvOption {
    return func(e *Env) (*Env, error) {
        e.pkg = pkg
        return e, nil
    }
}

//////

package packages

type defaultPackage struct {
    pkgs []string
}

func NewPackage(pkgs ...string) packages.Packager {
    seen := make(map[string]bool, len(pkgs))

    var p defaultPackage
    for _, pkg := range pkgs {
        if seen[pkg] {
            continue
        }
        p.pkgs = append(p.pkgs, p.pkgs)
        seen[pkg] = true
    }
    return &p
}

func (p *defaultPackage) Package() string {
    return strings.Join(p.pkgs, " ")
}

func (p *defaultPackage) ResolveCandidateNames(name string) []string {
    if strings.HasPrefix(name, ".") {
        return []string{name[1:]}
    }

    if len(p.pkg) == 0 {
        return []string{name}
    }

    for _, nextPkg := range p.pkgs {
        candidates := []string{nextPkg + "." + name}
        for i := strings.LastIndex(nextPkg, "."); i >= 0; i = strings.LastIndex(nextPkg, ".") {
            nextPkg = nextPkg[:i]
            candidates = append(candidates, nextPkg+"."+name)
        }
    }

    return append(candidates, name)
}
TristonianJones commented 4 years ago

Thanks for the detailed feature request. I know one of the pain points of working with proto in CEL is the requirement to specify nested type names, e.g.

Msg{
  nested: Nested{
     field: "hello"
  }
}

Given the type-inferencing capabilities of CEL one might (reasonably) expect the following to be supported:

Msg{
   nested: { // no type name required!
     field: "hello"
   }
}

The above isn't supported in part due to collision between map and message literal syntax that has proven difficult to undo from a design perspective because of its potential to subtly change evaluation behavior. However, I'm curious to know if complex object construction is your use case, as we've toyed with the following notation:

Msg{
   // magic '_' type name which gets resolved at check / runtime contextually.
   // error if the type of 'nested' is polymorphic, e.g. protobuf.Any as it can't be inferred
   // accurately.
   nested: _{ 
     field: "hello"
   }
}

The above has it's own issues, but this is as far as I've gotten with respect to trying to simplify this use case.

sfllaw commented 4 years ago

@TristonianJones, thanks for the quick reply.

I’m not talking about type inferencing: I’m talking about cel.Container and resolving identifiers:

Imagine we have a few protobuf enums: a.b.c.Color and d.Direction. Since cel.Container can only take one name, we have to choose between cel.NewEnv(cel.Container("a.b.c")) or cel.NewEnv(cel.Container("d") if we want to simplify the namespaces for CEL authors.

Arguably, we’d want to write cel.NewEnv(cel.Container("a.b.c", "d")) so that authors can simply write Color.RED and Direction.NORTH. I’d be happy to submit a pull request, if this approach is something you’d be likely to accept.

TristonianJones commented 4 years ago

My concern with this feature is that it introduces a lot of complexity at runtime for parse-only expressions.

Within a single container it's possible more than one identifier could be matched, but with the namespace resolution rules it's clear which identifier wins. If there were multiple containers, it would lead to ambiguous precedence rules for identifier resolution (at least from a user perspective, especially for parse-only expressions).

Part of the reason I asked about object construction is because there are ways in which we could unambiguously resolve the proto type name with less user input. If you're doing lots of comparisons with proto enums, then the next best tactic I've seen is to create a container alias map:

{"A": "really.long.pkg.container.path.Enum",
 "B": "other.really.long.pkg.container.path.Enum"}

From the language perspective, there is always at least one leading identifier before specifying an enum which makes resolution unambiguous, and the namespace resolution rules still apply. The trick is in manipulating the parsed output to replace the identifiers A and B with their fully qualified names in order to preserve the existing runtime and check semantics.

// Expanded to: msg.Status != really.long.pkg.container.path.Enum.ERROR
msg.Status != A.ERROR 

It's also possible that using such an approach could generate strange error messages at check time which wouldn't relate exactly to user-entered source, but the tradeoff might be worth it.

sfllaw commented 4 years ago

My suggestion is to extend cel.Container so that it also has a well-defined order. Given cel.Container("a.b.c", "d"), a call to ResolveCandidateNames("Direction.NORTH") should result in: a.b.c.Direction.NORTH, a.b.Direction.NORTH, a.Direction.NORTH, d.Direction.NORTH, Direction.NORTH. I agree that any ambiguity would be disastrous.

I also note that this proposal preserves the fact that the last cel.Container option wins, when passed to cel.NewEnv. I don’t know if that’s what you meant by “multiple containers”, but changing this would make it impossible to replace the container, and would also be backwards incompatible.

We considered using cel.Globals to implement an alias map but discovered that this is quite clunky. We found that doing this required reflecting across a lot of our protobuf definitions. Are you suggesting some other mechanism? It seems unfortunate that we can’t use the existing packages.Packager machinery to do this.

Are you worried that people will add too many package names to packages.Packager, which would slow down name resolution? We have hacked together a prototype multiPackage implementation that we’ve monkeypatched in. This worked well, although we abandoned it in favor of working with you upstream.

TristonianJones commented 4 years ago

@JimLarson thoughts? The proposed feature would augment the proto namespace resolution logic used in CEL today.

Pros

  1. @sfllaw has proposed a consistent way of evaluating namespaces with resolution occurring within the container order provided by the Container option.
  2. The change is forward compatible with the existing SDK API (varargs instead of single string arg).
  3. The feature solves a particularly nasty problem with cross-package proto references during object construction without doing post-parse expression rewrite like we've seen elsewhere.
  4. It's been implemented once and works. :)

Cons

  1. Added evaluation overhead for parse-only expressions.
  2. The C++ runtimes would need to be updated to support the feature before enabling it as an option in the Go due to the runtime implications of parse-only package resolution.
  3. Might lead to some confusion with respect to the fact that setting a Container overrides more than one namespace during Environment extension.

In the analysis, the Pros definitely outweigh the Cons with respect to the ease of use improvement, and I vastly prefer this to the package aliasing approach used in certain products. I think we'd just need to get serious about fixing the namespace resolution behavior in C++ before we consider enabling it for go. Maybe add a 'CompatMode' that checks for things that are not cross-language compatible if we choose to do it in Go first?

Curious to hear your thoughts.

TristonianJones commented 4 years ago

@sfllaw I just discussed this with the CEL language council and we're generally in favor, but want to take some time to figure out a plan of attack as this could affect other evaluator implementations and we want to make sure we've covered all our bases. I should have an answer for you by May 1.

Thanks for the FR and for your patience. :)

JimLarson commented 4 years ago

Let's call the feature "container paths", analogous to the shell's search path for binaries, compiler library paths, etc.

We already have container paths, in a limited way, in that specifying container "a.b.c" gives us a search path of "a.b.c", "a.b", and "a". If we allow the container path of "a.b.c:x.y.z" (to borrow the shell notation), then we need to decide on the relative order of the expansions. The simplest would be a.b.c, a.b, a, x.y.z, x.y, x. However in the case where we specify "a.b.c:a.d.e", and we have messages "a.Foo" and "a.d.e.Foo", it might be surprising that "a.Foo" gets chosen.

Perhaps with container paths available we wouldn't need the implicit paths of shortened prefixes - you just list them explicitly as you desire them to avoid surprises.

I'm not too concerned about the added parse-time overhead - those who are sensitive to the performance delta should be using the checker anyhow.

Package aliasing vs search path: I agree that aliasing has some benefits for ease of implementation. But I can see cases where I'd really want a "using" directive so that I can write "Foo" instead of "a.Foo".

Question: do containers and container paths affect qualified function name resolution? Should they? Since it's a different namespace (in the Lisp-1 vs Lisp-2 sense), it's not clear that it would be used the same way. Of course, you can say the same about variable names vs proto message names.

sfllaw commented 4 years ago

We already have container paths, in a limited way, in that specifying container "a.b.c" gives us a search path of "a.b.c", "a.b", and "a". If we allow the container path of "a.b.c:x.y.z" (to borrow the shell notation), then we need to decide on the relative order of the expansions. The simplest would be a.b.c, a.b, a, x.y.z, x.y, x. However in the case where we specify "a.b.c:a.d.e", and we have messages "a.Foo" and "a.d.e.Foo", it might be surprising that "a.Foo" gets chosen.

The two alternatives are:

The depth-first search that I proposed is probably easiest to explain. The breadth-first search is probably more intuitive. But this seems like it would be a matter of perspective: some people will get confused either way.

Perhaps with container paths available we wouldn't need the implicit paths of shortened prefixes - you just list them explicitly as you desire them to avoid surprises.

This suggestion is an interesting one. It would mean that cel.Container("a.b.c") would actually return cel.ContainerPath("a.b.c", "a.b", "a") to preserve backwards compatibility.

Question: do containers and container paths affect qualified function name resolution? Should they? Since it's a different namespace (in the Lisp-1 vs Lisp-2 sense), it's not clear that it would be used the same way. Of course, you can say the same about variable names vs proto message names.

It appears that containers already affect function resolution already. If we implement this using the packages.Packager interface, the existing behavior will be preserved and the container path behavior will be an obvious extension: https://github.com/google/cel-go/blob/c4c3df541e541a8cc2fa0b5760fc88cf29f5e58d/checker/env.go#L129-L138

TristonianJones commented 4 years ago

@JimLarson I'm not really sure how I feel about ContainerPath since it could be used to subvert the typical understanding of a namespace and make programs harder to reason about.

Granted, if we were to try to remain compatible with C++ namespace resolution, we would permit a using directive:

env, err := cel.NewEnv(
  cel.Container('co.acme.apps.hr',
     // Everything after the first container is a 'using' directive.
     // Alternative, make a cel.Using EnvOption.
     'co.example.acquisition.salaries',
     'org.math.stats.avg'),
),
// Compiles to:
//    org.math.stats.avg(
//        co.acme.apps.hr.readAll(
//            co.example.acquisition.salaries.Query{title: "manager"}))
env.Compile(`avg(readAll(salaries.Query{title:"manager"}))`)

In the above example, Container resolution still behaves as before, but avg and salaries have now been added to the co.acme.apps namespace. This mirrors how C++ namespacing works, but without all of the complex rules around using directive scoping. Following in the foosteps of C++ would add more complexity in that the 'using' style declarations would also need to support resolution of the following qualified names to their original location:

// Must compile to the same result as the previous example.
co.acme.apps.hr.avg(readAll(co.acme.apps.hr.salaries.Query{title: "manager"}))

Pro:

Con:

My suggestion is to extend cel.Container so that it also has a well-defined order. Given cel.Container("a.b.c", "d"), a call to ResolveCandidateNames("Direction.NORTH") should result in: a.b.c.Direction.NORTH, a.b.Direction.NORTH, a.Direction.NORTH, d.Direction.NORTH, Direction.NORTH

With C++ semantics, the original search space would be preserved and the symbol Direction would not be found unless specifically added as a using directive: cel.Container('a.b.c', 'd.Direction')

Thoughts?

sfllaw commented 4 years ago

In the above example, Container resolution still behaves as before, but avg and salaries have now been added to the co.acme.apps namespace. This mirrors how C++ namespacing works, but without all of the complex rules around using directive scoping. Following in the foosteps of C++ would add more complexity in that the 'using' style declarations would also need to support resolution of the following qualified names to their original location:

// Must compile to the same result as the previous example.
co.acme.apps.hr.avg(readAll(co.acme.apps.hr.salaries.Query{title: "manager"}))

Injecting new symbols into an existing namespace seems like it would be prone to error. It adds org.math.stats.avg as an alias to co.acme.apps.hr.avg, which is not how I would expect this to work. Also, I cannot think of a reasonable way to deal with a namespace conflict if co.acme.apps.hr actually added an avg implementation.

TristonianJones commented 4 years ago

@sfllaw my framing here is somewhat as a devil's advocate. In practice, we shouldn't be innovating where there are existing standards, especially since CEL's already tried to align with C++ here so I've tried to reason in the manner most consistent with the prior art CEL followed.

Namespaces are tricky. I don't really want to say "CEL containers are like C++ namespaces, except when they're not". Because all identifiers for variables, types, and functions are declared explicitly, it should be trivial to detect and raise collisions. @JimLarson @sfllaw thoughts?

sfllaw commented 4 years ago

In practice, we shouldn't be innovating where there are existing standards, especially since CEL's already tried to align with C++ here so I've tried to reason in the manner most consistent with the prior art CEL followed.

I agree with this in principle, but in this instance, I don’t think that containers look like C++ namespaces with using because the following CEL:

// cel.NewEnv(
//  cel.Container("co.acme.apps.hr",
//      "co.example.acquisition.salaries",
//      "org.math.stats.avg"))
hr.avg(hr.readAll(hr.salaries.Query{title: "manager"}))

does not translate to:

using co::acme::apps::hr {
  using co::example::acquisition::salaries;
  using org::math::stats::avg;
  hr::salaries::Query query;
  query.set_title("manager");
  hr::avg(hr::readAll(query));
}

since C++ won’t walk up the parent namespaces, while CEL does. I think it’s more important that CEL is consistent with itself, over being consistent with C++. It would be weird if the container did name resolution but a cel.Using directive did not.

Namespaces are tricky. I don't really want to say "CEL containers are like C++ namespaces, except when they're not".

I suspect that “CEL containers are like C++ namespaces” is inaccurate, even with the current spec? @TristonianJones, are you arguing that CEL changes so that it this statement is true? In which case, this might be the backwards-compatible way to do this, although the name resolution behavior will still be weird.

func Container(pkg string) EnvOption {
    return func(e *Env) (*Env, error) {
        e.pkg = packages.NewPackage(pkg)
        fields := strings.Split(pkg, ".")
        for i := len(fields) - 2; i != 0; i-- {
            Using(strings.Join(fields[:i], "."), nil)
        return nil
    }
}

func Using(name, alias string) EnvOption {
    return func(e *Env) (*Env, error) {
        e.pkg.Using(name, alias)
        return nil
    }
}

func (p defaultPackage) Using(name, alias string) {
    if alias == nil {
        alias = name[strings.LastIndex(name, "."):]
    }
    p.using[alias] = name
}
TristonianJones commented 4 years ago

Hi @sfllaw, the cel.Container is specified to match and implemented in terms of C++ namespace resolution rules. I was suggesting that as an alternative to having multiple containers, we could consider introducing a using directive as it would solve similar usability concerns and has shared design ancestry with CEL containers.

My expectation is that the following example ...

// cel.NewEnv(
//  cel.Container("co.acme.hr",
//              // using directives
//      "co.example.acquisition.salaries",
//      "org.math.stats.avg"))
hr.avg(hr.readAll(hr.salaries.Query{title: "manager"}))

... Maps to C++ that looks like this:

namespace co {
namespace acme {
namespace hr {
  using co::example::acquisition::salaries;
  using org::math::stats::avg;
  salaries::Query query;
  query.set_title("manager");
  avg(readAll(query));
} // namepace hr
} // namespace acme
} // namespace co

In C++, the variable resolution would resolve the following paths for avg:

Since co::acme::hr::avg is an alias to org::math::stats::avg, the type-checker should emit the non-aliased name. The fun part is when functions have different overloads, say org.math.stats.avg(list(int)) is defined and co.acme.avg(vector(double)) is defined, it's totally possible to refer to each overload with the using style directive, so we'd have to consider how this impacts overloading ... same is true for multiple namespaces.

sfllaw commented 4 years ago

As a counterexample, it is possible in CEL to make a container that resolves names deeper than a C++ namespace? For example, given the following protobuf definition:

package d;

message Compass {
  Direction direction = 1;

  enum Direction {
    UNKNOWN = 0;
    NORTH = 1;
    EAST = 2;
    SOUTH = 3;
    WEST = 4;
  }
}

we can ask the container to resolve names within the enum:

// cel.NewEnv(
//  cel.Container("d.Compass.Direction"),
//  cel.Types(&pb.Compass{}),
// )
Compass{direction: NORTH}

You might suggest that this is a bad way to use a container, but this is the current behavior. Given this is the case, does it really make sense to inject using directives into the enum’s “namespace”? Do enums even have a namespace? :thinking:

The mental model that I’ve been using for cel.Container is that it expands out into the following Python, if Python let you import classes and their members:

import d
import d.Compass
import d.Compass.Direction
TristonianJones commented 4 years ago

@sfllaw Hopefully, you feel like you're getting a deep look at how decisions are debated for CEL and appreciate that you've raised an interesting idea that we really want to pursue. It's raised some interesting points about how containers are spec'd even now:

  1. What is the purpose of permitting an ENUM name to be a container? Should we set stricter limits on container names?

Does it make sense to use an enum in a using declaration. Sure! Especially if you're trying to provide an upgrade path to use a new enum since proto supports an open enum set. Does it make sense to permit enum names as containers? No, but it probably doesn't really hurt much either. The feature is intended to improve the ease of use for nested proto messages, since nested messages take on the outer proto message name as part of their namespace. e.g.

env, _ := cel.NewEnv(cel.Container(`google.rpc.context.AttributeContext`))
// Resource is a nested message in `google.rpc.context.AttributeContext`
env.Compile(`Resource{name: 'test'}`)

At any rate, container names are more limited since it's difficult to tell a priori whether the container refers to an enum. And, if it does, the worst case is that a single enum value is accessible by its simple name. I feel like this is a problem which developers will self-solve which is why we haven't added more restrictions on its use.

  1. Should namespaced function overloads combine with overloads defined further up the container hierarchy?

PR #341 ensures that evaluation agrees with the existing behavior of the type-checker: first namespaced function found wins. It's a necessary first step, but I don't think it's sufficient, since a proper fix would also update the spec to discuss function overloading and namespace resolution (either the spec is wrong, or the impl, or both).

What I like about using directives is that it preserves a single hierarchy walk up the container when searching for identifiers. If we fix the function behavior to support namespaced overloads as part of a function definition, then it means the search cost for the overload set is limited to the number of components c in a qualified container name as opposed to m hierarchy walks for m containers.

sfllaw commented 4 years ago

What I like about using directives is that it preserves a single hierarchy walk up the container when searching for identifiers. If we fix the function behavior to support namespaced overloads as part of a function definition, then it means the search cost for the overload set is limited to the number of components c in a qualified container name as opposed to m hierarchy walks for m containers.

The performance characteristic of your proposal is unquestionably good. I also see the advantages for the implementation because “worse is better”.

I think my fundamental objection is that the using directive adds a new name to an existing namespace, which makes things less clear for users: if someone reads to hr.salaries.Query or co.acme.hr.salaries.Query in a rule program, they won’t find it in the protobuf definition when they look at package co.acme.hr because that symbol has been injected into the container’s namespace. In addition, the Env is locked into this namespace forever.

Correct me if I’m wrong, but currently there is nothing that injects symbols into a container: cel.Declarations, cel.Macros, and cel.Types all add symbols into the global namespace. I think this property is good to preserve, because it avoids the problem above.

Perhaps we should circle back to the alias map idea? If first-class aliases were rooted in the global namespace, then resolving salaries.Query wouldn’t short-circuit as quickly as your proposal, but it would be exactly as expensive as resolving co.example.acquisition.salaries.Query:

env, err := cel.NewEnv(
    cel.Types(&spb.Query),
    cel.Declarations(
        decls.NewFunction("org.math.stats.average", ...),

        // Automatically derive the alias name:
        decls.NewAlias("co.example.acquisition.salaries")

        // Explicitly name the alias:
        decls.NewAliasAs("avg", "org.math.stats.average"),

        // Probably a misfeature: from org.math.stats import *
        // decls.NewAliasAll("org.math.stats"),
    ),
    cel.Container("co.acme.hr"),
)

// Compiles to:
//    org.math.stats.average(
//        co.acme.hr.readAll(
//            co.example.acquisition.salaries.Query{title: "manager"}))
env.Compile(`avg(readAll(salaries.Query{title:"manager"}))`)

// Parse error: hr.avg and hr.salaries not found
env.Compile(`hr.avg(hr.readAll(hr.salaries.Query{title:"manager"}))`)

Under this proposal, aliases would be shadowed by cel.Globals and cel.Program.Eval, matching the current behaviour.

TristonianJones commented 4 years ago

@sfllaw I'm still pondering the tradeoffs here. At this point I'll probably start creating a public Google Doc to capture the options.

JimLarson commented 4 years ago

I've had to learn more C++ than I've ever cared to in order to catch up on this conversation, but maybe that will pay off.

What I'm not hearing is a strong request for following two hierarchy paths at once: if "a.b.c" is the "primary" container, and we also want fast access to stuff under "x.y.z", we don't necessarily have to add "x.y" and "x" automatically. It's okay to have to add them explicitly when needed.

I like the idea of keeping the container an arbitrary string of dot-separated identifiers. CEL should only require declarations for what's being actually used. We should be able to specify a container "a.b.might.exist" and declare only the types in "a.b" if those are the only things we want to reference at the moment.

The C++ name resolution rules are quite complex, but it seems like certain using statements might throw errors based on name collisions. Despite Tristan's hesitation at partial compliance with a standard, I'd much prefer it if we tolerated collisions and resolved them through shadowing, rather than producing errors, so that we can be tolerant to the evolution of proto files.

So if we have the current Container("a.b.c") option and add a Using("x.y.z"...) option, then the name resolution can work by the path:

  1. Look up in the container (if any).
  2. Look up in each Using namespace, in order.
  3. Finally, look up in successively shorter prefixes of the container (if any).

This way, Container and Using can be used separately or together.

sfllaw commented 4 years ago

@JimLarson, I would personally avoid Using unless CEL wants to draw an analogy to the C++ using directive, with all of its implications. Personally, I find that C++ namespaces are subtle compared to other approaches.

Also, should Using even be an Env level construct? Since I couldn't distinguish them from variables or functions, I suggested that we make them a NewAlias in Env.declarations, which would work much like NewIdent. I suppose the only wrinkle is that I don't understand how aliasing a type would work, which you might want to do if a type gets renamed and you need to provide backwards compatibility.

Aliases also imply cycle-checking, so this is a disadvantage to this approach.

TristonianJones commented 4 years ago

@sfllaw @JimLarson I think I'd summarize the options under consideration as:

I need to organize the threads here into a standalone doc, but for as varied as the options appears, I think we're pretty close to agreeing on the desired scope and application of what we choose to introduce. I will try to follow up later tonight with a document link.

sfllaw commented 4 years ago

@TristonianJones I would point out that the import could also import the symbol into the global namespace, which would be accessible via .avg(). This would avoid the need to error on conflict.

TristonianJones commented 4 years ago

Hi @sfllaw,

After discussing the feature more with the rest of the language council, we're leaning toward aliasing as a post-parse step as a way of capturing the narrowest set of use cases supported by import and using without implying the broader functionalities of either. I've got an action item to revise a doc I wrote to this effect and I'll share the revised proposal with you later this week.

Thanks for your patience and for spurring such a wonderful conversation here.

Regards,

-Tristan

TristonianJones commented 4 years ago

I've shared the discussion doc with the cel-go-discuss@googlegroups.com list. If anyone is interested in carrying on the conversation, all members of the group have comment privileges: https://docs.google.com/document/d/1F8BuG4SNFUXYPHi7nXaQmDxUsWCRDr1PfF789KkcxMY/edit#heading=h.z94vuodaszr