google / cel-spec

Common Expression Language -- specification and binary representation
https://cel.dev
Apache License 2.0
2.93k stars 227 forks source link

Unbounded length for identifiers #413

Open hudlow opened 2 weeks ago

hudlow commented 2 weeks ago

I'd like to propose that identifiers be, with whatever caveats are necessary for backward compatibility, bounded to a reasonable length. The 63-character limit for DNS names would be great if we can get away with it — otherwise maybe 255?

CLOVIS-AI commented 2 weeks ago

Is there no bound at all currently? I can't find any in the spec 🤔

hudlow commented 2 weeks ago

I can't find one in the spec or the cel-go implementation.

Alfus commented 2 weeks ago

It looks like protobuf doesn't have a max identifier length, does json?

hudlow commented 2 weeks ago

JSON doesn't, but I would consider this a bug.

TristonianJones commented 2 weeks ago

Neither proto nor JSON define limits, though there are practical limits within different implementations. The guidance about operator counts is intended mostly to establish rough guidance on what parsers should reasonably be expected to support, though they may support more. I don't see a reason to be proscriptive about the identifier lengths at a CEL level since users could add custom validations that impose such limits per use-case.

That said, maybe I am misunderstanding the nature of the request? @hudlow could you elaborate a bit more about how having a limit would be helpful?

hudlow commented 2 weeks ago

@TristonianJones As I develop a parser, I have three goals in mind:

  1. Savvy expression authors are never surprised by undocumented limitations because the lines between valid and invalid expressions are crisp and unambiguous.
  2. The evaluation of untrusted expressions is truly safe and robust, because all undefined failure modes have been eliminated and airtight limitations are imposed well before any kind of catastrophic failure (like running out of memory).
  3. Different implementations can be configured to accept the exact same set of expressions such there is true portability—e.g., the ability to have parser-checkers with A and B implementations and evaluators with A and B implementations with no detectable inconsistency based on which implementations parse and check or evaluate an expression.

These are principles I apply to the design of any API, and I would have thought them to be consistent with CEL's values.

TristonianJones commented 2 weeks ago

CEL can, at times, inherit the limitations of the data types it is provided. Since CEL isn't a data definition language, I'm reluctant to place limits that alternative type systems might hit as it's not just used with JSON and proto. CEL is also not a self-contained runtime.

It' not possible to prevent all failure modes, nor it is possible to prevent running out of memory (at least not at the syntactic level) because CEL has no concept of its memory limits. These limits must be configured and CEL offers runtime controls for failing early based on the limitations of the host process.

Since CEL is flexible in its use, it would typically be up to the application to determine which practices beyond those described as part of CEL which are required to use the language and to be portable with their desired use case. Networking would have totally different limitations than an expert system, for example.

While I do like being able to describe limitations, and make guarantees about behavior, I feel as though what's described is a set of limitations for a particular use case which may not cover all uses of CEL.

hudlow commented 2 weeks ago

A few thoughts in response to that:

It' not possible to prevent all failure modes, nor it is possible to prevent running out of memory (at least not at the syntactic level) because CEL has no concept of its memory limits.

But it is possible to define all failure modes, and the failure mode of "if you try to use an identifier with a terabyte's worth of characters something will probably break" is undefined.

These limits must be configured and CEL offers runtime controls for failing early based on the limitations of the host process.

I think a key question is where do we specify what these runtime controls are such that (1) they offer sufficient control and (2) implementations provide them uniformly?

TristonianJones commented 2 weeks ago

Neither Java, Go, nor C++ declare an identifier character limit (neither do they seem to have practical limits on parsable numeric representations), but you are likely to encounter practical limits that trigger a blow-up with sufficiently long names. Conventional wisdom says this limit is 64KB in Java, but it's not a rule.

In terms of portability, CEL focuses on runtime portability of the parsed and type-checked expression meaning that expressions will evaluate the same way no matter which stack parsed them. I wouldn't recommend passing unparsed expressions between processes. For parsing and type-checking, we tend to have fewer controls, but the ones that exist are intended to prevent breakages due to pathological inputs. The Golang and Java CEL libraries also provide validators which can be used to sanity test the AST structure prior to deploying the expression. Most applications have a practical size limit either in terms of the number of expression nodes in the AST total program byte size, but there isn't a practical limit to what this might be from CEL's perspective. We've seen some 500KB expressions, for example. Personally, I think 64KB is much larger than most CEL expressions should ever be.

For runtime controls, most are configurable through options provided to the runtime builders. Most are similarly named, though there are some with slight behavior differences due to implementation choices that make exact parity difficult. Over time, I expect the name and behavior of these controls to become more and more standardized.

If CEL were to introduce parse-time limits for identifiers and numbers, it would have to be quite large, something like 64KB only because we'd be imposing a limit beyond what languages tend to define. Would that work for you? It's large, but not large enough to cause any practical problems on any platform.

hudlow commented 2 weeks ago

I'd definitely take 64KB over no limit, but I'd still like a standard option supported by every implementation. I am not sure I understand the assumptions that (1) Java, Go, and C++ got this right and (2) the decision for them is the right decision for CEL which is very intentionally constrained in other ways that they are not.

Anyway, I do appreciate the dialogue!

I wouldn't recommend passing unparsed expressions between processes.

I hope I can convince you that there is great utility to end-to-end portability. 🙂