tc39 / proposal-binary-ast

Binary AST proposal for ECMAScript
965 stars 23 forks source link

Explicitly annotating Scope and IdentifierName #60

Closed Yoric closed 6 years ago

Yoric commented 6 years ago

I'm not terribly familiar with ES6 modules, so if someone more familiar than me could take a look in particular at that part of the patch and tell me if I've done something meaningless, I'm very interested.

This PR replaces #42.

Yoric commented 6 years ago

Looks like, Identifier inside Asserted* is IdentifierDeclaration, and otherwise IdentifierReference.

I think so. I'm not 100% certain this is going to remain true forever, though. What do you think?

as long as we're using different interface hierarchy for Node and Asserted*, the distinction is already known without introducing IdentifierDeclaration/IdentifierReference. (this would depend on #56)

I don't follow.

arai-a commented 6 years ago

Looks like, Identifier inside Asserted* is IdentifierDeclaration, and otherwise IdentifierReference.

I think so. I'm not 100% certain this is going to remain true forever, though. What do you think?

all declaration should be covered by Asserted*, so at least it should be true for declaration. if we use reference for declaration syntax (given it refers to the declaration in Asserted*), that should also be true for reference.

as long as we're using different interface hierarchy for Node and Asserted*, the distinction is already known without introducing IdentifierDeclaration/IdentifierReference. (this would depend on #56)

I don't follow.

I meant, just using single Identifier and treating it as IdenfitierDeclaration if it appears under Asserted*, or IdentifierReference if it appears elsewhere would do the same thing. so I'm not sure how much this helps, compared to [Scope] which is not immediately obvious from the structure.

Yoric commented 6 years ago

Looks like, Identifier inside Asserted* is IdentifierDeclaration, and otherwise IdentifierReference.

I think so. I'm not 100% certain this is going to remain true forever, though. What do you think?

all declaration should be covered by Asserted*, so at least it should be true for declaration. if we use reference for declaration syntax (given it refers to the declaration in Asserted*), that should also be true for reference.

I can imagine one future in which we end storing relationships between identifiers (e.g. for alias analysis purposes), in which case the declaration of an Asserted* could possibly contain a reference to an existing identifier.

I'm not holding that this is a likely future, so if you think that it's unlikely and/or can be hinted in a different manner, I'm not going to insist on that bit :)

as long as we're using different interface hierarchy for Node and Asserted*, the distinction is already known without introducing IdentifierDeclaration/IdentifierReference. (this would depend on #56)

I don't follow.

I meant, just using single Identifier and treating it as IdenfitierDeclaration if it appears under Asserted*, or IdentifierReference if it appears elsewhere would do the same thing. so I'm not sure how much this helps, compared to [Scope] which is not immediately obvious from the structure.

Ok, I can understand that part :)

If we do this, we make the Asserted* magical wrt tokenization: we cannot, in general, go from an AST (as defined in the webidl) to a binary representation without hardcoding the fact that Asserted* are special nodes.

To do it without magic, we need to introduce a hint [AssertedScope] around all Asserted* to indicate that any Identifier in an [AssertedScope] is in fact an identifier definition. Or, alternatively, we turn these identifiers into IdentifierReference in the first place (with the label [IdentifierReference]). So far, it looks to me that the latter is more natural.

What do you think?

Yoric commented 6 years ago

After discussing with @arai-a, I have attempted to implement a Rust version that only has [IdentifierName] – rather than [IdentifierReference] and [IdentifierDeclaration] – but that ought to be able to determine, when inspecting an [IdentifierName], whether it's a reference (as visible in the user-level syntax) or a declaration (as visible in the Asserted*Scope).

@arai-a, @syg, any thoughts?

Yoric commented 6 years ago

I have added to the PR a patch that kind-of implements IdentifierReference vs. IdentifierDeclaration without support in webidl. This works in the encoder, but it doesn't in the decoder, because we don't have the magic to make the decoder understand that it should switch from IdentifierReference to IdentifierDeclaration when it enters an Asserted*Scope.

As far as I can tell, this would require either labeling [IdentifierReference]/[IdentifierDeclaration], as I've done above, or adding a label such as [Declarations] to Asserted*Scope and specifying that we're never, ever going to put an IdentifierReference in an Asserted*Scope.

Yoric commented 6 years ago

Competing PR: #62.

syg commented 6 years ago

I can imagine one future in which we end storing relationships between identifiers (e.g. for alias analysis purposes), in which case the declaration of an Asserted* could possibly contain a reference to an existing identifier.

I'm not holding that this is a likely future, so if you think that it's unlikely and/or can be hinted in a different manner, I'm not going to insist on that bit :)

I still believe fairly strongly that we shouldn't encode any information that is more complex to verify than "does this particular AST node (e.g. the node corresponding to a var x) exist?'

Yoric commented 6 years ago

I still believe fairly strongly that we shouldn't encode any information that is more complex to verify than "does this particular AST node (e.g. the node corresponding to a var x) exist?'

I have no agenda here, it was just an example.

syg commented 6 years ago

@Yoric I'm not sure I completely understand the goal here. Is this machinery (and the competing one in #62) so that Identifier nodes in Asserted* are distinct from Identifiers in the main AST?

Why is any explicit hint necessary? Both the encoder and the decoder should know the current AST node it's processing at any one time. It seems to me the logic that's encoded in this PR for particular fields of Asserted* and other nodes should be an implementation detail in the encoder/decoder when processing those nodes. For instance, when looking at a FormalParameter, you know that it's using a binding.

Edit: I'd like to avoid semantically empty hints at the spec level that encode complex relationships (like use-def in this case) -- I believe they'll become de facto implementation-defined behavior, which would be bad for interop.

Yoric commented 6 years ago

I'll try and be as clear as I can at 1am :)

Scenario: I'm writing tooling. In this case, an encoder and a decoder.

While looking at the code, I, as a tooling developer, see three kinds of identifier names:

  1. some are actually property names;
  2. some are part of Asserted*, so they are part of the metalevel;
  3. some are actual identifiers that are actually part of the user-facing syntax.

First, can I assume that we agree that 1. deserves its own distinct type? As far as I can tell, there is no relationship between {1.} and {2., 3.}, except that they share tokens.

Now, if I understand what you write, we agree that tooling should be aware of 2. and 3. Examples of use:

If we agree, the question is how do we make tooling aware of the relationship between 2. and 3.:

I feel that Option 1 is a spec smell. In particular, I feel that it would require me to hardcode in the container format a list of nodes. Hence these PRs.

syg commented 6 years ago

Thanks for the detailed answer.

I have the opposite intuition about spec smell: having these otherwise semantics-free hints is a spec smell to me. From experience of working with web standards, I really believe something that isn't fully specified with non-trivial implied semantics is a bad idea.

If I understand the concern correctly, would the tooling need also be addressed by fully specifying an actual predicate in the spec, but leaving its use to be up to the implementers? That is, how do you feel about specifying a predicate like IsIdentifierUse and IsIdentifierDeclaration that are Node -> Boolean? That way, you have the actual predicate you want fully defined not as a hint, and if tooling so wishes to use them, they can depend on something that is fully defined.

Edit: Some more thoughts. My proposed solution is isomorphic to yours, so the only real difference is what is actually encoded: what I'm proposing is nothing be concretely encoded, but there is a normative list of nodes the spec provides for the classification you want.

All in all I would still prefer to leave those predicates up to the implementations to define. The tooling developer is expected to know the nodes, because, well, they have to know the nodes and what JavaScript programs mean to do static analysis. It doesn't seem realistic (or a good idea) for analysis authors to write tooling without understanding JS semantics. AssertedScopes, while usable for extracting semantic information, encode a simple-to-verify, purely syntactic information: that there exists some declaration nodes.

Binary AST should remain encoding of surface syntax, especially with the "it's just more or less a surface encoding" tack that we'd like to try with the CDNs. This starts to feel like it's going beyond that.

@arai-a, @tschneidereit What're your thoughts?

arai-a commented 6 years ago

DeBruijn indexing

I still don't understand why it needs to be separate type for this case. "looking up the name and use it if exists, or define if not" would work I think? (if we want validation, it should know more about the JS semantics)

Then, given this is about compression, which will eventually become a part of the spec (with container format), we could defer this until the final compression algorithm/format is ready? (with any options, the same information will be embedded into the spec, as long as we use the info in the compression)

obfuscator hypothetical ahead of time constant propagators, inliners, etc.

having extra info about the scope won't help much for those use cases. so, stripping scope information beforehand and/or (re-)generating the scope information after the modification will be better, for simplicity.

debugger

how is it related to BinAST? do we expose the raw tree with the IDL to debugger?

Option 1

just introducing a base interface for Asserted* and hardcode its name in the implementation will work. no need to hardcode the list of all interfaces in the source. just enumerating all sub-interfaces of the base class (from webidl) should do. (auto-generated code will have the same information anyway) the base interface could solve https://github.com/binast/ecmascript-binary-ast/issues/56 with the option 1 there (https://github.com/binast/ecmascript-binary-ast/issues/56#issuecomment-419504989 )

Option 2

same as above. maybe introducing another annotation for the base interface? (which just declares that it's not part of actual AST) but in any case implementation should hardcode which annotation it is.

actually I was thinking this, when we were talking about https://github.com/binast/ecmascript-binary-ast/pull/62 , not adding the annotation to all interfaces.

Yoric commented 6 years ago

I still don't understand why it needs to be separate type for this case. "looking up the name and use it if exists, or define if not" would work I think?

Well, consider

Scope 1:
   instance of variable name a
   Scope 2:
       instance of variable name b
       instance of variable name a
       instance of variable name a

If you just look at instances, how do you know whether all three instances of a are the same variable or if it's two variables that just happen to share the same name? You know in which case you are depending on whether the second instance of variable name a is a declaration or a reference. I want to make it easy for tooling (starting with encoder/decoder) to find this out.

Then, given this is about compression, which will eventually become a part of the spec (with container format), we could defer this until the final compression algorithm/format is ready?

I think we meant to specify the container format somewhat separately from the spec. Right, @syg?

obfuscator hypothetical ahead of time constant propagators, inliners, etc.

having extra info about the scope won't help much for those use cases. so, stripping scope information beforehand and/or (re-)generating the scope information after the modification will be better, for simplicity.

Fair enough. Of course, they still need to be able to answer the question I gave as example on top of this reply.

debugger

how is it related to BinAST? do we expose the raw tree with the IDL to debugger?

That's something we'd like to do, yes.

Option 1

just introducing a base interface for Asserted* and hardcode its name in the implementation will work. no need to hardcode the list of all interfaces in the source. just enumerating all sub-interfaces of the base class (from webidl) should do. (auto-generated code will have the same information anyway) the base interface could solve #56 with the option 1 there (#56 (comment) )

Yes, this would work. I'm a bit wary of hardcoding, but hardcoding a single name is definitely better than hardcoding a changing list.

Option 2

same as above. maybe introducing another annotation for the base interface? (which just declares that it's not part of actual AST) but in any case implementation should hardcode which annotation it is.

Works for me. I have the feeling that hardcoding an annotation is exactly what annotations are for, so I'm ok with that :)

Yoric commented 6 years ago

If I understand the concern correctly, would the tooling need also be addressed by fully specifying an actual predicate in the spec, but leaving its use to be up to the implementers? That is, how do you feel about specifying a predicate like IsIdentifierUse and IsIdentifierDeclaration that are Node -> Boolean? That way, you have the actual predicate you want fully defined not as a hint, and if tooling so wishes to use them, they can depend on something that is fully defined.

It's a bit trickier to auto-generate/track changes between versions, but yes, that would work.

arai-a commented 6 years ago

I still don't understand why it needs to be separate type for this case. "looking up the name and use it if exists, or define if not" would work I think?

Well, consider

Scope 1:
   instance of variable name a
   Scope 2:
       instance of variable name b
       instance of variable name a
       instance of variable name a

If you just look at instances, how do you know whether all three instances of a are the same variable or if it's two variables that just happen to share the same name? You know in which case you are depending on whether the second instance of variable name a is a declaration or a reference. I want to make it easy for tooling (starting with encoder/decoder) to find this out.

thanks, makes sense :)

debugger

how is it related to BinAST? do we expose the raw tree with the IDL to debugger?

That's something we'd like to do, yes.

(maybe off-topic) Does it mean that we're going to add yet another parser? in current plan/implementation, we don't create on-memory AST which directly represents the WebIDL, but convert it into ParseNode tree, or just directly to bytecode.

Yoric commented 6 years ago

(maybe off-topic) Does it mean that we're going to add yet another parser? in current plan/implementation, we don't create on-memory AST which directly represents the WebIDL, but convert it into ParseNode tree, or just directly to bytecode.

Technically, the Firefox devtools debugger already has another parser. We would like to unify its in-memory data structures with the webidl. But yeah, probably off-topic :)

Yoric commented 6 years ago

I realize that there is a different way to think of a subset of this issue.

In a type-safe language, I should be able to define new types, with the same behavior, but different, say, namespaces. Something like newtype Foo = Foo Bar in Haskell or struct Foo(Bar) in Rust.

Let's assume for an instant that we have this newtype in webidl.

What I want is:

newtype IdentifierDeclaration = IdentifierName;
newtype IdentifierReference = IdentifierName;
newtype PropertyKey = IdentifierName;

plus the information on scope boundaries materialized by [Scope] in this PR (see #63 for a variant).

I think that it's reasonable to expect that three different types have different namespaces. In particular, this makes my life easier for tooling.

Questions, part 1

1.1. Do we agree that these three different uses of IdentifierName are distinct and that giving them different types would be meaningful/useful? 1.2. If we do, how do we express it?

Option 1: One hint per newtype

This is PR #60.

typedef [IdentifierDeclaration] IdentifierName IdentifierDeclaration;
typedef [IdentifierReference] IdentifierName IdentifierReference;
typedef [PropertyKey] IdentifierName PropertyKey;

Option 2: One interface per newtype

interface IdentifierDeclaration {
  value: IdentifierName
}

interface IdentifierReference {
  value: IdentifierName
}

interface PropertyKey {
  value: IdentifierName
}

This has the benefit that we don't need to add any kind of hint and the drawback that it makes it possible to put a IdentifierDeclaration/IdentifierReference/PropertyKey in a sum, which may make encoding more complicated. With entropy coding, we should be ok wrt size, though.

A variant could be

[NoTag]
interface IdentifierDeclaration {
  value: IdentifierName
}

[NoTag]
interface IdentifierReference {
  value: IdentifierName
}

[NoTag]
interface PropertyKey {
  value: IdentifierName
}

where [NoTag] means that the interface doesn't have any RTTI/cannot be summed.

Option 3: newtype

Actually introduce a newtype, as some kind of hint, for instance:

typedef [Namespace(IdentifierDeclaration)] IdentifierName IdentifierDeclaration;
typedef [Namespace(IdentifierReference)] IdentifierName IdentifierReference;
typedef [Namespace(PropertyKey)] IdentifierName PropertyKey;

where the label [Namespace(...)] is a hint that tooling may decide to use different representations for these strings.

Option 4: IsIdentifierDeclaration, IsIdentifierReference, IsPropertyKey

This is what I understand of @syg's idea above. I'm not sure how to best specify this, as this is contextual. I'm also not sure whether @syg is actually proposing this or if this is just an idea from the top of his head, so I'll let him explain if he wants to :)

Question, part 2

The [Scope] annotation (or its variant, the ScopeBoundary base interface in PR #63), serves to make it easy for tooling to determine where a scope ends. I have immediate applications in terms of encoding and, as mentioned above in more detail, I can see other near-term applications for obfuscators, devtools, etc.

2.1. Do we agree that a mechanism that makes it explicit for tooling where scope boundaries are crossed would be useful? 2.2. If so, how do we express it?

Option 1: Explicit annotation [Scope]

This is PR #60.

Every interface that crosses a scope boundary is labelled with [Scope].

Drawback: we need a new hint [Scope].

Option 2: Hardcoded interface name ScopeBoundary

This is PR #63.

Every interface that crosses a scope boundary inherits from an interface ScopeBoundary. This requires us to add some semantics to inheritance, but we can probably live with it.

Drawback: this either needs a new hint on ScopeBoundary or makes name "ScopeBoundary" magic.

Option 3: Hardcoded sum type ScopeBoundary

Functionally equivalent to Option 2, except we use a typesum instead of inheritance:

typedef (Module
  or EagerMethod
  or EagerSetter
  or ...
) ScopeBoundary;

The benefit wrt Option 2 is that we don't need to specify inheritance.

Drawback: this either needs a new hint or makes name "ScopeBoundary" magic.

@arai-a, @syg, @tschneidereit, what do you think?

edit Renumbered questions.

syg commented 6 years ago

If you just look at instances, how do you know whether all three instances of a are the same variable or if it's two variables that just happen to share the same name? You know in which case you are depending on whether the second instance of variable name a is a declaration or a reference. I want to make it easy for tooling (starting with encoder/decoder) to find this out.

It is up to the analysis in this case to track declarations across scopes. This is not a trivial thing to encode into the AST. Whether a declaration should introduce a name is at least as complicated as Annex B in sloppy mode, which I thought we agreed to support for maximum adoptability and to align with the "more or less a content encoding" take.

Do we agree that these three different uses of IdentifierName are distinct and that giving them different types would be meaningful/useful?

Answer to question 1:

For property keys, you're interested in the identifiers used on the right hand side of .? As distinguished to what are used in object literals (e.g. { propertyKey: e })? What's the difficulty around inspecting whether an IdentifierName is the member of StaticMemberExpression? I imagine encoders/decoders already have to keep that kind of context around anyways.

From what I understand for what you want to use the declaration/reference annotations for, I disagree that it is useful to distinguish declaration and reference. The main reason is the semantics for scopes isn't trivial, and encoding it as a technically uninterpreted hint is harmful. The main ecma262 spec should remain the source of truth for this semantics. To me, this is the same question as the FAQ's "Why not a semantic graph? Or why not go further? Why not types?"

Answer to question 2:

In alignment with my answer above, I prefer the implementation to implement the predicate itself. Seems a redundant thing to encode in the spec. I'm still very confused about the value add of encoding it -- any implementation can derive whether something is a scope boundary by enumerating all the interfaces that contain an Asserted*, no?

Yoric commented 6 years ago

It is up to the analysis in this case to track declarations across scopes. This is not a trivial thing to encode into the AST. Whether a declaration should introduce a name is at least as complicated as Annex B in sloppy mode, which I thought we agreed to support for maximum adoptability and to align with the "more or less a content encoding" take.

It was my understanding that the asserted scopes were designed to make it simpler.

For property keys, you're interested in the identifiers used on the right hand side of .? As distinguished to what are used in object literals (e.g. { propertyKey: e })?

Not quite. For me, both are LiteralPropertyNames, and I'm fine with them.

I'm realizing that my explanations on this point are a bit confusing because I'm myself a bit confused by the current spec. I see things that, in my mind, are LiteralPropertyNames, but are represented as IdentifierName:

As mentioned earlier, I'm shaky on modules, so please don't read too much in the last 4 from this list.

Is it possible that at least the first two should actually be LiteralPropertyNames, rather than IdentifierName, and that using IdentifierName in these places was just some kind of optimization? If so, part of my problem would be solved by replacing these IdentifierNames with LiteralPropertyName.

Anyway, my problem with these is that I see IdentifierName, and syntactically, these things are pretty clearly not identifiers. I want to handle them differently from identifiers, but they have the same type, and this gets in my way.

What's the difficulty around inspecting whether an IdentifierName is the member of StaticMemberExpression? I imagine encoders/decoders already have to keep that kind of context around anyways.

Currently, the scoping pass, which injects Asserted*, is the only one to actually look at that kind of context. Everything at the level of the container format – which includes the full command-line decoder – is pretty much oblivious to the context. At best, it uses the context for token/entropy prediction, without any understanding of that context.

The problem is not that there is a difficulty, but that there is magic involved. As far as I can tell, this would mean special-casing – in the specs of the container format itself – every node that contains an IdentifierName. Which means that whenever we add a new node with an IdentifierName, we need to change the specs of the container format.

What I'm proposing here is an automated one-stop mechanism that will let us automatically adapt the container format.

Seen from the perspective of the container format, what you were offering with IsPropertyKey, etc. is a non-automated spec that actually keeps such spec changes contained in one place, outside the spec of the container format. I can live with that.

From what I understand for what you want to use the declaration/reference annotations for, I disagree that it is useful to distinguish declaration and reference. The main reason is the semantics for scopes isn't trivial, and encoding it as a technically uninterpreted hint is harmful. The main ecma262 spec should remain the source of truth for this semantics. To me, this is the same question as the FAQ's "Why not a semantic graph? Or why not go further? Why not types?"

Fair enough.

I need to ponder this, because there are very good chances that I want this to happen in the encoding, and I'd loathe to reimplement ecma 262, especially since we already have lots of information in the Asserted scopes.

In alignment with my answer above, I prefer the implementation to implement the predicate itself. Seems a redundant thing to encode in the spec. I'm still very confused about the value add of encoding it -- any implementation can derive whether something is a scope boundary by enumerating all the interfaces that contain an Asserted*, no?

I have a few problems with this:

  1. We actually don't have a list of these Asserted*. Where does the tooling author find out that AssertedBlockScope is part of the list but AssertedDeclaredName isn't? If we head this way, we need to specify that list somewhere, possibly with one of the mechanisms in my previous post, or a variant thereof.
  2. Are we willing to write in stone that "A scope boundary is a node that contains a field which is part of the list of asserted scopes"? I don't find it ideal, but if we do, that's one way to solve the problem. Somehow annotating scope boundaries (see previous post) seems simpler to me.
syg commented 6 years ago

It was my understanding that the asserted scopes were designed to make it simpler.

My intention was to make it possible with the minimum amount of additional, need-to-be-verified information that is consistently interpreted across implementations. The actual semantics don't become any simpler, unfortunately, they are whatever ecma262 says.

A fair question here is: why is the declaration stuff in asserted scopes not too complicated, but use/def chains too complicated? The bar I've been using is a tradeoff among:

  1. Consistency across implementations.
  2. Simplicity of verification and ease of preserving the bijection between ecma262 and binary AST.
  3. The minimal amount needed to enable skipping of lazy functions given how engines are implemented.

The use/def annotations falls short of either 1 or 2 above.

If it's technically a hint but most implementations interprets it as intended (i.e., as encoding use/def information), then we have a de facto standard that's defined by the intersection of observable interop among implementations, which is undesirable. That doesn't meet my bar for 1.

If, OTOH, its intended semantics of use/def semantics is explicit, then verification becomes more complicated than what it is now. Maybe that's actually desired, but I'd like to be convinced some more, because it definitely makes it harder to preserve the bijection. For instance, we'll have a larger set of inexpressible in source text programs that are expressible in binary AST. So, my current thinking is that it doesn't meet my bar for 2 either.

Not quite. For me, both are LiteralPropertyNames, and I'm fine with them.

I'm realizing that my explanations on this point are a bit confusing because I'm myself a bit confused by the current spec. I see things that, in my mind, are LiteralPropertyNames, but are represented as IdentifierName:

They're different because JS doesn't allow the same things in those positions. {1: "foo"} is legal, but x.1 isn't. That's why the StaticMemberExpressions are IdentifierNames.

The problem is not that there is a difficulty, but that there is magic involved. As far as I can tell, this would mean special-casing – in the specs of the container format itself – every node that contains an IdentifierName. Which means that whenever we add a new node with an IdentifierName, we need to change the specs of the container format.

I think you're really asking to push this complexity into the spec. I don't think it's a good idea to even encode this bit in the encoding for the verification reasons I wrote above. I'd want an analysis to derive this result instead of verifying it.

Where does the tooling author find out that AssertedBlockScope is part of the list but AssertedDeclaredName isn't?

I've been assuming that tooling authors understand JS semantics, and that they understand that AssertedBlockScope designates a scope but a name doesn't.