Closed Yoric closed 6 years ago
Looks like, Identifier inside Asserted* is IdentifierDeclaration, and otherwise IdentifierReference.
I think so. I'm not 100% certain this is going to remain true forever, though. What do you think?
as long as we're using different interface hierarchy for Node and Asserted*, the distinction is already known without introducing IdentifierDeclaration/IdentifierReference. (this would depend on #56)
I don't follow.
Looks like, Identifier inside Asserted* is IdentifierDeclaration, and otherwise IdentifierReference.
I think so. I'm not 100% certain this is going to remain true forever, though. What do you think?
all declaration should be covered by Asserted*
, so at least it should be true for declaration.
if we use reference for declaration syntax (given it refers to the declaration in Asserted*
), that should also be true for reference.
as long as we're using different interface hierarchy for Node and Asserted*, the distinction is already known without introducing IdentifierDeclaration/IdentifierReference. (this would depend on #56)
I don't follow.
I meant, just using single Identifier
and treating it as IdenfitierDeclaration
if it appears under Asserted*
, or IdentifierReference
if it appears elsewhere would do the same thing.
so I'm not sure how much this helps, compared to [Scope] which is not immediately obvious from the structure.
Looks like, Identifier inside Asserted* is IdentifierDeclaration, and otherwise IdentifierReference.
I think so. I'm not 100% certain this is going to remain true forever, though. What do you think?
all declaration should be covered by
Asserted*
, so at least it should be true for declaration. if we use reference for declaration syntax (given it refers to the declaration inAsserted*
), that should also be true for reference.
I can imagine one future in which we end storing relationships between identifiers (e.g. for alias analysis purposes), in which case the declaration of an Asserted*
could possibly contain a reference to an existing identifier.
I'm not holding that this is a likely future, so if you think that it's unlikely and/or can be hinted in a different manner, I'm not going to insist on that bit :)
as long as we're using different interface hierarchy for Node and Asserted*, the distinction is already known without introducing IdentifierDeclaration/IdentifierReference. (this would depend on #56)
I don't follow.
I meant, just using single
Identifier
and treating it asIdenfitierDeclaration
if it appears underAsserted*
, orIdentifierReference
if it appears elsewhere would do the same thing. so I'm not sure how much this helps, compared to [Scope] which is not immediately obvious from the structure.
Ok, I can understand that part :)
If we do this, we make the Asserted*
magical wrt tokenization: we cannot, in general, go from an AST (as defined in the webidl) to a binary representation without hardcoding the fact that Asserted*
are special nodes.
To do it without magic, we need to introduce a hint [AssertedScope]
around all Asserted*
to indicate that any Identifier
in an [AssertedScope]
is in fact an identifier definition. Or, alternatively, we turn these identifiers into IdentifierReference
in the first place (with the label [IdentifierReference]
). So far, it looks to me that the latter is more natural.
What do you think?
After discussing with @arai-a, I have attempted to implement a Rust version that only has [IdentifierName]
– rather than [IdentifierReference]
and [IdentifierDeclaration]
– but that ought to be able to determine, when inspecting an [IdentifierName]
, whether it's a reference (as visible in the user-level syntax) or a declaration (as visible in the Asserted*Scope
).
[IdentifierReference]
/[IdentifierDeclaration]
or in a more roundabout way, by adding an equivalent hint to Asserted*Scope
.@arai-a, @syg, any thoughts?
I have added to the PR a patch that kind-of implements IdentifierReference
vs. IdentifierDeclaration
without support in webidl. This works in the encoder, but it doesn't in the decoder, because we don't have the magic to make the decoder understand that it should switch from IdentifierReference
to IdentifierDeclaration
when it enters an Asserted*Scope
.
As far as I can tell, this would require either labeling [IdentifierReference]
/[IdentifierDeclaration]
, as I've done above, or adding a label such as [Declarations]
to Asserted*Scope
and specifying that we're never, ever going to put an IdentifierReference
in an Asserted*Scope
.
Competing PR: #62.
I can imagine one future in which we end storing relationships between identifiers (e.g. for alias analysis purposes), in which case the declaration of an Asserted* could possibly contain a reference to an existing identifier.
I'm not holding that this is a likely future, so if you think that it's unlikely and/or can be hinted in a different manner, I'm not going to insist on that bit :)
I still believe fairly strongly that we shouldn't encode any information that is more complex to verify than "does this particular AST node (e.g. the node corresponding to a var x
) exist?'
I still believe fairly strongly that we shouldn't encode any information that is more complex to verify than "does this particular AST node (e.g. the node corresponding to a var x) exist?'
I have no agenda here, it was just an example.
@Yoric I'm not sure I completely understand the goal here. Is this machinery (and the competing one in #62) so that Identifier
nodes in Asserted*
are distinct from Identifiers
in the main AST?
Why is any explicit hint necessary? Both the encoder and the decoder should know the current AST node it's processing at any one time. It seems to me the logic that's encoded in this PR for particular fields of Asserted*
and other nodes should be an implementation detail in the encoder/decoder when processing those nodes. For instance, when looking at a FormalParameter
, you know that it's using a binding.
Edit: I'd like to avoid semantically empty hints at the spec level that encode complex relationships (like use-def in this case) -- I believe they'll become de facto implementation-defined behavior, which would be bad for interop.
I'll try and be as clear as I can at 1am :)
Scenario: I'm writing tooling. In this case, an encoder and a decoder.
While looking at the code, I, as a tooling developer, see three kinds of identifier names:
Asserted*
, so they are part of the metalevel;First, can I assume that we agree that 1. deserves its own distinct type? As far as I can tell, there is no relationship between {1.} and {2., 3.}, except that they share tokens.
Now, if I understand what you write, we agree that tooling should be aware of 2. and 3. Examples of use:
If we agree, the question is how do we make tooling aware of the relationship between 2. and 3.:
IdentifierName
children are part of 2, vs. all other nodes, whose IdentifierName
children are part of 3. I'm not sure how the tooling developer is expected to know this list of nodes, and update their tooling for future versions of BinAST, possibly by looking at some bits of the semantics? Do you have a clear idea?[IdentifierDeclaration]
(this PR), or [AssertedScope]
(#62), or another mechanism. Either has the benefit that we can derive the information automatically from the spec.I feel that Option 1 is a spec smell. In particular, I feel that it would require me to hardcode in the container format a list of nodes. Hence these PRs.
Thanks for the detailed answer.
I have the opposite intuition about spec smell: having these otherwise semantics-free hints is a spec smell to me. From experience of working with web standards, I really believe something that isn't fully specified with non-trivial implied semantics is a bad idea.
If I understand the concern correctly, would the tooling need also be addressed by fully specifying an actual predicate in the spec, but leaving its use to be up to the implementers? That is, how do you feel about specifying a predicate like IsIdentifierUse
and IsIdentifierDeclaration
that are Node -> Boolean
? That way, you have the actual predicate you want fully defined not as a hint, and if tooling so wishes to use them, they can depend on something that is fully defined.
Edit: Some more thoughts. My proposed solution is isomorphic to yours, so the only real difference is what is actually encoded: what I'm proposing is nothing be concretely encoded, but there is a normative list of nodes the spec provides for the classification you want.
All in all I would still prefer to leave those predicates up to the implementations to define. The tooling developer is expected to know the nodes, because, well, they have to know the nodes and what JavaScript programs mean to do static analysis. It doesn't seem realistic (or a good idea) for analysis authors to write tooling without understanding JS semantics. AssertedScopes, while usable for extracting semantic information, encode a simple-to-verify, purely syntactic information: that there exists some declaration nodes.
Binary AST should remain encoding of surface syntax, especially with the "it's just more or less a surface encoding" tack that we'd like to try with the CDNs. This starts to feel like it's going beyond that.
@arai-a, @tschneidereit What're your thoughts?
DeBruijn indexing
I still don't understand why it needs to be separate type for this case. "looking up the name and use it if exists, or define if not" would work I think? (if we want validation, it should know more about the JS semantics)
Then, given this is about compression, which will eventually become a part of the spec (with container format), we could defer this until the final compression algorithm/format is ready? (with any options, the same information will be embedded into the spec, as long as we use the info in the compression)
obfuscator hypothetical ahead of time constant propagators, inliners, etc.
having extra info about the scope won't help much for those use cases. so, stripping scope information beforehand and/or (re-)generating the scope information after the modification will be better, for simplicity.
debugger
how is it related to BinAST? do we expose the raw tree with the IDL to debugger?
Option 1
just introducing a base interface for Asserted*
and hardcode its name in the implementation will work. no need to hardcode the list of all interfaces in the source. just enumerating all sub-interfaces of the base class (from webidl) should do. (auto-generated code will have the same information anyway)
the base interface could solve https://github.com/binast/ecmascript-binary-ast/issues/56 with the option 1 there (https://github.com/binast/ecmascript-binary-ast/issues/56#issuecomment-419504989 )
Option 2
same as above. maybe introducing another annotation for the base interface? (which just declares that it's not part of actual AST) but in any case implementation should hardcode which annotation it is.
actually I was thinking this, when we were talking about https://github.com/binast/ecmascript-binary-ast/pull/62 , not adding the annotation to all interfaces.
I still don't understand why it needs to be separate type for this case. "looking up the name and use it if exists, or define if not" would work I think?
Well, consider
Scope 1:
instance of variable name a
Scope 2:
instance of variable name b
instance of variable name a
instance of variable name a
If you just look at instances, how do you know whether all three instances of a
are the same variable or if it's two variables that just happen to share the same name? You know in which case you are depending on whether the second instance of variable name a
is a declaration or a reference. I want to make it easy for tooling (starting with encoder/decoder) to find this out.
Then, given this is about compression, which will eventually become a part of the spec (with container format), we could defer this until the final compression algorithm/format is ready?
I think we meant to specify the container format somewhat separately from the spec. Right, @syg?
obfuscator hypothetical ahead of time constant propagators, inliners, etc.
having extra info about the scope won't help much for those use cases. so, stripping scope information beforehand and/or (re-)generating the scope information after the modification will be better, for simplicity.
Fair enough. Of course, they still need to be able to answer the question I gave as example on top of this reply.
debugger
how is it related to BinAST? do we expose the raw tree with the IDL to debugger?
That's something we'd like to do, yes.
Option 1
just introducing a base interface for
Asserted*
and hardcode its name in the implementation will work. no need to hardcode the list of all interfaces in the source. just enumerating all sub-interfaces of the base class (from webidl) should do. (auto-generated code will have the same information anyway) the base interface could solve #56 with the option 1 there (#56 (comment) )
Yes, this would work. I'm a bit wary of hardcoding, but hardcoding a single name is definitely better than hardcoding a changing list.
Option 2
same as above. maybe introducing another annotation for the base interface? (which just declares that it's not part of actual AST) but in any case implementation should hardcode which annotation it is.
Works for me. I have the feeling that hardcoding an annotation is exactly what annotations are for, so I'm ok with that :)
If I understand the concern correctly, would the tooling need also be addressed by fully specifying an actual predicate in the spec, but leaving its use to be up to the implementers? That is, how do you feel about specifying a predicate like
IsIdentifierUse
andIsIdentifierDeclaration
that areNode -> Boolean
? That way, you have the actual predicate you want fully defined not as a hint, and if tooling so wishes to use them, they can depend on something that is fully defined.
It's a bit trickier to auto-generate/track changes between versions, but yes, that would work.
I still don't understand why it needs to be separate type for this case. "looking up the name and use it if exists, or define if not" would work I think?
Well, consider
Scope 1: instance of variable name a Scope 2: instance of variable name b instance of variable name a instance of variable name a
If you just look at instances, how do you know whether all three instances of
a
are the same variable or if it's two variables that just happen to share the same name? You know in which case you are depending on whether the secondinstance of variable name a
is a declaration or a reference. I want to make it easy for tooling (starting with encoder/decoder) to find this out.
thanks, makes sense :)
debugger
how is it related to BinAST? do we expose the raw tree with the IDL to debugger?
That's something we'd like to do, yes.
(maybe off-topic)
Does it mean that we're going to add yet another parser?
in current plan/implementation, we don't create on-memory AST which directly represents the WebIDL, but convert it into ParseNode
tree, or just directly to bytecode.
(maybe off-topic) Does it mean that we're going to add yet another parser? in current plan/implementation, we don't create on-memory AST which directly represents the WebIDL, but convert it into
ParseNode
tree, or just directly to bytecode.
Technically, the Firefox devtools debugger already has another parser. We would like to unify its in-memory data structures with the webidl. But yeah, probably off-topic :)
I realize that there is a different way to think of a subset of this issue.
In a type-safe language, I should be able to define new types, with the same behavior, but different, say, namespaces. Something like newtype Foo = Foo Bar
in Haskell or struct Foo(Bar)
in Rust.
Let's assume for an instant that we have this newtype
in webidl.
What I want is:
newtype IdentifierDeclaration = IdentifierName;
newtype IdentifierReference = IdentifierName;
newtype PropertyKey = IdentifierName;
plus the information on scope boundaries materialized by [Scope]
in this PR (see #63 for a variant).
I think that it's reasonable to expect that three different types have different namespaces. In particular, this makes my life easier for tooling.
1.1. Do we agree that these three different uses of IdentifierName
are distinct and that giving them different types would be meaningful/useful?
1.2. If we do, how do we express it?
This is PR #60.
typedef [IdentifierDeclaration] IdentifierName IdentifierDeclaration;
typedef [IdentifierReference] IdentifierName IdentifierReference;
typedef [PropertyKey] IdentifierName PropertyKey;
interface IdentifierDeclaration {
value: IdentifierName
}
interface IdentifierReference {
value: IdentifierName
}
interface PropertyKey {
value: IdentifierName
}
This has the benefit that we don't need to add any kind of hint and the drawback that it makes it possible to put a IdentifierDeclaration
/IdentifierReference
/PropertyKey
in a sum, which may make encoding more complicated. With entropy coding, we should be ok wrt size, though.
A variant could be
[NoTag]
interface IdentifierDeclaration {
value: IdentifierName
}
[NoTag]
interface IdentifierReference {
value: IdentifierName
}
[NoTag]
interface PropertyKey {
value: IdentifierName
}
where [NoTag]
means that the interface doesn't have any RTTI/cannot be summed.
newtype
Actually introduce a newtype
, as some kind of hint, for instance:
typedef [Namespace(IdentifierDeclaration)] IdentifierName IdentifierDeclaration;
typedef [Namespace(IdentifierReference)] IdentifierName IdentifierReference;
typedef [Namespace(PropertyKey)] IdentifierName PropertyKey;
where the label [Namespace(...)]
is a hint that tooling may decide to use different representations for these strings.
IsIdentifierDeclaration
, IsIdentifierReference
, IsPropertyKey
This is what I understand of @syg's idea above. I'm not sure how to best specify this, as this is contextual. I'm also not sure whether @syg is actually proposing this or if this is just an idea from the top of his head, so I'll let him explain if he wants to :)
The [Scope]
annotation (or its variant, the ScopeBoundary
base interface in PR #63), serves to make it easy for tooling to determine where a scope ends. I have immediate applications in terms of encoding and, as mentioned above in more detail, I can see other near-term applications for obfuscators, devtools, etc.
2.1. Do we agree that a mechanism that makes it explicit for tooling where scope boundaries are crossed would be useful? 2.2. If so, how do we express it?
[Scope]
This is PR #60.
Every interface that crosses a scope boundary is labelled with [Scope]
.
Drawback: we need a new hint [Scope]
.
ScopeBoundary
This is PR #63.
Every interface that crosses a scope boundary inherits from an interface ScopeBoundary
. This requires us to add some semantics to inheritance, but we can probably live with it.
Drawback: this either needs a new hint on ScopeBoundary
or makes name "ScopeBoundary"
magic.
ScopeBoundary
Functionally equivalent to Option 2, except we use a typesum instead of inheritance:
typedef (Module
or EagerMethod
or EagerSetter
or ...
) ScopeBoundary;
The benefit wrt Option 2 is that we don't need to specify inheritance.
Drawback: this either needs a new hint or makes name "ScopeBoundary"
magic.
@arai-a, @syg, @tschneidereit, what do you think?
edit Renumbered questions.
If you just look at instances, how do you know whether all three instances of a are the same variable or if it's two variables that just happen to share the same name? You know in which case you are depending on whether the second instance of variable name a is a declaration or a reference. I want to make it easy for tooling (starting with encoder/decoder) to find this out.
It is up to the analysis in this case to track declarations across scopes. This is not a trivial thing to encode into the AST. Whether a declaration should introduce a name is at least as complicated as Annex B in sloppy mode, which I thought we agreed to support for maximum adoptability and to align with the "more or less a content encoding" take.
Do we agree that these three different uses of IdentifierName are distinct and that giving them different types would be meaningful/useful?
Answer to question 1:
For property keys, you're interested in the identifiers used on the right hand side of .
? As distinguished to what are used in object literals (e.g. { propertyKey: e }
)? What's the difficulty around inspecting whether an IdentifierName
is the member of StaticMemberExpression
? I imagine encoders/decoders already have to keep that kind of context around anyways.
From what I understand for what you want to use the declaration/reference annotations for, I disagree that it is useful to distinguish declaration and reference. The main reason is the semantics for scopes isn't trivial, and encoding it as a technically uninterpreted hint is harmful. The main ecma262 spec should remain the source of truth for this semantics. To me, this is the same question as the FAQ's "Why not a semantic graph? Or why not go further? Why not types?"
Answer to question 2:
In alignment with my answer above, I prefer the implementation to implement the predicate itself. Seems a redundant thing to encode in the spec. I'm still very confused about the value add of encoding it -- any implementation can derive whether something is a scope boundary by enumerating all the interfaces that contain an Asserted*
, no?
It is up to the analysis in this case to track declarations across scopes. This is not a trivial thing to encode into the AST. Whether a declaration should introduce a name is at least as complicated as Annex B in sloppy mode, which I thought we agreed to support for maximum adoptability and to align with the "more or less a content encoding" take.
It was my understanding that the asserted scopes were designed to make it simpler.
For property keys, you're interested in the identifiers used on the right hand side of
.
? As distinguished to what are used in object literals (e.g.{ propertyKey: e }
)?
Not quite. For me, both are LiteralPropertyName
s, and I'm fine with them.
I'm realizing that my explanations on this point are a bit confusing because I'm myself a bit confused by the current spec. I see things that, in my mind, are LiteralPropertyName
s, but are represented as IdentifierName
:
StaticMemberAssignmentTarget::property
;StaticMemberExpression::property
;ImportSpecifier::name
;ExportSpecifier::name
;ExportSpecifier::exportedName
;ExportLocalSpecifier::exportedName
;As mentioned earlier, I'm shaky on modules, so please don't read too much in the last 4 from this list.
Is it possible that at least the first two should actually be LiteralPropertyName
s, rather than IdentifierName
, and that using IdentifierName
in these places was just some kind of optimization? If so, part of my problem would be solved by replacing these IdentifierName
s with LiteralPropertyName
.
Anyway, my problem with these is that I see IdentifierName
, and syntactically, these things are pretty clearly not identifiers. I want to handle them differently from identifiers, but they have the same type, and this gets in my way.
What's the difficulty around inspecting whether an
IdentifierName
is the member ofStaticMemberExpression
? I imagine encoders/decoders already have to keep that kind of context around anyways.
Currently, the scoping pass, which injects Asserted*
, is the only one to actually look at that kind of context. Everything at the level of the container format – which includes the full command-line decoder – is pretty much oblivious to the context. At best, it uses the context for token/entropy prediction, without any understanding of that context.
The problem is not that there is a difficulty, but that there is magic involved. As far as I can tell, this would mean special-casing – in the specs of the container format itself – every node that contains an IdentifierName
. Which means that whenever we add a new node with an IdentifierName
, we need to change the specs of the container format.
What I'm proposing here is an automated one-stop mechanism that will let us automatically adapt the container format.
Seen from the perspective of the container format, what you were offering with IsPropertyKey
, etc. is a non-automated spec that actually keeps such spec changes contained in one place, outside the spec of the container format. I can live with that.
From what I understand for what you want to use the declaration/reference annotations for, I disagree that it is useful to distinguish declaration and reference. The main reason is the semantics for scopes isn't trivial, and encoding it as a technically uninterpreted hint is harmful. The main ecma262 spec should remain the source of truth for this semantics. To me, this is the same question as the FAQ's "Why not a semantic graph? Or why not go further? Why not types?"
Fair enough.
I need to ponder this, because there are very good chances that I want this to happen in the encoding, and I'd loathe to reimplement ecma 262, especially since we already have lots of information in the Asserted scopes.
In alignment with my answer above, I prefer the implementation to implement the predicate itself. Seems a redundant thing to encode in the spec. I'm still very confused about the value add of encoding it -- any implementation can derive whether something is a scope boundary by enumerating all the interfaces that contain an Asserted*, no?
I have a few problems with this:
Asserted*
. Where does the tooling author find out that AssertedBlockScope
is part of the list but AssertedDeclaredName
isn't? If we head this way, we need to specify that list somewhere, possibly with one of the mechanisms in my previous post, or a variant thereof.It was my understanding that the asserted scopes were designed to make it simpler.
My intention was to make it possible with the minimum amount of additional, need-to-be-verified information that is consistently interpreted across implementations. The actual semantics don't become any simpler, unfortunately, they are whatever ecma262 says.
A fair question here is: why is the declaration stuff in asserted scopes not too complicated, but use/def chains too complicated? The bar I've been using is a tradeoff among:
The use/def annotations falls short of either 1 or 2 above.
If it's technically a hint but most implementations interprets it as intended (i.e., as encoding use/def information), then we have a de facto standard that's defined by the intersection of observable interop among implementations, which is undesirable. That doesn't meet my bar for 1.
If, OTOH, its intended semantics of use/def semantics is explicit, then verification becomes more complicated than what it is now. Maybe that's actually desired, but I'd like to be convinced some more, because it definitely makes it harder to preserve the bijection. For instance, we'll have a larger set of inexpressible in source text programs that are expressible in binary AST. So, my current thinking is that it doesn't meet my bar for 2 either.
Not quite. For me, both are LiteralPropertyNames, and I'm fine with them.
I'm realizing that my explanations on this point are a bit confusing because I'm myself a bit confused by the current spec. I see things that, in my mind, are LiteralPropertyNames, but are represented as IdentifierName:
They're different because JS doesn't allow the same things in those positions. {1: "foo"}
is legal, but x.1
isn't. That's why the StaticMemberExpression
s are IdentifierName
s.
The problem is not that there is a difficulty, but that there is magic involved. As far as I can tell, this would mean special-casing – in the specs of the container format itself – every node that contains an IdentifierName. Which means that whenever we add a new node with an IdentifierName, we need to change the specs of the container format.
I think you're really asking to push this complexity into the spec. I don't think it's a good idea to even encode this bit in the encoding for the verification reasons I wrote above. I'd want an analysis to derive this result instead of verifying it.
Where does the tooling author find out that AssertedBlockScope is part of the list but AssertedDeclaredName isn't?
I've been assuming that tooling authors understand JS semantics, and that they understand that AssertedBlockScope designates a scope but a name doesn't.
I'm not terribly familiar with ES6 modules, so if someone more familiar than me could take a look in particular at that part of the patch and tell me if I've done something meaningless, I'm very interested.
This PR replaces #42.