Consider using reference semantics the default for struct/array types

tschneidereit / typed-objects-explainer

Old, outdated explainer for Typed Objects and related specs. See link for new proposal

https://github.com/tschneidereit/proposal-typed-objects


Consider using reference semantics the default for struct/array types #7

Open ghost opened 10 years ago

ghost commented 10 years ago

If people use Typed Objects as "better JS objects", then most non-primitive fields will be references, not embedded sub-objects. Thus, we should consider making reference the default, that is:

var A = new StructType(...);
var B = new StructType({a:A});
var a = new A(...);
var b = new B({a:a});
assert(b.a === a);

and, to get an embedded subobject, you'd derive an "embedded" type definition:

var Point = new StructType({x:int32, y:int32});
var Line = new StructType({x:Point.embed(), y:Point.embed()});

One consequence of this change is that we'd need to overload a type definition's constructor based on whether it was called with 'new' or not. That is, 'Point(p)' would either return p (if p was a Point or extended Point) or throw, and 'new Point({x:1, y:2})' would construct a new Point object. In some sense this is similar to other primitive constructors like String and Number.
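As a rough sketch of that call/construct split in plain JS: the function below is a hypothetical stand-in for a struct type (not the proposed StructType API) and uses `new.target` to tell the two cases apart. The int32 coercion via `| 0` is an assumption about how an int32 field would behave.

```javascript
// Hypothetical stand-in for a struct type; NOT the proposed StructType API.
function Point(arg) {
  if (new.target === undefined) {
    // Called as Point(p): act as a checked cast and never allocate.
    if (arg instanceof Point) return arg; // also accepts subclasses of Point
    throw new TypeError('Point(p): p is not a Point');
  }
  // Called as new Point({...}): construct a fresh instance.
  this.x = arg.x | 0; // assumed int32 coercion
  this.y = arg.y | 0;
}

const p = new Point({ x: 1, y: 2 });
const same = Point(p); // returns p itself, no copy

let threw = false;
try {
  Point({ x: 1, y: 2 }); // plain object is not a Point, so this throws
} catch (e) {
  threw = e instanceof TypeError;
}
```

Here `Point(p)` behaves like a cast, while `new Point(...)` always allocates, which is roughly the String/Number analogy.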

dherman commented 10 years ago

So to get the same behavior without reference structs, you would simply make the reference type be object, and then the subobject could itself be a struct:

var Point = new StructType({ x: int32, y: int32 });
var PackedLine = new StructType({ start: Point, end: Point });
var SharedLine = new StructType({ start: object, end: object });

So does this alternate design provide additional performance benefits, because you get richer type information about heap references? In the past I think @andhow said engines could do that automatically so object was sufficient.

As for ergonomics, there's definitely appeal to your proposal since the defaults match the expected use case. It is bothering me a little that the concept of an "embedded struct type" is a little confusing, although I think the existing system already confused people in its inconsistency with existing practice.

Would it maybe make sense for inline struct types not to actually be full types, in that they could only be embedded in other types but not themselves constructed or given methods etc? So

new StructType({ start: Point.embed(), end: Point.embed() })

is legal but not

var EmbeddedPoint = Point.embed();
var x = new EmbeddedPoint(); // error: wat are you doing??

Also, bikeshed: s/embed/inline/? And another bikeshed: just make it a getter that produces a memoized singleton instead of requiring a function call? I.e.:

var PackedLine = new StructType({ start: Point.inline, end: Point.inline });
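A minimal sketch of what a memoized getter could look like, assuming a hypothetical `TypeDescriptor` class; the shape of the object it returns is made up purely for illustration.

```javascript
// Hypothetical type descriptor; only the memoized `inline` getter matters here.
class TypeDescriptor {
  #inline = null; // cached inline variant, created on first access

  constructor(name) {
    this.name = name;
  }

  get inline() {
    if (this.#inline === null) {
      // Made-up representation of "the inline variant of this type".
      this.#inline = Object.freeze({ kind: 'inline', of: this });
    }
    return this.#inline;
  }
}

const Point = new TypeDescriptor('Point');
const first = Point.inline;
const second = Point.inline; // same object as `first`
```

Because the getter always hands back the same object, two fields declared with `Point.inline` would share one type identity.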

Dave

ghost commented 10 years ago

With some significant extensions to TI (adding a notion of subtyping), I think SM's TI system could do a pretty good job avoiding these guards in JIT code. asm.js, otoh, would have major problems since it would have to handle the case that an unexpected type was assigned to an object field by outside code. In these cases, JIT code is invalidated/bailed, but the whole point of asm.js is not to have to do that (it's worse for codegen and requires more runtime metadata). Other JS engines don't have sound heap summarization like TI (afaik), so they'd do worse (in fact, if all they had was hidden classes, they'd do much worse on polymorphic code).

ghost commented 10 years ago

Another non-AOT optimization advantage of typed references is that, even if heap summarization could (with work) do pretty well on well-typed code, the user may not write well-typed code (or know exactly what qualifies as "well-typed code"). For example, using a single type definition object to create instances in two different contexts which store unrelated types in the same field will lose type precision on the field and thus introduce guards at getprops. With typed references, the programmer who intends to write efficient code has clear rules to follow (with feedback when they make a mistake).

kg commented 10 years ago

Ref-as-default and inline-as-default both make sense depending on your use case.

If your primary use case is C-style structs ('struct' in C#, etc) you want inline-as-default because it's the most obvious layout and it has better performance characteristics. If your primary use case is Java/C#-style heap objects ('class' in C#), then you want member values to be ref, MOST of the time. In C# it's still possible that you might want to have a 'class' that only has inline 'struct' members instead of having all its members be heap references.

I do think that ref-as-default is probably less error prone for the average neophyte programmer. Having foo.x appear to be a typed object but actually be an alias is probably a surprising behavior.

For inline members, like Line.x in your example, is there a trivial way to 'box' them onto the heap? That's the mechanism typically used in C# to make it easy to extract an inline structure onto the GC heap. In most cases it is done automatically:

Line l = ...;
Point p = l.x; // Inline, stored directly on stack. Not a reference.
fixed (Line * pL = &l) {
  Point * pP = &(pL->x); // Reference
}
object oP = l.x; // Boxed - storing a value into a local of type 'object' boxes it
p = (Point)oP; // Unbox

Typed objects could expose a method called 'box' or something similar that clones them into a heap instance if they are currently pointers into a buffer. If they're already a heap instance it could be a no-op. This would make passing a typed object across a function boundary, or storing it into a field, trivially safe: just call .box() on all the values. Without a mechanism like that I think you need global knowledge of your code to know whether a typed object member is safe to pass around?
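To make the idea concrete, here is a small sketch built on `ArrayBuffer`/`DataView`. `PointView`, its 8-byte layout, and the `ownsBuffer` flag are all assumptions for illustration, not part of any typed objects spec.

```javascript
// Hypothetical Point view: two little-endian int32 fields (x, y) at a byte offset.
class PointView {
  constructor(buffer, byteOffset = 0, ownsBuffer = false) {
    this.view = new DataView(buffer, byteOffset, 8);
    this.ownsBuffer = ownsBuffer; // true for standalone ("heap") instances
  }
  get x() { return this.view.getInt32(0, true); }
  set x(v) { this.view.setInt32(0, v, true); }
  get y() { return this.view.getInt32(4, true); }
  set y(v) { this.view.setInt32(4, v, true); }

  // box(): no-op for standalone instances, copy-out for views into a larger buffer.
  box() {
    if (this.ownsBuffer) return this;
    const copy = new PointView(new ArrayBuffer(8), 0, true);
    copy.x = this.x;
    copy.y = this.y;
    return copy;
  }
}

// A "Line" here is 16 bytes: two inline Points back to back.
const lineBuffer = new ArrayBuffer(16);
const start = new PointView(lineBuffer, 0); // alias into the line's storage
start.x = 1;

const boxed = start.box(); // independent copy with its own buffer
start.x = 99;              // later writes to the line no longer affect `boxed`
```

After `box()`, the callee can hold on to `boxed` without caring whether the caller keeps mutating the line.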

nikomatsakis commented 8 years ago

So, @tschneidereit and I worked through this in some detail recently. Our conclusion was that things ultimately worked more smoothly if we retained the "embedded by default" style of the specification. However, it's also important to be able to have first-class, typed references, so we want to have every type descriptor offer a ref variant that gives you a typed reference. For example, if Point is a StructType instance, then Point.ref would be a type for references to Point values (versus embedded values). (This suggests, as an aside, that we should perhaps rename the type for object references to ref, for consistency.) Therefore, I'm inclined to close this issue and leave things as they are.

Note that introducing something like Point.ref will require us to be able to make "incomplete" struct types that can only be used by references (the equivalent of struct Foo; in C), so that we can support recursive types.

phpnode commented 8 years ago

@nikomatsakis I've been trying to come up with a way of dealing with incomplete structs for the lib I'm working on. It's tricky to do it in a safe and fast way using the existing proposals here, so I've started using something similar to Promise's revealing constructor pattern.

If the struct does not contain recursive references, fields can be defined with an object as normal:

const Point = new StructType('Point', {x: int32, y: int32});

If recursive references are required, we pass in a function instead of an object:

const TreeNode = new StructType('TreeNode', TreeNode => {
  return {
    value: int32,
    left: TreeNode,
    right: TreeNode
  };
});

This approach allows recursive references:

let User, Role;
User = new StructType('User', User => {
  Role = new StructType('Role', {
    name: string,
    users: User.vector()
  });
  return {
    name: string,
    roles: Role.vector()
  };
});

But generally speaking I think this is nicer if it's reference-by-default, that's how JS normally behaves and that's what users will expect. Otherwise, if people don't know what they're doing, this has weird effects:

const roles = new Role.Vector([
  {name: 'Admin'},
  {name: 'Guest'}
]);

const alice = new User({name: 'Alice', roles: [roles[0]] });

roles[0].name = 'Administrator';

alice.roles[0].name === roles[0].name; // false

I think users should have to opt in for that.
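The same copy-versus-alias difference can be reproduced today with plain typed arrays. This is only an analogy: each "role" is reduced to a single int32 id, and none of this is the StructType API.

```javascript
// Two "roles", each reduced to one int32 id, stored inline in a single buffer.
const roles = new Int32Array([10, 20]);

// Embedded semantics: storing a role into the user's own storage copies the value.
const aliceRoles = new Int32Array(1);
aliceRoles[0] = roles[0]; // copy, not alias

roles[0] = 11;
const embeddedInSync = aliceRoles[0] === roles[0]; // false: the copy went stale

// Reference-like semantics: keep a view onto the same underlying storage instead.
const aliceRoleRef = roles.subarray(0, 1); // shares roles.buffer
const refInSync = aliceRoleRef[0] === roles[0]; // true: both read the same bytes
```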

nikomatsakis commented 8 years ago

@phpnode what I had always assumed we would do is create "incomplete" struct types:

const Tree = new StructType();

The only thing you can legally do with a Tree in this state is create a reference type via Tree.ref. Then at some point you can call fulfill. So, for example, to make a binary tree type, I might fulfill the fields like so (note that Tree.ref here executes before fulfill is called):

Tree.fulfill({data: Any, left: Tree.ref, right: Tree.ref})
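A toy mock of that lifecycle, just to pin down the intended rules. The `StructType` class, its `create` method, and the `Any` placeholder below are all hypothetical; instances are ordinary JS objects rather than typed storage.

```javascript
// Hypothetical mock of an "incomplete" struct type; NOT a real StructType API.
class StructType {
  constructor(fields) {
    this.fields = null; // null means "incomplete"
    // The reference type is available immediately, even before fulfill().
    this.ref = Object.freeze({ kind: 'ref', target: this });
    if (fields !== undefined) this.fulfill(fields);
  }

  fulfill(fields) {
    if (this.fields !== null) throw new TypeError('already fulfilled');
    this.fields = Object.freeze({ ...fields });
  }

  // Stand-in for `new Tree(...)`: only legal once the type is complete.
  create(values = {}) {
    if (this.fields === null) {
      throw new TypeError('cannot instantiate an incomplete type');
    }
    const instance = {};
    for (const name of Object.keys(this.fields)) {
      instance[name] = name in values ? values[name] : null; // refs default to null
    }
    return instance;
  }
}

const Any = Object.freeze({ kind: 'any' }); // placeholder for the `Any` type
const Tree = new StructType();              // incomplete: only Tree.ref is usable

let rejectedEarly = false;
try { Tree.create(); } catch (e) { rejectedEarly = e instanceof TypeError; }

Tree.fulfill({ data: Any, left: Tree.ref, right: Tree.ref });
const leaf = Tree.create({ data: 1 });
const root = Tree.create({ data: 2, left: leaf }); // right stays null
```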

phpnode commented 8 years ago

@nikomatsakis the issue (for me, in my implementation at least) is that deciding whether or not a struct is finalized/fulfilled in this way seems pretty expensive / convoluted. An unfulfilled struct would poison anything which depends on it, and that dependency tree can be pretty deep. How do I efficiently decide when a particular struct type can be instantiated?

nikomatsakis commented 8 years ago

@phpnode an unfulfilled struct does not, I think, have to poison anything that uses it. You can't instantiate (or embed) the struct until it is fulfilled. You can have types that use Foo.ref, but the only valid value they would be able to supply is null.

kg commented 8 years ago

If recursive references are required, we pass in a function instead of an object:
const TreeNode = new StructType('TreeNode', TreeNode => {
  return {
    value: int32,
    left: TreeNode,
    right: TreeNode
  };
});

I don't think this self-as-argument function approach covers all the cases that Type.ref does, though. I certainly wouldn't be able to use it to solve my problem. Some definition cycles span 3 or more types and you need a way to get at all of the types in advance, not just self. Forward-declaration is the way you solve that.

Forward declaration also simplifies multiple-phase type initialization, which is something you'll see in runtimes. I.e. declare all the names first, then define all their shapes, in single dumb passes. Using the self-as-argument function approach would require you to painstakingly construct dependency graphs to figure out what order to initialize the types in.

At the point where you're creating an instance of a type all the types it interacts with have been initialized, so I don't think this implies any runtime overhead. It's not 'expensive', and arguably it's not convoluted either, for either API consumer or API implementer. It's just splitting the task of initialization into pieces.
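For illustration, the two dumb passes might look roughly like this for a cycle spanning three types. The `StructType` mock and the `shapes` table are hypothetical, with field targets given by name so that every type can be declared before any shape is defined.

```javascript
// Hypothetical forward-declarable type; only `ref` is usable before fulfill().
class StructType {
  constructor(name) {
    this.name = name;
    this.fields = null;
    this.ref = Object.freeze({ kind: 'ref', target: this });
  }
  fulfill(fields) {
    this.fields = Object.freeze({ ...fields });
  }
}

// Made-up shapes forming a 3-type cycle: A -> B -> C -> A.
const shapes = {
  A: { next: 'B' },
  B: { next: 'C' },
  C: { next: 'A' },
};

// Pass 1: declare every name.
const types = {};
for (const name of Object.keys(shapes)) {
  types[name] = new StructType(name);
}

// Pass 2: define every shape. Order is irrelevant because all names already exist.
for (const [name, shape] of Object.entries(shapes)) {
  const fields = {};
  for (const [field, targetName] of Object.entries(shape)) {
    fields[field] = types[targetName].ref;
  }
  types[name].fulfill(fields);
}
```

Neither pass needs to look at the dependency graph, which is the point being made about forward declaration.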

phpnode commented 8 years ago

ok, thanks for your comments, I'm just going to try and implement it in the way you're suggesting here and report back if I run into issues.