Closed Perelandric closed 1 year ago
Ok so from what I see...
The actor is sending a message to another actor, but when it hits the end of its run with an empty queue, it sees its own rc as 0 and so goes through the early delete. Then, when the other actor tries to send back a GC message for having taken possession of the items in the message: kaboom.
So the question is: why, on this particular send, is our rc 0 instead of being higher, as it should be?
The logic in question to "self delete" starts here: https://github.com/ponylang/ponyc/blob/main/src/libponyrt/actor/actor.c#L444
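To make the failure mode concrete, here is a toy model of it in Python. It is only a sketch of the reasoning above, not the code in `actor.c`: `ToyActor`, `finish_run`, and `demo` are made-up names, and the real runtime's bookkeeping is far more involved.

```python
from collections import deque


class ToyActor:
    """Hypothetical stand-in for a runtime actor; not the ponyc structure."""

    def __init__(self, name):
        self.name = name
        self.rc = 0           # references other actors are known to hold to us
        self.queue = deque()  # pending messages
        self.freed = False

    def send(self, message):
        if self.freed:
            # In the real runtime this would be a use-after-free, not an exception.
            raise RuntimeError(f"message {message!r} sent to freed actor {self.name}")
        self.queue.append(message)

    def finish_run(self):
        """End of a run: self-delete if the queue is empty and rc is 0."""
        if not self.queue and self.rc == 0:
            self.freed = True


def demo(count_reference):
    sender = ToyActor("sender")
    # The sender has just sent a message that indirectly refers back to itself.
    if count_reference:
        sender.rc += 1  # the reference was traced, so rc reflects it
    sender.finish_run()
    try:
        sender.send("gc acknowledgement")  # the receiver reports back
    except RuntimeError as err:
        return str(err)
    return "delivered"
```

With `demo(True)` the acknowledgement is delivered; with `demo(False)` the sender has already deleted itself and the send fails, which is the toy version of the segfault.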
Here's a more minimal repro which segfaults without using partial application (it uses an explicit `class val` to hold the actor) and with a bit less bouncing around between different functions (the `dispatch` behavior directly sends the message that goes boom instead of asking the `Payload` class to do it):

    use "collections"

    actor _BoomActor
      be dispatch(request: Payload) =>
        request.holder = _BoomActorHolder(this)
        boom_behavior(consume request, Payload)

      be boom_behavior(request: Payload val, response: Payload val) =>
        None

    class val _BoomActorHolder
      let boom_actor: _BoomActor
      new val create(boom_actor': _BoomActor) => boom_actor = boom_actor'

    class iso Payload
      var holder: (_BoomActorHolder | None) = None

    actor Main
      new create(env: Env) =>
        let t = Test
        for i in Range(0, 1_000_000) do
          t.do_it()
        end

    actor Test
      be do_it() =>
        _BoomActor.dispatch(Payload)

I have some ideas about what is going on here but need to discuss further with Sean.
@jemc and I have a plan to address this. There will be a performance impact, but correctness trumps performance. Joe will also write up a mitigation that can help offset some of the performance impact.
To summarize what we discussed in the Zulip thread:
The current Pony runtime has a correctness bug due to what is usually a valid optimization, but in this case is not.
Specifically, the Pony runtime traces immutable (`val`) objects shallowly; that is, it skips tracing of fields within such objects. This saves time by reducing how much tracing has to happen, and it is described as a safe optimization in the ORCA paper, because the "outer" `val` object acts as an upper bound on the lifetime of the "inner" objects referred to by its fields.
While that optimization is safe within the limited scope of what was considered in the ORCA paper, the reasoning ignores the counting of actor references (which was outside the scope of the ORCA paper).
If a `val` object has references (either directly as its fields, or transitively as fields of its fields) to any actors, those actors need to be traced. Hence, for such an object we cannot keep this optimization in place.
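A small Python model may help show why the shallow trace loses the actor. This is an illustration only: `Obj`, `ActorRef`, and `trace` are hypothetical names, and the object graph mirrors the repro (`Payload` holding a `_BoomActorHolder` holding the actor) rather than anything in the runtime.

```python
from dataclasses import dataclass, field


@dataclass
class ActorRef:
    """Hypothetical reference to an actor."""
    name: str


@dataclass
class Obj:
    """Hypothetical heap object; `immutable` models the `val` capability."""
    name: str
    immutable: bool = False
    fields: list = field(default_factory=list)


def trace(root, skip_immutable_fields):
    """Return the names of the actors reached while tracing `root`."""
    found, stack, seen = [], [root], set()
    while stack:
        node = stack.pop()
        if isinstance(node, ActorRef):
            found.append(node.name)
            continue
        if id(node) in seen:
            continue
        seen.add(id(node))
        if node.immutable and skip_immutable_fields:
            continue  # shallow trace: the fields of a val object are not visited
        stack.extend(node.fields)
    return found


holder = Obj("holder", immutable=True, fields=[ActorRef("boom_actor")])
payload = Obj("payload", immutable=True, fields=[holder])
```

Here `trace(payload, skip_immutable_fields=True)` finds no actors at all, while `trace(payload, skip_immutable_fields=False)` finds `boom_actor`. In this model, the missing entry is the actor whose reference count never gets bumped.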
But for `val` objects which are known via static analysis to not possibly refer to any actors, this optimization is safe and we'd like to keep it in place if possible, to keep part of the benefit of this optimization for some workloads.
As such, we want to add a new kind of static analysis to the compiler that can classify any given data type as "definitely contains no actor references" or "may possibly contain an actor reference". If we can mark a type with the internal designation `contains_no_actors`, then it is valid for that type to participate in the above-mentioned optimization, and the compiler should generate a trace function for that type which uses the optimized path when immutable. Otherwise, it would need to take a new pessimistic path for the sake of correctness, tracing it at runtime so that any actors it may contain are traced.
To determine if a type should be marked as `contains_no_actors`:

- If any field type is an `actor` type, or a composite type (tuple, union, intersection) referring to an `actor` type => return `false`.
- If any field type refers (possibly within a composite type) to a type which is not marked `contains_no_actors` => return `false`.
  - To avoid infinite recursion, we keep a list of the types currently being tested for `contains_no_actors`, and every time we recurse into a type we push it onto that list, such that we will surely terminate and no type will mark itself as `contains_no_actors` without some cause which is not itself.
- If the type under consideration is an abstract type (such as an `interface` or `trait`), and reachability analysis shows that the abstract type has, in the reachable program, subsumed any type which is not marked `contains_no_actors` => return `false`.
- Otherwise, the type has been shown to not possibly contain any actors => return `true`.
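The rules above can be sketched in Python as follows. This is a rough model under stated assumptions, not the compiler pass: `TypeDef`, its `kind`/`fields`/`subsumed` attributes, and the example `Node` and `Shape` types are all hypothetical, composite field types are assumed to be pre-flattened into lists of type names, and results are not cached.

```python
from dataclasses import dataclass, field


@dataclass
class TypeDef:
    """Hypothetical type description; not the ponyc AST or reach structures."""
    kind: str                                     # "actor", "class", "primitive", "trait"
    fields: list = field(default_factory=list)    # one list of type names per field
    subsumed: list = field(default_factory=list)  # abstract types: reachable subtypes


def contains_no_actors(name, types, in_progress=None):
    in_progress = set() if in_progress is None else in_progress
    tdef = types[name]
    if tdef.kind == "actor":
        return False
    if name in in_progress:
        # Already being tested further up the stack; a cycle on its own
        # is not a cause for returning false.
        return True
    in_progress.add(name)
    try:
        for field_types in tdef.fields:
            for referred in field_types:
                if not contains_no_actors(referred, types, in_progress):
                    return False
        for concrete in tdef.subsumed:
            if not contains_no_actors(concrete, types, in_progress):
                return False
        return True
    finally:
        in_progress.discard(name)


types = {
    "_BoomActor": TypeDef("actor"),
    "_BoomActorHolder": TypeDef("class", fields=[["_BoomActor"]]),
    "None": TypeDef("primitive"),
    "Payload": TypeDef("class", fields=[["_BoomActorHolder", "None"]]),
    "Node": TypeDef("class", fields=[["Node", "None"]]),              # made-up recursive type
    "Shape": TypeDef("trait", subsumed=["Node", "_BoomActorHolder"]),  # made-up trait
}
```

With these definitions, `Payload` and `Shape` come out as "may possibly contain an actor reference", while the self-referential `Node` is still classified as `contains_no_actors`, because the in-progress set stops the recursion without counting the cycle against it.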
There is a working fix for this at https://github.com/ponylang/ponyc/pull/4256/files. Note that it has a large performance impact at the moment because all vals that previously weren't traced on send are currently always traced. We will need to improve that so we only trace objects that might contain a reference to an actor.
_EDIT:_ Skip down to https://github.com/ponylang/ponyc/issues/1118#issuecomment-238431412 to see the most reduced example of the issue.
I gutted the HTTP server down to this, which I think is a reproduction of the seg fault in #937. I kept the original type names so that they could be somewhat related back to that package if necessary. It seems to have something to do with the partial function `this~answer()`. At least, if I interrupt anything after that assignment, the seg fault disappears.

This could probably be further reduced, but I wanted to maintain at least a slight resemblance to the original code... and it's the middle of the night so I'm going to :sleeping:.