Segmentation fault when actor receives a reference to itself via a class created in a different actor.

Perelandric commented 8 years ago

_EDIT:_ Skip down to https://github.com/ponylang/ponyc/issues/1118#issuecomment-238431412 to see the most reduced example of the issue.

I gutted the HTTP server down to this, which I think is a reproduction of the seg fault in #937. I kept the original type names so that they could be somewhat related back to that package if necessary. Seems to have something to do with the partial application this~answer(). At least, if I interrupt anything after that assignment, the seg fault disappears.

use "collections"

interface val ResponseHandler
  fun val apply(request: Payload val, response: Payload val): Any

interface val RequestHandler
  fun val apply(request: Payload): Any

primitive Handle
  fun val apply(request: Payload) =>
    (consume request).respond(Payload)

actor _ServerConnection
  let _handler: RequestHandler

  new create(h: RequestHandler) => _handler = h

  be dispatch(request: Payload) =>
    request.handler = recover this~answer() end
    _handler(consume request)

  be answer(request: Payload val, response: Payload val) =>
    None

class iso Payload
  var handler: (ResponseHandler | None) = None

  fun iso respond(response': Payload) =>
    try
      let h = (handler) as ResponseHandler
      h(consume this, consume response')
    end

actor Main
  new create(env: Env) =>
    let t = Test

    for i in Range(0, 10_000) do
      t.do_it()
    end

actor Test
  be do_it() =>
    _ServerConnection(Handle).dispatch(Payload)

This could probably be further reduced, but I wanted to maintain at least a slight semblance to the original code... and it's the middle of the night so I'm going to :sleeping:.

SeanTAllen commented 2 years ago

Ok so from what I see...

The actor is sending a message to another actor, but when it hits the end of its run with an empty queue, it sees its own rc as 0 and so goes through the early delete. Uh oh. Then the other actor tries to send back a gc message for having taken possession of the items in the message, and the actor it is sending to has already been deleted. Kaboom.

So the question is: why, on this particular send, is our rc 0 instead of being higher, as it should be?

The logic in question to "self delete" starts here: https://github.com/ponylang/ponyc/blob/main/src/libponyrt/actor/actor.c#L444

jemc commented 2 years ago

Here's a more minimal repro which segfaults without using partial application (it uses an explicit class val to hold the actor) and a bit less of the bouncing around between different functions (the dispatch behavior directly sends the message that goes boom instead of asking the Payload class to do it):


use "collections"

actor _BoomActor
  be dispatch(request: Payload) =>
    request.holder = _BoomActorHolder(this)
    boom_behavior(consume request, Payload)

  be boom_behavior(request: Payload val, response: Payload val) =>
    None

class val _BoomActorHolder
  let boom_actor: _BoomActor
  new val create(boom_actor': _BoomActor) => boom_actor = boom_actor'

class iso Payload
  var holder: (_BoomActorHolder | None) = None

actor Main
  new create(env: Env) =>
    let t = Test

    for i in Range(0, 1_000_000) do
      t.do_it()
    end

actor Test
  be do_it() =>
    _BoomActor.dispatch(Payload)

jemc commented 2 years ago

I have some ideas about what is going on here but need to discuss further with Sean.

SeanTAllen commented 2 years ago

@jemc and I have a plan to address this. There will be a performance impact, but correctness trumps performance. Joe will also write up a mitigation that can help offset some of the performance impact.

jemc commented 2 years ago

To summarize what we discussed in the Zulip thread:

The current Pony runtime has a correctness bug due to what is usually a valid optimization, but in this case is not.

Specifically, the Pony runtime traces immutable (val) objects shallowly - that is, it skips tracing of fields within such objects. This saves time by reducing how much tracing has to happen, and it is described as a safe optimization in the ORCA paper, because the "outer" val object acts as an upper bound on the lifetime of the "inner" objects referred to by its fields.

While that optimization is safe within the limited scope of what was considered in the ORCA paper, the reasoning ignores the counting of actor references (which was outside the scope of the ORCA paper).

If a val object has references (either directly as its fields, or transitively as fields of its fields) to any actors, those actors need to be traced. Hence, for such an object we cannot keep this optimization in place.

But for val objects which are known via static analysis to not possibly refer to any actors, this optimization is safe and we'd like to keep it in place if possible, to keep part of the benefit of this optimization for some workloads.
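As a rough illustration of that distinction, here is a small Pony sketch of val classes that do and do not reach an actor. The type names (Worker, DirectHolder, Envelope, Point) are made up for this comment and do not come from the runtime or from the repro above.

```pony
actor Worker
  be ping() => None

// Direct: a val object with an actor as one of its fields.
class val DirectHolder
  let worker: Worker
  new val create(worker': Worker) => worker = worker'

// Transitive: no actor field of its own, but it reaches one through `inner`.
class val Envelope
  let inner: DirectHolder
  new val create(inner': DirectHolder) => inner = inner'

// No actors reachable: only numeric fields, so shallow tracing stays safe.
class val Point
  let x: U64
  let y: U64
  new val create(x': U64, y': U64) =>
    x = x'
    y = y'

actor Main
  new create(env: Env) =>
    let envelope = Envelope(DirectHolder(Worker))
    // The actor is only reachable by following fields of val objects.
    envelope.inner.worker.ping()
    env.out.print(Point(1, 2).x.string())
```

Sending a DirectHolder or an Envelope is the case where skipping the fields would miss the Worker, while a Point could presumably keep the shallow path.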

As such, we want to add a new kind of static analysis to the compiler that can classify any given data type as "definitely contains no actor references" or "may possibly contain an actor reference". If we can mark a type with the internal designation contains_no_actors, then it is valid for that type to participate in the above-mentioned optimization, and the compiler should generate a trace function for that type which uses the optimized path when immutable. Otherwise, it would need to take a new pessimistic path for the sake of correctness, tracing it at runtime so that any actors it may contain are traced.

To determine if a type should be marked as contains_no_actors:

- If any field type is an actor type, or a composite type (tuple, union, intersection) referring to an actor type => return false.
- If any field type refers (possibly within a composite type) to a type which is not marked contains_no_actors => return false.
  - This implies we need to be able to do this analysis recursively, and it also implies we need a mechanism to break self-recursion.
  - We can likely use a similar mechanism to that used in subtype checking; that is, we can have an "assumptions list" for things that we are temporarily assuming to be marked contains_no_actors, and every time we recurse into a type we push it onto that list, such that we will surely terminate and no type will mark itself as contains_no_actors without some cause which is not itself.
- If the type under consideration is an abstract type (such as an interface or trait), and reachability analysis shows that the abstract type has in the reachable program subsumed any type which is not marked contains_no_actors => return false.
  - This also implies recursive analysis - this time recursing into subsumed types instead of into fields.
- Otherwise, the type has been shown to not possibly contain any actors => return true.
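The rules above can be sketched as a toy model in Pony. This is only an illustration under stated assumptions, not how ponyc implements it: TypeInfo, ActorFreeAnalysis, and the string-keyed type table are hypothetical stand-ins for the compiler's own type representation, composite types are assumed to be flattened into a plain list of member names, and an abstract type is modelled by listing the concrete types it subsumes.

```pony
use "collections"

// Hypothetical description of one type. ponyc would walk its own AST instead.
class val TypeInfo
  let is_actor: Bool
  // Field types of a concrete type, or the types subsumed by a trait or
  // interface. Composite types are assumed to be flattened into this list.
  let refers_to: Array[String] val

  new val leaf() =>
    is_actor = false
    refers_to = recover val Array[String] end

  new val actor_type() =>
    is_actor = true
    refers_to = recover val Array[String] end

  new val referring_to(names: Array[String] val) =>
    is_actor = false
    refers_to = names

class ActorFreeAnalysis
  let _types: Map[String, TypeInfo] = Map[String, TypeInfo]
  // Types currently being analysed further up the call stack.
  let _assumptions: Set[String] = Set[String]

  new create() => None

  fun ref declare(name: String, info: TypeInfo) =>
    _types(name) = info

  fun ref contains_no_actors(name: String): Bool =>
    // Break self-recursion: a type under analysis is assumed to be fine.
    if _assumptions.contains(name) then return true end
    let info =
      try
        _types(name)?
      else
        return false // unknown type: take the pessimistic path
      end
    if info.is_actor then return false end
    _assumptions.set(name)
    var result = true
    for dep in info.refers_to.values() do
      if not contains_no_actors(dep) then
        result = false
        break
      end
    end
    _assumptions.unset(name)
    result

actor Main
  new create(env: Env) =>
    let a = ActorFreeAnalysis
    a.declare("U64", TypeInfo.leaf())
    a.declare("Worker", TypeInfo.actor_type())
    a.declare("Point", TypeInfo.referring_to(recover val ["U64"; "U64"] end))
    a.declare("Holder", TypeInfo.referring_to(recover val ["Worker"] end))
    a.declare("Envelope", TypeInfo.referring_to(recover val ["Holder"] end))
    // Self-recursive type: terminates because of the assumptions list.
    a.declare("Node", TypeInfo.referring_to(recover val ["Node"; "U64"] end))
    // Abstract type, listed with the concrete types it subsumes.
    a.declare("Shape", TypeInfo.referring_to(recover val ["Point"; "Holder"] end))

    for name in ["Point"; "Holder"; "Envelope"; "Node"; "Shape"].values() do
      env.out.print(name + ": " + a.contains_no_actors(name).string())
    end
```

In this model Point and Node qualify for contains_no_actors, while Holder, Envelope, and Shape fall back to the pessimistic path because an actor is reachable from them. The sketch does not cache results, which keeps the assumptions list simple; a real implementation would presumably want to memoize without recording answers that depend on a still-open assumption.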

SeanTAllen commented 1 year ago

There is a working fix for this at https://github.com/ponylang/ponyc/pull/4256/files. Note that it has a large performance impact at the moment because all vals that previously weren't traced on send are currently always traced. We will need to improve that so we only trace objects that might contain a reference to an actor.

SeanTAllen commented 1 year ago

Closed by https://github.com/ponylang/ponyc/pull/4256
