Segmentation fault when actor receives a reference to itself via a class created in a different actor.

Perelandric commented 8 years ago

_EDIT:_ Skip down to https://github.com/ponylang/ponyc/issues/1118#issuecomment-238431412 to see the most reduced example of the issue.

I gutted the HTTP server down to this, which I think is a reproduction of the seg fault in #937. I kept the original type names so that they could be somewhat related back to that package if necessary. Seems to have something to do with the partial function this~answer(). At least if I interrupt anything after that assignment, it the seg fault disappeared.

interface val ResponseHandler
  fun val apply(request: Payload val, response: Payload val): Any

interface val RequestHandler
  fun val apply(request: Payload): Any

primitive Handle
  fun val apply(request: Payload) =>
    (consume request).respond(Payload)

actor _ServerConnection
  let _handler: RequestHandler

  new create(h: RequestHandler) => _handler = h

  be dispatch(request: Payload) =>
    request.handler = recover this~answer() end
    _handler(consume request)

  be answer(request: Payload val, response: Payload val) =>
    None

class iso Payload
  var handler: (ResponseHandler | None) = None

  fun iso respond(response': Payload) =>
    try
      let h = (handler) as ResponseHandler
      h(consume this, consume response')
    end

actor Main
  new create(env: Env) =>
    let t = Test

    for i in Range(0, 10_000) do
      t.do_it()
    end

actor Test
  be do_it() =>
    _ServerConnection(Handle).dispatch(Payload)

This could probably be further reduced, but I wanted to maintain at least a slight semblance to the original code... and it's the middle of the night so I'm going to :sleeping:.

Perelandric commented 8 years ago

Here's a further reduced example:

interface val ResponseHandler
  fun val apply(request: Payload val): Any

actor _ServerConnection
  be dispatch(request: Payload) =>
    request.handler = recover this~answer() end
    try
      let h = request.handler as ResponseHandler
      h(consume request)
    end

  be answer(request: Payload val) => None

class iso Payload
  var handler: (ResponseHandler | None) = None

actor Main
  new create(env: Env) =>
    let t = Test

    for i in Range(0, 10_000) do
      t.do_it()
    end

actor Test
  be do_it() =>
    _ServerConnection.dispatch(Payload)

If I change the dispatch method to this, the seg fault is gone:

  be dispatch(request: Payload) =>
    let x: (ResponseHandler | None) = recover this~answer() end
    try
      let h = x as ResponseHandler
      h(consume request)
    end

but if I leave off the : (ResponseHandler | None) on the x declaration, it complains of capture violations. If I then make it recover to val, it compiles and the seg fault is again gone.

    let x = recover val this~answer() end

Or if I simply set request.handler = None after the let h assignment, the seg fault is gone.

actor _ServerConnection
  be dispatch(request: Payload) =>
    request.handler = recover this~answer() end
    try
      let h = request.handler as ResponseHandler
      request.handler = None
      h(consume request)
    end

So it seems like it has something to do with sending the _ServerConnection into its own .answer() method via the partial function that captures this and gets assigned to the Payload.

Unfortunately, adding handler = None in the respond method in net/http/payload.pony does not fix it. :confused:

Perelandric commented 8 years ago

Given that last bit of info, I can reduce it just a bit more:

use "collections"

interface val ResponseHandler
  fun val apply(request: Payload val): Any

actor _ServerConnection
  be dispatch(request: Payload) =>
    request.handler = recover this~answer() end
    this.answer(consume request)

  be answer(request: Payload val) => None

class iso Payload
  var handler: (ResponseHandler | None) = None

actor Main
  new create(env: Env) =>
    let t = Test

    for i in Range(0, 10_000) do
      t.do_it()
    end

actor Test
  be do_it() =>
    _ServerConnection.dispatch(Payload)

Note that if I create the Payload directly in dispatch instead of passing it from do_it(), we don't get the seg fault. Same if I eliminate _ServerConnection and put all its methods right in Test.

I tried to see if I could create the seg fault by passing the partial function directly to answer instead of via Payload, but wasn't able to get the type quite right for the parameter.

Perelandric commented 8 years ago

We really don't need the partial function at all. The same seg fault can be found by having Payload directly reference the _ServerConnection object.

actor _ServerConnection
  be dispatch(request: Payload) =>
    request.handler = recover this end
    this.answer(consume request)

  be answer(handler: Payload val) => None

class iso Payload
  var handler: (_ServerConnection | None) = None

SeanTAllen commented 8 years ago

So this last 8 line example is enough on its own to cause a segfault? Nicely narrowing of the issue down @Perelandric

Perelandric commented 8 years ago

@SeanTAllen Thanks, and yes, with the Main and Test actors from the example above it, it should reproduce the seg fault. I'll put a full version of the smallest example in this comment.

It'll be interesting to see if this actually fixes the httpserver example. There was some other odd behavior that I noted in issue #937 which is where my investigation began, but this is certainly a piece of the puzzle if not the entire thing.

Here's the most simplified yet complete code that I could find to reproduce the issue:

actor Main
  new create(env: Env) =>
    let t = Test

    for i in Range(0, 10_000) do
      t.do_it()
    end

actor Test
  be do_it() => _ServerConnection.dispatch(Payload)

class iso Payload
  var handler: (_ServerConnection | None) = None

actor _ServerConnection
  be dispatch(p: Payload) =>
    p.handler = recover this end
    this.answer(consume p) // <-- Pass the Payload that references this _ServerConnection

  be answer(p: Payload val) => None

Notes:

In the more complete examples, the _ServerConnection is assigned to the Payload via a partial function binding instead of being assigned directly.
If the Payload object is created directly in dispatch() instead of at the call location in do_it(), there's no seg fault.
If you assign None to the .handler field instead of the _ServerConnection object, there's no seg fault.
The seg fault is intermittent. It's more likely to happen using a Range with 10_000, than with 100.

SeanTAllen commented 8 years ago

Have you tried running a debug version in LLDB to see what is causing the segfault?

Perelandric commented 8 years ago

@SeanTAllen Never used LLDB but this'll be as good a reason as any to tinker. I'll see if I can give it a shot tomorrow.

EDIT: Output from my first, mostly blind use of LLDB

(lldb) target create "./test"
Current executable set to './test' (x86_64).
(lldb) run
Process 18472 launched: './test' (x86_64)
Process 18472 exited with status = 0 (0x00000000) 
(lldb) run
Process 18485 launched: './test' (x86_64)
Process 18485 stopped
* thread #2: tid = 18490, 0x0000000000408063 test`pony_traceunknown + 3, name = 'test', stop reason = signal SIGSEGV: invalid address (fault address: 0x40)
    frame #0: 0x0000000000408063 test`pony_traceunknown + 3
test`pony_traceunknown:
->  0x408063 <+3>:  cmpq   $0x0, 0x40(%rax)
    0x408068 <+8>:  je     0x408070                  ; <+16>
    0x40806a <+10>: movq   0x18(%rdi), %rax
    0x40806e <+14>: jmpq   *%rax
(lldb) bt
* thread #2: tid = 18490, 0x0000000000408063 test`pony_traceunknown + 3, name = 'test', stop reason = signal SIGSEGV: invalid address (fault address: 0x40)
  * frame #0: 0x0000000000408063 test`pony_traceunknown + 3
    frame #1: 0x00000000004020ef test`Payload_Trace + 15
    frame #2: 0x000000000040e360 test`ponyint_gc_handlestack + 64
    frame #3: 0x0000000000407fa0 test`ponyint_mark_done + 32
    frame #4: 0x0000000000409304 test`ponyint_actor_run + 772
    frame #5: 0x0000000000405b85 test`run + 69
    frame #6: 0x0000000000405edd test`run_thread + 29
    frame #7: 0x00007ffff7bc4184 libpthread.so.0`start_thread + 196
    frame #8: 0x00007ffff74d737d libc.so.6`clone + 109
(lldb)

Had trouble setting breakpoints.

Perelandric commented 8 years ago

@SeanTAllen One more observation: It seems if the parameter of the answer() method is tag or iso, there's no segfault. Only a problem with val.

// -------------------v--- iso and tag are OK, but not val
be answer(p: Payload val) => None

peterzandbergen commented 8 years ago

Just out of curiosity, is there any news on this one? After watching the VUG about garbage collection I wondered if the error should be looked for in that area. Just a wild guess.

SeanTAllen commented 8 years ago

Nothing yet.

peterzandbergen commented 8 years ago

Difficult one I guess

SeanTAllen commented 8 years ago

And we could use more folks contributing. Its hard to keep up and do the day job and all that. Such is things... ¯(ツ)/¯

aturley commented 8 years ago

Hrm ... I just tried @Perelandric's minimal example from August 8 and I didn't get a seg fault. I tried both a debug and release build on OS X (10.11.4) using compiler version 0.3.0-14-gd8c6c55.

SeanTAllen commented 8 years ago

@perelandric can you see if this still happens for you?

Perelandric commented 8 years ago

@SeanTAllen Yep, took a few extra runs before it would segfault, but it got there.

@aturley Try upping the range from 10_000 to 100_000. That gives me the segfault more reliably.

aturley commented 8 years ago

I got a little more information from lldb:

(lldb) run
Process 12481 launched: './build/issue-1118' (x86_64)
Process 12481 stopped
* thread #2: tid = 0x3422ef, 0x000000010000b862 issue-1118`pony_traceunknown(ctx=0x00000001097ffc48, p=0x00000001217dd000, m=2) + 34 at trace.c:114, stop reason = EXC_BAD_ACCESS (code=1, address=0x40)
    frame #0: 0x000000010000b862 issue-1118`pony_traceunknown(ctx=0x00000001097ffc48, p=0x00000001217dd000, m=2) + 34 at trace.c:114
   111  {
   112    pony_type_t* t = *(pony_type_t**)p;
   113 
-> 114    if(t->dispatch != NULL)
   115    {
   116      ctx->trace_actor(ctx, (pony_actor_t*)p);
   117    } else {

Something is going a little sideways here in GC land.

SeanTAllen commented 8 years ago

Yeah. At the moment, this is waiting on @sylvanc or I having time to address. At the moment, I'm working on new website before returning back to the land of pony code. I'm not going to get to this til sometime in mid October. If anyone wants to take a shot at it before then, have at it.

dimiii commented 8 years ago

Hi, seems like by a happy coincidence it's just a simple typo.

-  pony_type_t* t = *(pony_type_t**)p;
+  pony_type_t* t = (pony_type_t*)p;

After a quick fix all is fine. Here is the cure https://github.com/ponylang/ponyc/pull/1249

SeanTAllen commented 8 years ago

@dimiii all CI jobs failed with a segfault with that change applied. I left some questions for you on #1249.

Praetonus commented 8 years ago

It looks like this is an uninitialised type descriptor pointer problem. I'll try digging into it.

Praetonus commented 8 years ago

I think I got it. This is a premature free issue.

Full tracing of val objects sent in messages is deferred until a GC cycle on the owner. This means the _ServerConnection assigned to the Payload and then sent in a message (via the Payload) has a reference count greater than 0 but isn't aware of it until the Payload is traced. By then, the _ServerConnection might have been GC'd, which of course results in bad stuff happening.

dimiii commented 8 years ago

@Praetonus @SeanTAllen

BTW, Do you have plans to bootstrap the compiler? I mean the compiler and runtime.

jemc commented 8 years ago

@dimiii - Building a Pony compiler in Pony is a long-term goal, though it represents a major amount of work. There are no plans to write the runtime in Pony, as it is built on lower-level constructs that would be impractical to provide in Pony.

SeanTAllen commented 8 years ago

@dimiii re: the runtime point @jemc from a safety standpoint, a Pony runtime written in something like Rust would be awesome. However, C is far more accessible to the community and as such, the plans are to stay with C for the forseeable future for the runtime.

SeanTAllen commented 8 years ago

@Praetonus should this be relabeled from "ready for work"?

Praetonus commented 8 years ago

I'm not sure. I can see a fix for this but I think it would have some performance penalty. Maybe removing it until we've discussed it on the sync call would make sense.

peterzandbergen commented 8 years ago

Stability should be more important, imho. I would not advocate using a tool for server applications if it could occasionally crash.

peterzandbergen commented 8 years ago

Especially when the design of the language is to eliminate crashes caused by the programmer.

SeanTAllen commented 8 years ago

no one is suggesting not fixing. the concern is, can we find a way to fix without having a performance impact.

jemc commented 8 years ago

@peterzandbergen - it definitely still needs to be fixed, and it's still marked as high priority. Runtime segmentation faults are not acceptable .

SeanTAllen commented 8 years ago

Ok, issue located. @sylvanc will be working on this. Thank you everyone who contributed to finding this. Extra props to @Perelandric for lots of narrowing down a minimal case and @Praetonus for digging into the code.

sylvanc commented 8 years ago

Yes, thank you VERY much! This is an important and subtle and tricky one.

peterzandbergen commented 8 years ago

Well done, never doubted that this would not be picked up in a serious way. Kudos

SeanTAllen commented 7 years ago

I'm trying to verify that #1507 fixes this issue but I'm unable to reproduce the reported problem.

OSX El Capitan. LLVM 3.9.1

@Praetonus @Perelandric @aturley can any of you reproduce?

jkleiser commented 7 years ago

Both the original code and the most simplified code here give me "can't find declaration of 'Range'". Why?

Praetonus commented 7 years ago

@jkleiser You have to add use "collections" at the beginning of the code.

jkleiser commented 7 years ago

Thanks @Praetonus . The simplified code now got compiled and ran without problems on my Mac with ponyc 0.10.0-74555f7.

moesol commented 7 years ago

The simplified code works on my Ubuntu machine. Both optimized and with ponyc --debug.

0.10.0-1c33065 [release] compiled with: llvm 3.9.0 -- gcc-5 (Ubuntu 5.4.1-2ubuntu1~14.04) 5.4.1 20160904

However, when I run the original code I get Segmentation fault (core dumped)

peterzandbergen commented 7 years ago

I have noticed that the issue did not occur in my environment anymore, being Ubuntu Mate in virtual box. Will test again this week to seeing if the problem has been resolved.

Op ma 23 jan. 2017 07:54 schreef Robert Hastings notifications@github.com:

The simplified code works on my Ubuntu machine. Both optimized and with ponyc --debug.

0.10.0-1c33065 [release] compiled with: llvm 3.9.0 -- gcc-5 (Ubuntu 5.4.1-2ubuntu1~14.04) 5.4.1 20160904

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ponylang/ponyc/issues/1118#issuecomment-274413255, or mute the thread https://github.com/notifications/unsubscribe-auth/AI2XrpZfLdrKh8_toJs6Mh3PLyCqZ6Vhks5rVE6_gaJpZM4JeyOG .

--

Peter Zandbergen Tel: +31 6 460 55 872

linkedIn: peterzandbergen https://www.linkedin.com/in/peterzandbergen

Perelandric commented 7 years ago

@peterzandbergen See the discussion at: https://github.com/ponylang/ponyc/pull/1507

SeanTAllen commented 7 years ago

Just an update. @sylvanc is still working on a fix for this.

SeanTAllen commented 6 years ago

Original version still segfaults as of 0.21.0

SeanTAllen commented 4 years ago

The original version of this segfaults if I add some @printf output that slows everything down some. But, only when doing an optimized build.

0.32.0-a7a80758 [release] compiled with: llvm 7.0.1 -- cc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0 Defaults: pic=true

SeanTAllen commented 2 years ago

The original version of this segfaults for me but none of the others. I looked at it for a while and I'm somewhat confused. I hope to be able to spend more time on it and maybe get some more eyes on it.

SeanTAllen commented 2 years ago

The original version that segfaults doesn't if the usage of Payload val is changed to Payload so that an iso never becomes a val. There's still a bug there.

works:

use "collections"

interface val ResponseHandler
  fun val apply(request: Payload, response: Payload): Any

interface val RequestHandler
  fun val apply(request: Payload): Any

primitive Handle
  fun val apply(request: Payload) =>
    (consume request).respond(Payload)

actor _ServerConnection
  let _handler: RequestHandler

  new create(h: RequestHandler) => _handler = h

 be dispatch(request: Payload) =>
    request.handler = recover this~answer() end
    _handler(consume request)

  be answer(request: Payload, response: Payload) =>
    None

class iso Payload
  var handler: (ResponseHandler | None) = None

  fun iso respond(response': Payload) =>
    try
      let h = (handler) as ResponseHandler
      h(consume this, consume response')
    end

actor Main
  new create(env: Env) =>
    let t = Test

    for i in Range(0, 10_000) do
      t.do_it()
    end

actor Test
  be do_it() =>
    _ServerConnection(Handle).dispatch(Payload)

This probably means that Benoit's statement here:

I think I got it. This is a premature free issue.

Full tracing of val objects sent in messages is deferred until a GC cycle on the owner. This means the >_ServerConnection assigned to the Payload and then sent in a message (via the Payload) has a reference count >greater than 0 but isn't aware of it until the Payload is traced. By then, the _ServerConnection might have been >GC'd, which of course results in bad stuff happening.

might still be true.

When I look at it in the debugger, it sure seems like what we are pointing at in trace_unknown is now junk. That said, I'm not positive yet.

SeanTAllen commented 2 years ago

Here's an updated example for what still fails:

use "collections"

interface val ResponseHandler
  fun val apply(request: Payload val, response: Payload): Any

interface val RequestHandler
  fun val apply(request: Payload): Any

primitive Handle
  fun val apply(request: Payload) =>
    (consume request).respond(Payload)

actor _ServerConnection
  let _handler: RequestHandler

  new create(h: RequestHandler) => _handler = h

 be dispatch(request: Payload) =>
    request.handler = recover this~answer() end
    _handler(consume request)

  be answer(request: Payload val, response: Payload) =>
    None

class iso Payload
  var handler: (ResponseHandler | None) = None

  fun iso respond(response': Payload) =>
    try
      let h = (handler) as ResponseHandler
      h(consume this, consume response')
    end

actor Main
  new create(env: Env) =>
    let t = Test

    for i in Range(0, 1_000_000) do
      t.do_it()
    end

actor Test
  be do_it() =>
    _ServerConnection(Handle).dispatch(Payload)

The important part of this is that the request object for ResponseHandler is going from an iso when sent to a val on receive. If it isn't a val on receive then all is good.

I'm not sure exactly where the bug is however, an important difference introduced here is that in recv_remote_object we do not recurse the object if it is immutable. IE this code:

  if(!obj->immutable)
    recurse(ctx, p, t->trace);

is executed when Payload is also an iso and then we have no crash.

SeanTAllen commented 2 years ago

Here's the backtrace:

* thread #8, name = '1118-1', stop reason = signal SIGSEGV: invalid address (fault address: 0x58)
  * frame #0: 0x00000000004158ab 1118-1`pony_traceunknown(ctx=0x00007ffff7c22108, p=0x00007fffde3f1400, m=1) at trace.c:127:18
    frame #1: 0x000000000041360e 1118-1`ponyint_gc_handlestack(ctx=0x00007ffff7c22108) at gc.c:677:5
    frame #2: 0x0000000000415629 1118-1`ponyint_mark_done(ctx=0x00007ffff7c22108) at trace.c:75:3
    frame #3: 0x000000000040c415 1118-1`try_gc(ctx=0x00007ffff7c22108, actor=0x00007fffe6c20800) at actor.c:301:3
    frame #4: 0x000000000040ba04 1118-1`ponyint_actor_run(ctx=0x00007ffff7c22108, actor=0x00007fffe6c20800, polling=false) at actor.c:373:7
    frame #5: 0x000000000041a9ba 1118-1`run(sched=0x00007ffff7c220c0) at scheduler.c:960:23
    frame #6: 0x0000000000419ed3 1118-1`run_thread(arg=0x00007ffff7c220c0) at scheduler.c:1010:3
    frame #7: 0x00007ffff7fa8609 libpthread.so.0`start_thread + 217
    frame #8: 0x00007ffff7d48293 libc.so.6`__clone + 67

In pony_traceunkown the first thing we do is:

pony_type_t* t = *(pony_type_t**)p;

And in the debugger when trying to see what t is pointing at we get:

(lldb) p t
(pony_type_t *) $3 = 0x0000000000000010
(lldb) p *t
error: Couldn't apply expression side effects : Couldn't dematerialize a result variable: couldn't read its memory

i.e. that sure looks like junk by that point.

SeanTAllen commented 2 years ago

I've been playing around with this and sometimes, I get this assert on line 755 of actor.c to trigger:

  // Make sure we're not trying to send a message to an actor that is about
  // to be destroyed.
  pony_assert(!ponyint_actor_pendingdestroy(to));

I also sometimes get a crash in the atomics for sending a message itself.

I've been using a debug build of runtime and program with --ponygcinitial=1 and --ponygcfactor=1

SeanTAllen commented 2 years ago

At least once, the kaboom message was a GC - acquire message

* thread #4, name = '1118', stop reason = signal SIGABRT
  * frame #0: 0x00007ffff7c6c18b libc.so.6`raise + 203
    frame #1: 0x00007ffff7c4b859 libc.so.6`abort + 299
    frame #2: 0x000000000041d47a 1118`ponyint_assert_fail(expr="!ponyint_actor_pendingdestroy(to)", file="/home/sean/code/ponylang/ponyc/src/libponyrt/actor/actor.c", line=762, func="pony_sendv") at ponyassert.c:65:3
    frame #3: 0x0000000000410d2b 1118`pony_sendv(ctx=0x00007ffff7c21a08, to=0x00007ffff7c20a00, first=0x00007fffde41f0a0, last=0x00007fffde41f0a0, has_app_msg=false) at actor.c:762:3
    frame #4: 0x000000000041121b 1118`pony_sendp(ctx=0x00007ffff7c21a08, to=0x00007ffff7c20a00, id=4294967291, p=0x00007fffde41ec40) at actor.c:889:3
    frame #5: 0x0000000000417c9a 1118`ponyint_gc_sendacquire(ctx=0x00007ffff7c21a08) at gc.c:797:5

SeanTAllen commented 2 years ago

Ok so the "source" of our problem is the large block of code here:

https://github.com/ponylang/ponyc/blob/main/src/libponyrt/actor/actor.c#L444

Where we see the following:

an actor has completed its turn on the scheduler and it has an empty message queue
the actor has reference count of 0 so it is ok to delete

except, there's still a reference to it in a message and when the receiver sends an acquire message back, we go kaboom if we have already deleted ourselves or otherwise marked ourselves for deletion (which can happen in the debug case).

We apparently do not see the reference to ourself when we send (we probably aren't tracing?) and then bad things. Given that changing the receiver to an iso from a val suggests the general path that we are missing noticing that it is us.

More to look into but a start.

Commenting out the entire block that starts at line 444 of actor.c "fixes" this crash, but it does create other issues with memory growth etc that the local delete is meant to fix.

ponylang / ponyc

Segmentation fault when actor receives a reference to itself via a class created in a different actor. #1118