Closed: Perelandric closed this issue 1 year ago
Here's a further reduced example:
```pony
interface val ResponseHandler
  fun val apply(request: Payload val): Any

actor _ServerConnection
  be dispatch(request: Payload) =>
    request.handler = recover this~answer() end
    try
      let h = request.handler as ResponseHandler
      h(consume request)
    end

  be answer(request: Payload val) => None

class iso Payload
  var handler: (ResponseHandler | None) = None

actor Main
  new create(env: Env) =>
    let t = Test
    for i in Range(0, 10_000) do
      t.do_it()
    end

actor Test
  be do_it() =>
    _ServerConnection.dispatch(Payload)
```
If I change the `dispatch` method to this, the seg fault is gone:
```pony
be dispatch(request: Payload) =>
  let x: (ResponseHandler | None) = recover this~answer() end
  try
    let h = x as ResponseHandler
    h(consume request)
  end
```
but if I leave off the `: (ResponseHandler | None)` on the `x` declaration, it complains of capture violations. If I then make it recover to `val`, it compiles and the seg fault is again gone:
```pony
let x = recover val this~answer() end
```
Or if I simply set `request.handler = None` after the `let h` assignment, the seg fault is gone:
```pony
actor _ServerConnection
  be dispatch(request: Payload) =>
    request.handler = recover this~answer() end
    try
      let h = request.handler as ResponseHandler
      request.handler = None
      h(consume request)
    end
```
So it seems like it has something to do with sending the `_ServerConnection` into its own `.answer()` method via the partial function that captures `this` and gets assigned to the `Payload`.
Unfortunately, adding `handler = None` in the `respond` method in `net/http/payload.pony` does not fix it. :confused:
Given that last bit of info, I can reduce it just a bit more:
```pony
use "collections"

interface val ResponseHandler
  fun val apply(request: Payload val): Any

actor _ServerConnection
  be dispatch(request: Payload) =>
    request.handler = recover this~answer() end
    this.answer(consume request)

  be answer(request: Payload val) => None

class iso Payload
  var handler: (ResponseHandler | None) = None

actor Main
  new create(env: Env) =>
    let t = Test
    for i in Range(0, 10_000) do
      t.do_it()
    end

actor Test
  be do_it() =>
    _ServerConnection.dispatch(Payload)
```
Note that if I create the `Payload` directly in `dispatch` instead of passing it from `do_it()`, we don't get the seg fault. Same if I eliminate `_ServerConnection` and put all its methods right in `Test`.
I tried to see if I could create the seg fault by passing the partial function directly to `answer` instead of via `Payload`, but wasn't able to get the type quite right for the parameter.
We really don't need the partial function at all. The same seg fault can be found by having `Payload` directly reference the `_ServerConnection` object:
```pony
actor _ServerConnection
  be dispatch(request: Payload) =>
    request.handler = recover this end
    this.answer(consume request)

  be answer(handler: Payload val) => None

class iso Payload
  var handler: (_ServerConnection | None) = None
```
So this last 8-line example is enough on its own to cause a segfault? Nice narrowing down of the issue, @Perelandric.
@SeanTAllen Thanks, and yes, with the `Main` and `Test` actors from the example above it, it should reproduce the seg fault. I'll put a full version of the smallest example in this comment.
It'll be interesting to see if this actually fixes the httpserver example. There was some other odd behavior that I noted in issue #937 which is where my investigation began, but this is certainly a piece of the puzzle if not the entire thing.
Here's the most simplified yet complete code that I could find to reproduce the issue:
```pony
actor Main
  new create(env: Env) =>
    let t = Test
    for i in Range(0, 10_000) do
      t.do_it()
    end

actor Test
  be do_it() => _ServerConnection.dispatch(Payload)

class iso Payload
  var handler: (_ServerConnection | None) = None

actor _ServerConnection
  be dispatch(p: Payload) =>
    p.handler = recover this end
    this.answer(consume p) // <-- Pass the Payload that references this _ServerConnection

  be answer(p: Payload val) => None
```
Notes:

- The seg fault still occurs if the `_ServerConnection` is assigned to the `Payload` via a partial function binding instead of being assigned directly.
- If the `Payload` object is created directly in `dispatch()` instead of at the call location in `do_it()`, there's no seg fault.
- If `None` is assigned to the `.handler` field instead of the `_ServerConnection` object, there's no seg fault.
- The seg fault reproduces more reliably with a `Range` of `10_000` than with `100`.

Have you tried running a debug version in LLDB to see what is causing the segfault?
@SeanTAllen Never used LLDB but this'll be as good a reason as any to tinker. I'll see if I can give it a shot tomorrow.
EDIT: Output from my first, mostly blind use of LLDB
```
(lldb) target create "./test"
Current executable set to './test' (x86_64).
(lldb) run
Process 18472 launched: './test' (x86_64)
Process 18472 exited with status = 0 (0x00000000)
(lldb) run
Process 18485 launched: './test' (x86_64)
Process 18485 stopped
* thread #2: tid = 18490, 0x0000000000408063 test`pony_traceunknown + 3, name = 'test', stop reason = signal SIGSEGV: invalid address (fault address: 0x40)
    frame #0: 0x0000000000408063 test`pony_traceunknown + 3
test`pony_traceunknown:
->  0x408063 <+3>:  cmpq   $0x0, 0x40(%rax)
    0x408068 <+8>:  je     0x408070  ; <+16>
    0x40806a <+10>: movq   0x18(%rdi), %rax
    0x40806e <+14>: jmpq   *%rax
(lldb) bt
* thread #2: tid = 18490, 0x0000000000408063 test`pony_traceunknown + 3, name = 'test', stop reason = signal SIGSEGV: invalid address (fault address: 0x40)
  * frame #0: 0x0000000000408063 test`pony_traceunknown + 3
    frame #1: 0x00000000004020ef test`Payload_Trace + 15
    frame #2: 0x000000000040e360 test`ponyint_gc_handlestack + 64
    frame #3: 0x0000000000407fa0 test`ponyint_mark_done + 32
    frame #4: 0x0000000000409304 test`ponyint_actor_run + 772
    frame #5: 0x0000000000405b85 test`run + 69
    frame #6: 0x0000000000405edd test`run_thread + 29
    frame #7: 0x00007ffff7bc4184 libpthread.so.0`start_thread + 196
    frame #8: 0x00007ffff74d737d libc.so.6`clone + 109
(lldb)
```
Had trouble setting breakpoints.
@SeanTAllen One more observation: it seems if the parameter of the `answer()` method is `tag` or `iso`, there's no segfault. Only a problem with `val`.

```pony
// -------------------v--- iso and tag are OK, but not val
be answer(p: Payload val) => None
```
Just out of curiosity, is there any news on this one? After watching the VUG about garbage collection I wondered if the error should be looked for in that area. Just a wild guess.
Nothing yet.
Difficult one I guess
And we could use more folks contributing. It's hard to keep up and do the day job and all that. Such is things... ¯\\_(ツ)_/¯
Hrm ... I just tried @Perelandric's minimal example from August 8 and I didn't get a seg fault. I tried both a debug and release build on OS X (10.11.4) using compiler version 0.3.0-14-gd8c6c55.
@perelandric can you see if this still happens for you?
@SeanTAllen Yep, took a few extra runs before it would segfault, but it got there.
@aturley Try upping the range from `10_000` to `100_000`. That gives me the segfault more reliably.
I got a little more information from lldb:
```
(lldb) run
Process 12481 launched: './build/issue-1118' (x86_64)
Process 12481 stopped
* thread #2: tid = 0x3422ef, 0x000000010000b862 issue-1118`pony_traceunknown(ctx=0x00000001097ffc48, p=0x00000001217dd000, m=2) + 34 at trace.c:114, stop reason = EXC_BAD_ACCESS (code=1, address=0x40)
    frame #0: 0x000000010000b862 issue-1118`pony_traceunknown(ctx=0x00000001097ffc48, p=0x00000001217dd000, m=2) + 34 at trace.c:114
   111 	{
   112 	  pony_type_t* t = *(pony_type_t**)p;
   113
-> 114 	  if(t->dispatch != NULL)
   115 	  {
   116 	    ctx->trace_actor(ctx, (pony_actor_t*)p);
   117 	  } else {
```
Something is going a little sideways here in GC land.
Yeah. This is waiting on @sylvanc or me having time to address it. At the moment, I'm working on the new website before returning to the land of pony code, so I'm not going to get to this until sometime in mid-October. If anyone wants to take a shot at it before then, have at it.
Hi, seems like by a happy coincidence it's just a simple typo.
```diff
- pony_type_t* t = *(pony_type_t**)p;
+ pony_type_t* t = (pony_type_t*)p;
```
After a quick fix all is fine. Here is the cure https://github.com/ponylang/ponyc/pull/1249
@dimiii all CI jobs failed with a segfault with that change applied. I left some questions for you on #1249.
It looks like this is an uninitialised type descriptor pointer problem. I'll try digging into it.
I think I got it. This is a premature free issue.
Full tracing of `val` objects sent in messages is deferred until a GC cycle on the owner. This means the `_ServerConnection` assigned to the `Payload` and then sent in a message (via the `Payload`) has a reference count greater than 0 but isn't aware of it until the `Payload` is traced. By then, the `_ServerConnection` might have been GC'd, which of course results in bad stuff happening.
@Praetonus @SeanTAllen
BTW, do you have plans to bootstrap the compiler? I mean the compiler and runtime.
@dimiii - Building a Pony compiler in Pony is a long-term goal, though it represents a major amount of work. There are no plans to write the runtime in Pony, as it is built on lower-level constructs that would be impractical to provide in Pony.
@dimiii re: the runtime point from @jemc: from a safety standpoint, a Pony runtime written in something like Rust would be awesome. However, C is far more accessible to the community, so the plan is to stay with C for the runtime for the foreseeable future.
@Praetonus should this be relabeled from "ready for work"?
I'm not sure. I can see a fix for this but I think it would have some performance penalty. Maybe removing it until we've discussed it on the sync call would make sense.
Stability should be more important, imho. I would not advocate using a tool for server applications if it could occasionally crash.
Especially when the design of the language is to eliminate crashes caused by the programmer.
No one is suggesting not fixing it. The concern is whether we can find a way to fix it without a performance impact.
@peterzandbergen - it definitely still needs to be fixed, and it's still marked as high priority. Runtime segmentation faults are not acceptable.
Ok, issue located. @sylvanc will be working on this. Thank you everyone who contributed to finding this. Extra props to @Perelandric for lots of narrowing down a minimal case and @Praetonus for digging into the code.
Yes, thank you VERY much! This is an important and subtle and tricky one.
Well done, never doubted that this would be picked up in a serious way. Kudos.
I'm trying to verify that #1507 fixes this issue but I'm unable to reproduce the reported problem.
OSX El Capitan. LLVM 3.9.1
@Praetonus @Perelandric @aturley can any of you reproduce?
Both the original code and the most simplified code here give me "can't find declaration of 'Range'". Why?
@jkleiser You have to add `use "collections"` at the beginning of the code.
Thanks, @Praetonus. The simplified code now compiled and ran without problems on my Mac with ponyc 0.10.0-74555f7.
The simplified code works on my Ubuntu machine, both optimized and with `ponyc --debug`:

```
0.10.0-1c33065 [release] compiled with: llvm 3.9.0 -- gcc-5 (Ubuntu 5.4.1-2ubuntu1~14.04) 5.4.1 20160904
```

However, when I run the original code I get `Segmentation fault (core dumped)`.
I have noticed that the issue does not occur in my environment anymore, being Ubuntu MATE in VirtualBox. Will test again this week to see if the problem has been resolved.
@peterzandbergen See the discussion at: https://github.com/ponylang/ponyc/pull/1507
Just an update. @sylvanc is still working on a fix for this.
Original version still segfaults as of 0.21.0
The original version of this segfaults if I add some `@printf` output that slows everything down some, but only when doing an optimized build:

```
0.32.0-a7a80758 [release] compiled with: llvm 7.0.1 -- cc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0 Defaults: pic=true
```
The original version of this segfaults for me but none of the others. I looked at it for a while and I'm somewhat confused. I hope to be able to spend more time on it and maybe get some more eyes on it.
The original version that segfaults doesn't if the usage of `Payload val` is changed to `Payload` so that an `iso` never becomes a `val`. There's still a bug there.
works:
```pony
use "collections"

interface val ResponseHandler
  fun val apply(request: Payload, response: Payload): Any

interface val RequestHandler
  fun val apply(request: Payload): Any

primitive Handle
  fun val apply(request: Payload) =>
    (consume request).respond(Payload)

actor _ServerConnection
  let _handler: RequestHandler

  new create(h: RequestHandler) => _handler = h

  be dispatch(request: Payload) =>
    request.handler = recover this~answer() end
    _handler(consume request)

  be answer(request: Payload, response: Payload) =>
    None

class iso Payload
  var handler: (ResponseHandler | None) = None

  fun iso respond(response': Payload) =>
    try
      let h = (handler) as ResponseHandler
      h(consume this, consume response')
    end

actor Main
  new create(env: Env) =>
    let t = Test
    for i in Range(0, 10_000) do
      t.do_it()
    end

actor Test
  be do_it() =>
    _ServerConnection(Handle).dispatch(Payload)
```
This probably means that Benoit's statement here:

> I think I got it. This is a premature free issue.
>
> Full tracing of val objects sent in messages is deferred until a GC cycle on the owner. This means the _ServerConnection assigned to the Payload and then sent in a message (via the Payload) has a reference count greater than 0 but isn't aware of it until the Payload is traced. By then, the _ServerConnection might have been GC'd, which of course results in bad stuff happening.

might still be true.
When I look at it in the debugger, it sure seems like what we are pointing at in `pony_traceunknown` is now junk. That said, I'm not positive yet.
Here's an updated example for what still fails:
```pony
use "collections"

interface val ResponseHandler
  fun val apply(request: Payload val, response: Payload): Any

interface val RequestHandler
  fun val apply(request: Payload): Any

primitive Handle
  fun val apply(request: Payload) =>
    (consume request).respond(Payload)

actor _ServerConnection
  let _handler: RequestHandler

  new create(h: RequestHandler) => _handler = h

  be dispatch(request: Payload) =>
    request.handler = recover this~answer() end
    _handler(consume request)

  be answer(request: Payload val, response: Payload) =>
    None

class iso Payload
  var handler: (ResponseHandler | None) = None

  fun iso respond(response': Payload) =>
    try
      let h = (handler) as ResponseHandler
      h(consume this, consume response')
    end

actor Main
  new create(env: Env) =>
    let t = Test
    for i in Range(0, 1_000_000) do
      t.do_it()
    end

actor Test
  be do_it() =>
    _ServerConnection(Handle).dispatch(Payload)
```
The important part of this is that the `request` object for `ResponseHandler` is going from an `iso` when sent to a `val` on receive. If it isn't a `val` on receive, then all is good.
I'm not sure exactly where the bug is; however, an important difference introduced here is that in `recv_remote_object` we do not recurse into the object if it is immutable. I.e., this code:

```c
if(!obj->immutable)
  recurse(ctx, p, t->trace);
```

is executed when `Payload` is also an `iso`, and then we have no crash.
Here's the backtrace:
```
* thread #8, name = '1118-1', stop reason = signal SIGSEGV: invalid address (fault address: 0x58)
  * frame #0: 0x00000000004158ab 1118-1`pony_traceunknown(ctx=0x00007ffff7c22108, p=0x00007fffde3f1400, m=1) at trace.c:127:18
    frame #1: 0x000000000041360e 1118-1`ponyint_gc_handlestack(ctx=0x00007ffff7c22108) at gc.c:677:5
    frame #2: 0x0000000000415629 1118-1`ponyint_mark_done(ctx=0x00007ffff7c22108) at trace.c:75:3
    frame #3: 0x000000000040c415 1118-1`try_gc(ctx=0x00007ffff7c22108, actor=0x00007fffe6c20800) at actor.c:301:3
    frame #4: 0x000000000040ba04 1118-1`ponyint_actor_run(ctx=0x00007ffff7c22108, actor=0x00007fffe6c20800, polling=false) at actor.c:373:7
    frame #5: 0x000000000041a9ba 1118-1`run(sched=0x00007ffff7c220c0) at scheduler.c:960:23
    frame #6: 0x0000000000419ed3 1118-1`run_thread(arg=0x00007ffff7c220c0) at scheduler.c:1010:3
    frame #7: 0x00007ffff7fa8609 libpthread.so.0`start_thread + 217
    frame #8: 0x00007ffff7d48293 libc.so.6`__clone + 67
```
In `pony_traceunknown` the first thing we do is:

```c
pony_type_t* t = *(pony_type_t**)p;
```

And in the debugger, when trying to see what `t` is pointing at, we get:

```
(lldb) p t
(pony_type_t *) $3 = 0x0000000000000010
(lldb) p *t
error: Couldn't apply expression side effects : Couldn't dematerialize a result variable: couldn't read its memory
```

i.e., that sure looks like junk by that point.
I've been playing around with this, and sometimes I get this assert on line 755 of actor.c to trigger:

```c
// Make sure we're not trying to send a message to an actor that is about
// to be destroyed.
pony_assert(!ponyint_actor_pendingdestroy(to));
```
I also sometimes get a crash in the atomics for sending a message itself.
I've been using a debug build of the runtime and program with `--ponygcinitial=1` and `--ponygcfactor=1`.
At least once, the kaboom message was a GC acquire message:

```
* thread #4, name = '1118', stop reason = signal SIGABRT
  * frame #0: 0x00007ffff7c6c18b libc.so.6`raise + 203
    frame #1: 0x00007ffff7c4b859 libc.so.6`abort + 299
    frame #2: 0x000000000041d47a 1118`ponyint_assert_fail(expr="!ponyint_actor_pendingdestroy(to)", file="/home/sean/code/ponylang/ponyc/src/libponyrt/actor/actor.c", line=762, func="pony_sendv") at ponyassert.c:65:3
    frame #3: 0x0000000000410d2b 1118`pony_sendv(ctx=0x00007ffff7c21a08, to=0x00007ffff7c20a00, first=0x00007fffde41f0a0, last=0x00007fffde41f0a0, has_app_msg=false) at actor.c:762:3
    frame #4: 0x000000000041121b 1118`pony_sendp(ctx=0x00007ffff7c21a08, to=0x00007ffff7c20a00, id=4294967291, p=0x00007fffde41ec40) at actor.c:889:3
    frame #5: 0x0000000000417c9a 1118`ponyint_gc_sendacquire(ctx=0x00007ffff7c21a08) at gc.c:797:5
```
Ok, so the "source" of our problem is the large block of code here:

https://github.com/ponylang/ponyc/blob/main/src/libponyrt/actor/actor.c#L444

where the actor decides it can delete itself locally because it sees no references to itself. Except there's still a reference to it in a message, and when the receiver sends an acquire message back, we go kaboom if we have already deleted ourselves or otherwise marked ourselves for deletion (which can happen in the debug case).

We apparently do not see the reference to ourselves when we send (we probably aren't tracing?), and then bad things happen. Given that changing the receiver's parameter from a `val` to an `iso` avoids the crash, the general path suggests we are missing noticing that the reference is to us.

More to look into, but a start.
Commenting out the entire block that starts at line 444 of actor.c "fixes" this crash, but it creates other issues with memory growth etc. that the local delete is meant to address.
_EDIT:_ Skip down to https://github.com/ponylang/ponyc/issues/1118#issuecomment-238431412 to see the most reduced example of the issue.

I gutted the HTTP server down to this, which I think is a reproduction of the seg fault in #937. I kept the original type names so that they could be somewhat related back to that package if necessary. It seems to have something to do with the partial function `this~answer()`; at least, if I interrupt anything after that assignment, the seg fault disappears. This could probably be further reduced, but I wanted to maintain at least a slight semblance to the original code... and it's the middle of the night, so I'm going to :sleeping:.