ponylang / ponyc

Pony is an open-source, actor-model, capabilities-secure, high performance programming language
http://www.ponylang.io
BSD 2-Clause "Simplified" License

Segfault in cycle checker(?) #4193

Closed: derrickturk closed this issue 1 year ago

derrickturk commented 2 years ago

Opening apology: this is a much-reduced version of a solution to Advent of Code's 2019 day 7 puzzle, which is hard to explain in a vacuum but comes down to implementing a tiny VM with async I/O and running multiple instances of it in communication with each other. After cutting out a lot of extraneous things, I'm left with the attached source files and input. I'm reluctant to reduce it further because I'm starting to think the crash has nothing to do with the VM logic itself, but it's challenging to produce relevant input for a cut-down VM.

My notes refer to testing on both Windows 10 and Arch Linux via WSL2. In both cases ponyc is the latest release 0.51.2. This is a 4-core machine.

Run with: ./intpony input.txt

The three observed behaviors are:

The only runtime option I've found to have any effect on this is --ponymaxthreads 1, which on Windows seemingly guarantees the intended output (with fast exit).

Compiled with -d, I get additional possible outcomes including mismatches between the count of RUN outputs and DONE outputs. (This should not happen given the input - each VM/Cpu should run to successful halt.)

The program usually ends in a segfault, with either the release or debug binary. It's often reported (by gdb) in pony_os_peername on Windows, and in Array_I64_val_Trace on WSL2. Oddly, runs with no segfault also produce no output, while successful runs produce output before segfaulting. With --ponymaxthreads 1 there is no segfault on Windows, but I still get segfaults on WSL2.

The test program creates 100 Cpu actors total, each with a 519-"word" memory (i.e. an Array[I64] with 519 entries). This can be adjusted; it seems that segfaults get more likely as the number goes up. I've never seen a segfault with only 1 or 2 actors, but I have with 4 or 5.
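
For concreteness, each Cpu's "memory" is nothing exotic; it is roughly a field like the following (a sketch only; the real field lives in the code in the attached zip):

    // hypothetical shape of the per-Cpu memory field
    let memory: Array[I64] = Array[I64].init(0, 519)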

Full stack trace from a crash on WSL2:

#0  0x0000555555568160 in Array_I64_val_Trace ()
#1  0x000055555556895d in Array_u3_t2_$1$10_iso_$1$10_iso_collections__MapEmpty_val_collections__MapDeleted_val_Trace
    ()
#2  0x00007fffe6807600 in ?? ()
#3  0x00007fffe657d600 in ?? ()
#4  0x00007fffffffe750 in ?? ()
#5  0x00007ffff7c91c48 in ?? ()
#6  0x00007ffff7c91e00 in ?? ()
#7  0x0000000000000700 in ?? ()
#8  0x000055555557d534 in ponyint_actor_final ()
#9  0x000055555557f07b in ponyint_cycle_terminate ()
#10 0x000055555558726f in ponyint_sched_shutdown ()
#11 0x0000555555585acf in ponyint_sched_start ()
#12 0x00005555555877d9 in pony_start ()
#13 0x000055555557ce17 in main ()

I've also seen:

#0  0x00007fffe671ccc0 in ?? ()
#1  0x000055555557d534 in ponyint_actor_final ()
#2  0x000055555557f07b in ponyint_cycle_terminate ()
#3  0x000055555558726f in ponyint_sched_shutdown ()
#4  0x0000555555585acf in ponyint_sched_start ()
#5  0x00005555555877d9 in pony_start ()
#6  0x000055555557ce17 in main ()

The plot thickened with a suggestion on Zulip to run with --ponynoblock, disabling the cycle checker (IIUC). This resulted in "dropped output" about 50-75% of the time (the rate possibly depending on whether it was running under a debugger), but no segfaults.
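
For completeness, the invocations being compared look like this (the --pony* options are consumed by the Pony runtime at startup rather than by the program itself):

    ./intpony input.txt                      # default runtime options
    ./intpony input.txt --ponymaxthreads 1   # single scheduler thread
    ./intpony input.txt --ponynoblock        # cycle checker disabled (IIUC)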

intpony.zip

SeanTAllen commented 2 years ago

@derrickturk have you had a chance to investigate this at all? I know from Zulip you were going to look into it some.

If yes, any findings? If no, do you need assistance?

SeanTAllen commented 2 years ago

@derrickturk I believe we discussed this some in Zulip. Can you bring the relevant discussion points over to this issue so they can help others in the future, if anyone else tries to tackle this?

derrickturk commented 1 year ago

Sharing my notes here from last month after a long delay. I spent some time back in October poking at this after discussing some ideas on Zulip.

I've been testing this all with a debug build of ponyc, in an Arch setup on WSL2:

~/.local/bin/bin/ponyc --version
0.51.3-2665ee7c [debug]
Compiled with: LLVM 14.0.3 -- Clang-14.0.6-x86_64
Defaults: pic=true

Executions of my test program (same as above) are ending in a segmentation fault most of the time, with the full expected output. Sometimes I get the full expected output with no segmentation fault; every once in a while I see no final output and no segmentation fault.

Segmentation faults appear in the tracing functions invoked via ponyint_actor_final, which is in turn invoked via the cycle checker:

Process 8020 stopped
* thread #1, name = 'intpony', stop reason = signal SIGSEGV: invalid address (fault address: 0x0)
    frame #0: 0x000055555559f201 intpony`Array_I64_val_Trace + 1
intpony`Array_I64_val_Trace:
->  0x55555559f201 <+1>:  movq   0x18(%rsi), %rsi
    0x55555559f205 <+5>:  callq  0x5555555f85a0            ; pony_trace at trace.c:104
    0x55555559f20a <+10>: popq   %rax
    0x55555559f20b <+11>: retq
(lldb) bt
* thread #1, name = 'intpony', stop reason = signal SIGSEGV: invalid address (fault address: 0x0)
  * frame #0: 0x000055555559f201 intpony`Array_I64_val_Trace + 1
    frame #1: 0x00005555555f85cb intpony`pony_trace(ctx=0x00007fffe68ac600, p=0x4000007fffe66cb8) at trace.c:105:3
    frame #2: 0x000055555559f20a intpony`Array_I64_val_Trace + 10
    frame #3: 0x00005555555f85cb intpony`pony_trace(ctx=0x00007fffe68ac600, p=0x00007fffe66cc801) at trace.c:105:3
    frame #4: 0x000055555559f77c intpony`Array_u3_t2_$1$10_iso_$1$10_iso_collections__MapEmpty_val_collections__MapDeleted_val_Trace + 28
    frame #5: 0x00005555555ef11b intpony`ponyint_actor_final(ctx=0x00007ffff7cbbc48, actor=0x00007fffe68ac600) at actor.c:859:5
    frame #6: 0x00005555555f2b47 intpony`final(ctx=0x00007ffff7cbbc48, self=0x00007ffff7cbbe00) at cycle.c:893:7
    frame #7: 0x00005555555f29d3 intpony`ponyint_cycle_terminate at cycle.c:1183:3
    frame #8: 0x00005555555fee5f intpony`ponyint_sched_shutdown at scheduler.c:1143:3
    frame #9: 0x00005555555fd2ef intpony`ponyint_sched_start(library=false) at scheduler.c:1330:5
    frame #10: 0x00005555555ff6f0 intpony`pony_start(library=false, exit_code=0x0000000000000000, language_features=0x00007fffffffe8d0) at start.c:342:7
    frame #11: 0x00005555555ee3cb intpony`main + 267
    frame #12: 0x00007ffff7ce1290 libc.so.6`___lldb_unnamed_symbol3141 + 128
    frame #13: 0x00007ffff7ce134a libc.so.6`__libc_start_main + 138
    frame #14: 0x000055555559da85 intpony`_start at start.S:115

With --ponynoblock, I have not received any segmentation faults, but do sometimes see the final output fail to appear.

Following a suggestion on Zulip from @SeanTAllen, I looked into the possibility of an issue with "orphaned" actors. Setting a breakpoint/watch on pony_create does not show any actors created with orphaned = false. I was unable to find any other code pertinent to the orphaned-actor issue but I'm happy to poke in other places with a little direction.
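
For anyone retracing this, the check was along these lines (the exact breakpoint syntax is my reconstruction, and it assumes a debug runtime build so that the orphaned parameter of pony_create is visible to lldb):

    (lldb) breakpoint set --name pony_create --condition 'orphaned == false'
    (lldb) run input.txt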

I'm interested in helping find the underlying issue with the cycle checker, but I'm equally or more interested in understanding why I see the early-termination behavior. It's possible I've misunderstood the quiescence rules. I am going to try to reduce my example program to something more minimal.

derrickturk commented 1 year ago

Well, I've reduced the program to reproduce this bug, or at least I think I have. The specific manifestation of the segmentation fault varied as I made changes. The current single-file test program is at the end of this post. The current behavior is:

The last backtrace I captured is:

(lldb) run
    Process 15633 launched: '/home/derrick/intpony/intpony' (x86_64)
    Process 15633 stopped
    * thread #1, name = 'intpony', stop reason = signal SIGSEGV: invalid address (fault address: 0x7fff0000008e)
        frame #0: 0x0000555555568aaf intpony`ponyint_actor_final(ctx=0x00007ffff7cbbc48, actor=0x00007fffe6707200) at actor.c:858:25
       855    ctx->current = actor;
       856
       857    // Run the actor finaliser if it has one.
    -> 858    if(actor->type->final != NULL)
       859      actor->type->final(actor);
       860
       861    // Run all outstanding object finalisers.
    (lldb) bt
    * thread #1, name = 'intpony', stop reason = signal SIGSEGV: invalid address (fault address: 0x7fff0000008e)
      * frame #0: 0x0000555555568aaf intpony`ponyint_actor_final(ctx=0x00007ffff7cbbc48, actor=0x00007fffe6707200) at actor.c:858:25
        frame #1: 0x000055555556c4f7 intpony`final(ctx=0x00007ffff7cbbc48, self=0x00007ffff7cbbe00) at cycle.c:893:7
        frame #2: 0x000055555556c383 intpony`ponyint_cycle_terminate at cycle.c:1183:3
        frame #3: 0x0000555555577eff intpony`ponyint_sched_shutdown at scheduler.c:1143:3
        frame #4: 0x000055555557638f intpony`ponyint_sched_start(library=false) at scheduler.c:1330:5
        frame #5: 0x0000555555578790 intpony`pony_start(library=false, exit_code=0x0000000000000000, language_features=0x00007fffffffe8b0) at start.c:342:7
        frame #6: 0x0000555555567d67 intpony`main + 231
        frame #7: 0x00007ffff7ce1290 libc.so.6`___lldb_unnamed_symbol3141 + 128
        frame #8: 0x00007ffff7ce134a libc.so.6`__libc_start_main + 138
        frame #9: 0x0000555555562795 intpony`_start at start.S:115

At one point while reducing the code, when I still had an Array[I64] serving as an input queue for each Cpu, I saw a segmentation fault in a different place:

* thread #1, name = 'intpony', stop reason = signal SIGSEGV: invalid address (fault address: 0x50)
        frame #0: 0x00005555555773cf intpony`final_small(chunk=0x00007fffe6b851c0, force_finalisers_mask=4294967295) at heap.c:196:5
       193      p = chunk->m + (bit << HEAP_MINBITS);
       194
       195      // run finaliser
    -> 196      pony_assert((*(pony_type_t**)p)->final != NULL);
       197      (*(pony_type_t**)p)->final(p);
       198
       199      // clear bit just found in our local finaliser map
    (lldb) bt
    * thread #1, name = 'intpony', stop reason = signal SIGSEGV: invalid address (fault address: 0x50)
      * frame #0: 0x00005555555773cf intpony`final_small(chunk=0x00007fffe6b851c0, force_finalisers_mask=4294967295) at heap.c:196:5
        frame #1: 0x00005555555770f7 intpony`chunk_list(f=(intpony`final_small at heap.c:165), current=0x00007fffe6b851c0, mark=4294967295) at heap.c:377:5
        frame #2: 0x0000555555577230 intpony`ponyint_heap_final(heap=0x00007fffe6c16a40) at heap.c:426:5
        frame #3: 0x000055555556d248 intpony`ponyint_actor_final(ctx=0x00007ffff7cbbc48, actor=0x00007fffe6c16a00) at actor.c:862:3
        frame #4: 0x0000555555570c67 intpony`final(ctx=0x00007ffff7cbbc48, self=0x00007ffff7cbbe00) at cycle.c:893:7
        frame #5: 0x0000555555570af3 intpony`ponyint_cycle_terminate at cycle.c:1183:3
        frame #6: 0x000055555557c66f intpony`ponyint_sched_shutdown at scheduler.c:1143:3
        frame #7: 0x000055555557aaff intpony`ponyint_sched_start(library=false) at scheduler.c:1330:5
        frame #8: 0x000055555557cf00 intpony`pony_start(library=false, exit_code=0x0000000000000000, language_features=0x00007fffffffe8b0) at start.c:342:7
        frame #9: 0x000055555556c477 intpony`main + 231
        frame #10: 0x00007ffff7ce1290 libc.so.6`___lldb_unnamed_symbol3141 + 128
        frame #11: 0x00007ffff7ce134a libc.so.6`__libc_start_main + 138
        frame #12: 0x0000555555564ea5 intpony`_start at start.S:115

I have not seen the original segfault in the trace function for Array[I64] since removing the "memory" array from the Cpu actor.

At this point I am still wondering if this might be related to #1118, as well as how many distinct bugs I am seeing.

use "collections"

actor Main
  new create(env: Env) =>
    let mk = MaxKeeper

    let waiter = CpuWaiter({()(env) =>
      mk.query({(max: (I64 | None)) =>
        match max
        | let m: I64 =>
            env.out.print("Problem 1: " + max.string())
        else
          env.err.print("No max")
        end
      })
    })

    // the larger this number is, the more likely a segfault
    for iter in Range(0, 1000) do
      var cpus = Array[Cpu]
      for i in Range(0, 5) do
        let cpu = Cpu

        if i > 0 then
          try
            cpus(cpus.size() - 1)?.subscribe(cpu)
          end
        end

        cpus.push(cpu)
      end

      try
        cpus(cpus.size() - 1)?.subscribe(mk)
        waiter.wait(cpus(cpus.size() - 1)?)
      end

      try
        cpus(0)?.send(0)
      end

      for cpu in cpus.values() do
        cpu.run()
      end
    end
    waiter.start_waiting()

interface Sendable
  fun tag send(word: I64)

actor Cpu is Sendable
  var _sub: (Sendable tag | None) = None
  var _on_halt: ({ref ()} iso | None) = None
  var _blocked: Bool = false
  var _step: U8 = 1
  var _val: I64 = -1

  be subscribe(rcvr: Sendable tag) =>
    _sub = rcvr

  be subscribe_halt(notify: {ref ()} iso) =>
    _on_halt = consume notify

  be step() =>
    if _blocked then
      return
    end

    match _step
    | 1 =>
        if _val >= 0 then
          _step = 2
        else
          _blocked = true
        end
    | 2 =>
        match _sub
        | let rcvr: Sendable tag => rcvr.send(_val + 1)
        end
        _step = 3
    | 3 =>
        match _on_halt
        | let rcvr: {ref ()} iso => rcvr()
        end
        _step = 4
    end

    if (not _blocked) and (_step < 4) then
      step()
    end

  be run() =>
    if not _blocked then
      step()
    end

  be send(word: I64) =>
    _val = word
    if _blocked then
        _blocked = false
        step()
    end

actor MaxKeeper
  var _max: (I64 | None) = None

  be send(word: I64) =>
    _max = match _max
    | let m: I64 => m.max(word)
    else
      word
    end

  be query(fn: {((I64 | None))} val) =>
    fn(_max)

actor CpuWaiter
  var _waiting: SetIs[Cpu tag] = SetIs[Cpu tag]
  var _wait: Bool = false
  let _when_done: {()} val

  new create(when_done: {()} val) =>
    _when_done = when_done

  be wait(cpu: Cpu tag) =>
    _waiting.set(cpu)
    cpu.subscribe_halt({()(self = recover tag this end) =>
      self._done(cpu)
    })

  be start_waiting() =>
    _wait = true
    if _waiting.size() == 0 then
      _when_done()
    end

  be _done(cpu: Cpu tag) =>
    _waiting.unset(cpu)
    if _wait and (_waiting.size() == 0) then
      _when_done()
    end

derrickturk commented 1 year ago

I should also add that --ponynoblock still seems to remove "segmentation fault" from the set of possible outcomes, but doesn't change the termination behavior.

SeanTAllen commented 1 year ago

@derrickturk can you open an issue for "early termination"? That sounds like a distinctly different problem from the (probably cycle-detector-related) segfault. I haven't paid a lot of attention to what you have reported for it; a good fresh summation on its own would be helpful.

SeanTAllen commented 1 year ago

I have a hard time reproducing the segfault issue using the last example. I have a very easy time reproducing the "early termination" issue. I almost never get output.

I have a much easier time reproducing all three cases with the original example in the zip file.

SeanTAllen commented 1 year ago

Using the code on my cd-contacted-bad branch, I am unable to reproduce the segfault with either example. I was hopeful that would be the case, as I fixed all the race conditions I could find in interactions with the cycle detector.

@derrickturk please do open a new issue for the termination problem and limit the information therein to that problem.

SeanTAllen commented 1 year ago

So the "early termination" problem:

There's no printing going on because, at the time the program ends, the waiting count is still rather high. And it is "all over the place".

Here's some output from my modified version.

[screenshot of debug output showing the remaining waiting count at program exit]

Next up is to determine whether the bug is in the Pony code or in the runtime. I'm assuming the runtime, but I want to rule out the Pony code first.

SeanTAllen commented 1 year ago

@redvers suggested during office hours trying a run with --ponymaxthreads=1 and, lo and behold, no "early termination". Why? Still unknown.

SeanTAllen commented 1 year ago

OK so as stated, the segfault is fixed by forthcoming changes.

The "early termination" is a "bug" in the program. The program works with a single ponythread because, with one thread, the message ordering and scheduling happen to match the sequential thinking behind the code. However, with more than 1 thread, execution is no longer sequential, and invariants that the code assumes but never states do not hold.

In particular, there is an expectation that each Cpu will have processed its "subscribe_halt" message before it reaches step 3, so that every wait call on the CpuWaiter is matched by a later _done call.

However, this will only be true with a single thread, where you happen to get our scheduling rules (which, with a single thread, satisfy most expectations of sequential program order).

However, with more than 1 thread, things can get "out of order" relative to what the program seems to expect. In particular, at the moment that:

    cpus(0)?.send(0)

is first called, our first cpu starts its "steps processing" almost immediately. With a single thread this is not the case and everything is "ordered as expected".

Once the first cpu "starts its processing", there are orderings where wait is called to add the cpu to the waiting set, but that cpu has already reached step 3 before it has received and processed the "subscribe_halt" message. This means there is no longer a 1-to-1 wait-to-_done ratio. What happens then is that when the cpu is at step 3, the _on_halt match fails because _on_halt is still None, so no _done message is sent, which means that this check will not pass:

  be _done(cpu: Cpu tag) =>
    _waiting.unset(cpu)
    if _wait and (_waiting.size() == 0) then
      _when_done()
    end

because _waiting.size() will never be able to reach 0.

For the program to work as expected, the following (which only happens with 1 thread) MUST happen:

within the for iter in Range(0, 1000) do loop... we need this to happen:

then we have:

      for cpu in cpus.values() do
        cpu.run()
      end

which means that while main is still running...

Because of this ordering, and the corresponding orderings that follow from the scheduler implementation, there can never be a mismatch between the set and unset calls in the waiter.

However, in a concurrent world with 2 or more threads, the moment an actor hits scheduler 1's queue it can be stolen by scheduler 2, and then there are many possible orderings, and all of the race conditions in the code around set/unset in the CpuWaiter can happen.

Note that even with the deterministic scheduling of a single thread, "_on_halt" won't be set for 4000 of the cpus (the loop creates 1000 × 5 = 5000 Cpu actors, but only the last cpu of each iteration gets a subscribe_halt). That is OK, because it means it will be set for the other 1000, matching the 1000 wait calls that were made from main.

The easiest fix for the problem is to move the setting of _on_halt to before the first time it is needed, so here in main:

      try
        cpus(cpus.size() - 1)?.subscribe(mk)
        waiter.wait(cpus(cpus.size() - 1)?)
      end

you have access to the waiter, so you can make the subscribe_halt call on the last item in cpus at that point and give it the waiter object that is in scope.
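
A sketch of what that could look like (untested, and only one way to express it): Main sends subscribe_halt to the last cpu itself, before any send/run messages go out, relying on Pony's causal message ordering so the cpu processes it before anything that could advance it to step 3. It assumes that calling the package-private _done from Main's lambda is acceptable (everything lives in one file; otherwise make it public) and that CpuWaiter.wait drops its own subscribe_halt call:

      try
        let last = cpus(cpus.size() - 1)?
        last.subscribe(mk)
        waiter.wait(last)
        // sent from Main before send(0)/run(), so the cpu sees it
        // before anything that could advance it to step 3
        last.subscribe_halt({()(waiter, last) =>
          waiter._done(last)
        })
      end

with CpuWaiter.wait reduced to:

  be wait(cpu: Cpu tag) =>
    _waiting.set(cpu)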

SeanTAllen commented 1 year ago

#4251 fixes the segfault issue.