orc-lang / orc

Orc programming language implementation
https://orc.csres.utexas.edu/
BSD 3-Clause "New" or "Revised" License

PorcE hang due to missing haltToken execution #223

Open arthurp opened 5 years ago

arthurp commented 5 years ago

The cause is currently unknown. TODO: Fill in details of the problem here.

arthurp commented 5 years ago

Some notes about my work on this:

The issue is almost certainly a case of a missing haltToken on a counter, not an extra newToken. I think John already concluded this, but I can confirm it: adding yields (Thread.yield() calls, to change the thread interleaving) around haltToken causes the hang to happen less often (a lot less in my test case), whereas adding yields around newToken doesn't.

This result also implies that the hang involves code around a haltToken winning a race against another thread. I was not able to find any code where adding a yield makes the hang more likely, so I don't know anything about the other side of the race.
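For reference, the yield injection looks roughly like this (a minimal sketch in Scala; `Counter` and the exact call sites are stand-ins for the real PorcE runtime code, not the actual implementation):

```scala
object YieldFuzz {
  // Stand-in trait for the PorcE runtime counter; the real class and the
  // exact call sites are assumptions.
  trait Counter {
    def haltToken(): Unit
  }

  // Yielding before and after the halt gives the other worker thread a
  // chance to run first, which is presumably why the hang became rarer
  // when yields were injected around haltToken.
  def haltTokenWithYield(counter: Counter): Unit = {
    Thread.`yield`()
    counter.haltToken()
    Thread.`yield`()
  }
}
```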

The hang only requires 2 software threads and 1 hardware thread. In fact, the hang was most likely (at least with my configuration and test program) when running with a fixed pool of 2 worker threads and all threads pinned to a single core. This implies that the race does not require tight interleaving (executing only a few instructions of one thread before jumping back to the other).

The hang does not require that the code be compiled by Truffle. Kind of obvious given how quickly it happens, but it may be useful for ruling out some issues.

The hang only seems to happen during the return from recursive functions. I'm not sure of this (it could just be the test case we are using), but there are definite hints about what is required to make the hang occur. Function calls in general have a large effect. For instance, certain Orc functions are required for the hang to occur (inlining the Orc function body eliminates the hang). This may imply that the issue has to do with how Orc functions are encoded in Porc.

Adding Sequentialize() (which disables the creation of most spawns in its scope, including those from parallel) in some parts of the code doesn't change anything, but in other parts it eliminates the hang.

In some cases I found that a print would not prevent the hang, but a Logger call would. This makes me think that the locking involved in Logger is critical to preventing the hang.
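If it is the lock rather than the output that matters, a probe like the following could help confirm it (a sketch; whether the lock acquisition and its memory fence are the relevant mechanism is an assumption):

```scala
object LockProbe {
  private val lock = new Object

  // Acquire and release a lock without doing any I/O. Dropping this in where
  // the Logger call was should help separate the effect of the lock (and the
  // memory fence it implies) from the effect of the printing itself.
  def probe(): Unit = lock.synchronized { () }
}
```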

Things that I think should be investigated further:

  1. Investigate when HaltException is being thrown and caught. The code to catch this is a bit ad hoc, so it's possible a HaltException is being mishandled.
  2. Trace the PorcE nodes which execute in each thread by storing the node pointer itself in a thread-local array (see the sketch after this list). This would be very fast (so it would probably not prevent the hang) and would give a complete view of the PorcE-level execution trace in both threads. This should allow localizing the race fairly accurately, but it would require an estimated 2 hours of coding to capture the data and several days to run it on different programs and analyze the resulting data to figure out what the critical part of the scheduling pattern is.
  3. Figuring out the other side of the race (the code that is racing with the haltToken-related code) would probably help a lot. This could be investigated by "fuzzing" with yields until adding a yield makes the hang more frequent.
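For point 2, the per-thread trace could look roughly like this (a sketch; the node type is left as AnyRef rather than the real PorcE node class, and the buffer size and call placement are assumptions):

```scala
object NodeTrace {
  // Power-of-two size so wrapping the index is a single mask operation.
  final val Size = 4096

  final class Ring {
    val nodes = new Array[AnyRef](Size)
    private var next = 0
    def record(node: AnyRef): Unit = {
      nodes(next & (Size - 1)) = node
      next += 1
    }
  }

  // One ring buffer per thread, so recording never contends with other threads.
  private val ring = new ThreadLocal[Ring] {
    override def initialValue(): Ring = new Ring
  }

  // Call at the top of each PorcE node's execute method; this is just an
  // array store plus an increment, so it should be cheap enough not to
  // change the interleaving much. Dump the buffers once the hang is detected.
  def record(node: AnyRef): Unit = ring.get().record(node)
}
```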

My testing configuration

https://gist.github.com/arthurp/85482e7768df3717bc2847f0f10e6f39

I wrote a script which runs the test repeatedly and modified the test case somewhat. I'm sure it will need hacking to make it work for other people, but it should provide a start.