Open arthurp opened 5 years ago
Some notes about my work on this:
The issue is almost certainly a case of a missing haltToken on a counter, not an extra newToken. I think John already concluded this, but I can confirm it because adding yields (Thread.yield() calls to change thread interleaving) around haltToken causes the hang to happen less often (a lot less in my test case), whereas adding yields around newToken doesn't.
This result also implies that the hang case involves code around a haltToken winning a race with another thread. I was not able to find code where adding yield makes the hang more likely, so I don't know anything about the other side of the race.
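For anyone who wants to repeat the yield experiment, here is a minimal sketch of what I mean by "adding yields around" a call. `YieldingCounter` and its fields are hypothetical stand-ins, not the real Orc/Porc counter class; only the `newToken`/`haltToken` names come from the notes above, and the starting count of one token is an assumption.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a token counter; NOT the real Orc/Porc implementation. */
final class YieldingCounter {
    // Assumption: the counter starts out holding one token.
    private final AtomicInteger tokens = new AtomicInteger(1);
    private final boolean yieldAroundHalt;
    private final boolean yieldAroundNew;

    YieldingCounter(boolean yieldAroundHalt, boolean yieldAroundNew) {
        this.yieldAroundHalt = yieldAroundHalt;
        this.yieldAroundNew = yieldAroundNew;
    }

    /** Adds a token, optionally yielding before and after to perturb the interleaving. */
    void newToken() {
        if (yieldAroundNew) Thread.yield();
        tokens.incrementAndGet();
        if (yieldAroundNew) Thread.yield();
    }

    /** Removes a token, optionally yielding before and after to perturb the interleaving. */
    void haltToken() {
        if (yieldAroundHalt) Thread.yield();
        tokens.decrementAndGet();
        if (yieldAroundHalt) Thread.yield();
    }

    /** True once every token has been halted. A missing haltToken keeps this false forever. */
    boolean isHalted() {
        return tokens.get() == 0;
    }
}
```

Toggling the two flags independently is what lets you compare "yields around haltToken" against "yields around newToken" without touching anything else.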
The hang only requires 2 software threads and 1 hardware thread. In fact, the hang was most likely (at least with my configuration and test program) when running with a fixed 2 worker threads and all threads pinned to a single core. This implies that the race does not require tight interleaving (where one thread runs only a few instructions before control jumps back to the other).
The hang does not require that the code be compiled by Truffle. Kinda obvious given how fast it happens, but maybe useful to rule out some issues.
The hang only seems to happen during the return from recursive functions. I'm not sure of this (it could be the test case we are using), but there are definitely hints in terms of what is required to make the hang occur. Function calls in general have a lot of effect. For instance, some Orc functions are required for the hang (inlining the Orc function body eliminates the hang). This may imply that the issue has to do with how Orc functions are encoded in Porc.
Adding Sequentialize() (which disables the creation of most spawns in its scope, including those from parallel) in some parts of the code doesn't change anything, but in others it will eliminate the hang.
In some cases I found that a print would not prevent the hang, but a Logger call would. This makes me think that the locking involved in Logger is critical to preventing the hang.
Things that I think should be investigated further:
https://gist.github.com/arthurp/85482e7768df3717bc2847f0f10e6f39
I wrote a script which runs the test repeatedly and modified the test case somewhat. I'm sure it will need hacking to make it work for other people, but it should provide a start.
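If the script in the gist doesn't work for you, the basic idea of "run it many times and count the runs that never finish" can be sketched as below. This is not the script from the gist; `HangHunter` and `countHangs` are hypothetical names, and it runs an in-process `Runnable` rather than launching the real test, so treat it as a starting point only.

```java
/** Hypothetical repeat-run harness; NOT the script from the gist. */
final class HangHunter {
    /**
     * Runs the task `runs` times, each on a fresh daemon thread, and returns
     * how many runs were still alive after `timeoutMillis` (counted as hangs).
     * Note: timeoutMillis must be > 0, because join(0) waits forever.
     */
    static int countHangs(Runnable task, int runs, long timeoutMillis) {
        int hangs = 0;
        for (int i = 0; i < runs; i++) {
            Thread t = new Thread(task);
            t.setDaemon(true); // a hung run must not keep the JVM alive
            t.start();
            try {
                t.join(timeoutMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status and stop early
                return hangs;
            }
            if (t.isAlive()) {
                hangs++;
                t.interrupt(); // best effort; a truly hung thread may ignore this
            }
        }
        return hangs;
    }
}
```

The timeout has to be well above the normal run time of the test, otherwise slow runs get counted as hangs.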
The cause is currently unknown. TODO: Fill in details of the problem here.