Closed tudborg closed 8 years ago
Your "isolated test case" works as expected, so were you able to run exec:run("echo foobar", [monitor]).
and never receive a DOWN
message in the caller?
It doesn't look to me as a race. The problem with that test case is that it sets a deadline at 5s to receive the completion of the echo ok
command exec.erl:1043
, which doesn't happen. This could be due to the fact that there are no CPU resources on that AWS box, and it doesn't get to run the task in allocated time. Try to increase that timeout in the test from 5000 to, say, 15000, and see.
Ah, I never thought of that. 😧 You are probably right. I went down the wrong rabbit hole.
Will test and get back.
On Aug 26, 2016 18:22, "Serge Aleynikov" notifications@github.com wrote:
Your "isolated test case" works as expected, so were you able to run exec:run("echo foobar", [monitor]). and never receive a DOWN message in the caller?
It doesn't look to me as a race. The problem with that test case is that it sets a deadline at 5s to receive the completion of the echo ok command exec.erl:1043, which doesn't happen. This could be due to the fact that there are no CPU resources on that AWS box, and it doesn't get to run the task in allocated time. Try to increase that timeout in the test from 5000 to, say, 15000, and see.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/saleyn/erlexec/issues/83#issuecomment-242735341, or mute the thread https://github.com/notifications/unsubscribe-auth/AAL7jG2Us9CZnllB2GYnOFmaD4APjYkyks5qjuoegaJpZM4Jt8jP .
So I poked around some more. It doesn't seem to be a timeout issue.
In my own tests and the test suite,
I do get a DOWN
, but it's a noproc
, not a normal
as expected.
My initial gist also shows this
{'DOWN',#Ref<0.0.1.102>,process,<0.55.0>,noproc}
I took the test suite and commented out all tests except test_monitor()
, which looks like this
test_monitor() ->
{ok, P, _} = exec:run("echo ok", [{stdout, null}, monitor]),
?receiveMatch({'DOWN', _, process, P, normal}, 5000).
I also tried increasing the timeout to 30000, but the result is the same. Checking the machine while the test is running, close to no cpu and memory is used, so it's not busy at all.
The ?receiveMatch
never completes.
However, if I change the normal
to noproc
, it matches immediately.
What makes this seem race-like, is that sometimes I get a normal
reason from the DOWN.
if I start exec
with {verbose, true}
, it always works (at least, I havn't been able to get noproc
with verbose
flag set. So I figured the verbose printing reduced the race.
In the test_monitor()
I wrote a receive
that accepts any DOWN
reason and prints it out.
Here is what I get
ubuntu@ip-10-0-71-110:/tmp/erlexec-bug-hut/erlexec$ make test
===> Verifying dependencies...
===> Compiling erlexec
===> Performing EUnit tests...
Redirecting [stdin -> null]
Redirecting [stdout -> null]
Redirecting [stderr -> null]
Starting child: 'echo ok'
child = (stdin=null(fd:5), stdout=null(fd:5), stderr=null(fd:5))
parent = (stdin=none, stdout=none, stderr=none)
Args[0]: /bin/bash
Args[1]: -c
Args[2]: echo ok
Child process 11464 exited
* Process 11464 (ret=11464, status=0, sig=17, oktojump=1, exited_count=0 [exited])
/tmp/erlexec-bug-hut/erlexec/_build/test/lib/erlexec/src/exec.erl:1043:<0.101.0>:
DOWN reason noproc
.Broken Erlang command pipe (3): No such process
Setting alarm to 12 seconds
Got signal: 15 (oktojump=0)
Exiting (3)
Top 1 slowest tests (0.004 seconds, 4.3% of total time):
exec:exec_test_/0:1025
0.004 seconds
Finished in 0.094 seconds
1 tests, 0 failures
Here you see my print-out:
DOWN reason noproc
Here is what I changed the test_monitor()
to
test_monitor() ->
{ok, P, _} = exec:run("echo ok", [{stdout, null}, monitor]),
receive {'DOWN', _, process, P, Reason} -> ?debugFmt("~nDOWN reason ~p~n", [Reason])
after 5000 -> ?debugMsg("timeout")
end.
So I do receive the down, it's just a noproc, which made me think that it is a race.
It only seems to happen when the thing I run completes really fast (like echo ok
in your test),
or the machine is really busy.
So,
tl;dr:
I sometimes receive a DOWN
noproc
reason instead of a normal
or {exit_status, Status}
reason.
Yes, it was a race condition. I pushed the fix.
:+1:
Looking at the fix ( for reference: https://github.com/saleyn/erlexec/commit/bbc9297886828b19df0605370e122c2002736fc0#diff-dadb34d095a4c6a877818436e0295d69R740 )
What if we need to know if the external process exited with 0
or 1
?
Wouldn't this fix always look like a success due to returning normal
reason?
The underlying issue and the proper fix is actually more complex. This is a temporary bandaid until I can think of a better solution.
The real problem is that erlang doesn't have a way of setting a monitor on behalf of a process, only the process itself can set up a monitor, and in the erlexec's architecture, creation of the erlang process associated with the OS pid that needs to be monitored happens in the context of the erlexec's gen_server. So only [link]
option is truly reliable, because that one is initiated by the linked process, with proper guarantees. Likely, instead of using the monitor/2
function the propagation of exit has to be emulated in the ospid_loop by explicitly sending a message to the caller's Pid. Not that I like that solution either, but so far it's a more reliable option.
I actually managed to get the proper fix in that eliminates the root cause of the race. Please check.
Awesome!
I have automated running the test_monitor()
until it fails, so right now I'm waiting for an actual failure. (I've seen the timeout a few times, but this time I am pushing the machine, so I have increased the timeout and am now explicitly logging the DOWN
message)
I havn't seen any races yet. I'll let this run for an additional half hour and see how it turns out, but as far as I can tell, the bug is fixed 👍
I have been testing for about an hour in total, and havn't seen the bug. It seems like everything is running as expected.
I think I've found a race in erlexec, but I have only been able to reproduce it on my linux server. On my exact box the race happens 50% of the time. erlexec tests also fails half the time.
The longer the external command takes, the less likely it seems to trigger. Here is some log output from both my own poking, and the failing test case:
https://gist.github.com/tbug/85c1378865bdd41d1e358452b5c0f43c
The box is an AWS
m3.medium
running Ubuntu 14.04I havn't been able to pin-point where the race actually happens.