tests fail on linux. monitor receives noproc instead of normal, possible race?

tudborg commented 8 years ago

I think I've found a race in erlexec, but I have only been able to reproduce it on my linux server. On my exact box the race happens 50% of the time. erlexec tests also fails half the time.

The longer the external command takes, the less likely it seems to trigger. Here is some log output from both my own poking, and the failing test case:

https://gist.github.com/tbug/85c1378865bdd41d1e358452b5c0f43c

The box is an AWS m3.medium running Ubuntu 14.04

$ uname -a
Linux ip-10-0-71-110 3.13.0-93-generic #140-Ubuntu SMP Mon Jul 18 21:21:05 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

I havn't been able to pin-point where the race actually happens.

saleyn commented 8 years ago

Your "isolated test case" works as expected, so were you able to run exec:run("echo foobar", [monitor]). and never receive a DOWN message in the caller?

It doesn't look to me as a race. The problem with that test case is that it sets a deadline at 5s to receive the completion of the echo ok command exec.erl:1043, which doesn't happen. This could be due to the fact that there are no CPU resources on that AWS box, and it doesn't get to run the task in allocated time. Try to increase that timeout in the test from 5000 to, say, 15000, and see.

tudborg commented 8 years ago

Ah, I never thought of that. 😧 You are probably right. I went down the wrong rabbit hole.

Will test and get back.

On Aug 26, 2016 18:22, "Serge Aleynikov" notifications@github.com wrote:

Your "isolated test case" works as expected, so were you able to run exec:run("echo foobar", [monitor]). and never receive a DOWN message in the caller?

It doesn't look to me as a race. The problem with that test case is that it sets a deadline at 5s to receive the completion of the echo ok command exec.erl:1043, which doesn't happen. This could be due to the fact that there are no CPU resources on that AWS box, and it doesn't get to run the task in allocated time. Try to increase that timeout in the test from 5000 to, say, 15000, and see.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/saleyn/erlexec/issues/83#issuecomment-242735341, or mute the thread https://github.com/notifications/unsubscribe-auth/AAL7jG2Us9CZnllB2GYnOFmaD4APjYkyks5qjuoegaJpZM4Jt8jP .

tudborg commented 8 years ago

So I poked around some more. It doesn't seem to be a timeout issue.

In my own tests and the test suite, I do get a DOWN, but it's a noproc, not a normal as expected. My initial gist also shows this

{'DOWN',#Ref<0.0.1.102>,process,<0.55.0>,noproc}

I took the test suite and commented out all tests except test_monitor(), which looks like this

test_monitor() ->
    {ok, P, _} = exec:run("echo ok", [{stdout, null}, monitor]),
    ?receiveMatch({'DOWN', _, process, P, normal}, 5000).

I also tried increasing the timeout to 30000, but the result is the same. Checking the machine while the test is running, close to no cpu and memory is used, so it's not busy at all.

The ?receiveMatch never completes. However, if I change the normal to noproc, it matches immediately.

What makes this seem race-like, is that sometimes I get a normal reason from the DOWN. if I start exec with {verbose, true}, it always works (at least, I havn't been able to get noproc with verbose flag set. So I figured the verbose printing reduced the race.

In the test_monitor() I wrote a receive that accepts any DOWN reason and prints it out. Here is what I get

ubuntu@ip-10-0-71-110:/tmp/erlexec-bug-hut/erlexec$ make test
===> Verifying dependencies...
===> Compiling erlexec
===> Performing EUnit tests...
  Redirecting [stdin -> null]
  Redirecting [stdout -> null]
  Redirecting [stderr -> null]
Starting child: 'echo ok'
  child  = (stdin=null(fd:5), stdout=null(fd:5), stderr=null(fd:5))
  parent = (stdin=none, stdout=none, stderr=none)
  Args[0]: /bin/bash
  Args[1]: -c
  Args[2]: echo ok
Child process 11464 exited
* Process 11464 (ret=11464, status=0, sig=17, oktojump=1, exited_count=0 [exited])
/tmp/erlexec-bug-hut/erlexec/_build/test/lib/erlexec/src/exec.erl:1043:<0.101.0>:
DOWN reason noproc

.Broken Erlang command pipe (3): No such process
Setting alarm to 12 seconds
Got signal: 15 (oktojump=0)
Exiting (3)

Top 1 slowest tests (0.004 seconds, 4.3% of total time):
  exec:exec_test_/0:1025
    0.004 seconds

Finished in 0.094 seconds
1 tests, 0 failures

Here you see my print-out:

DOWN reason noproc

Here is what I changed the test_monitor() to

test_monitor() ->
    {ok, P, _} = exec:run("echo ok", [{stdout, null}, monitor]),
    receive {'DOWN', _, process, P, Reason} -> ?debugFmt("~nDOWN reason ~p~n", [Reason])
    after 5000 -> ?debugMsg("timeout")
    end.

So I do receive the down, it's just a noproc, which made me think that it is a race. It only seems to happen when the thing I run completes really fast (like echo ok in your test), or the machine is really busy.

So, tl;dr: I sometimes receive a DOWN noproc reason instead of a normal or {exit_status, Status} reason.

saleyn commented 8 years ago

Yes, it was a race condition. I pushed the fix.

tudborg commented 8 years ago

:+1:

Looking at the fix ( for reference: https://github.com/saleyn/erlexec/commit/bbc9297886828b19df0605370e122c2002736fc0#diff-dadb34d095a4c6a877818436e0295d69R740 )

What if we need to know if the external process exited with 0 or 1? Wouldn't this fix always look like a success due to returning normal reason?

saleyn commented 8 years ago

The underlying issue and the proper fix is actually more complex. This is a temporary bandaid until I can think of a better solution. The real problem is that erlang doesn't have a way of setting a monitor on behalf of a process, only the process itself can set up a monitor, and in the erlexec's architecture, creation of the erlang process associated with the OS pid that needs to be monitored happens in the context of the erlexec's gen_server. So only [link] option is truly reliable, because that one is initiated by the linked process, with proper guarantees. Likely, instead of using the monitor/2 function the propagation of exit has to be emulated in the ospid_loop by explicitly sending a message to the caller's Pid. Not that I like that solution either, but so far it's a more reliable option.

saleyn commented 8 years ago

I actually managed to get the proper fix in that eliminates the root cause of the race. Please check.

tudborg commented 8 years ago

Awesome! I have automated running the test_monitor() until it fails, so right now I'm waiting for an actual failure. (I've seen the timeout a few times, but this time I am pushing the machine, so I have increased the timeout and am now explicitly logging the DOWN message)

I havn't seen any races yet. I'll let this run for an additional half hour and see how it turns out, but as far as I can tell, the bug is fixed 👍

tudborg commented 8 years ago

I have been testing for about an hour in total, and havn't seen the bug. It seems like everything is running as expected.

saleyn / erlexec

tests fail on linux. monitor receives noproc instead of normal, possible race? #83