haskell-distributed / distributed-process-platform

DEPRECATED (Cloud Haskell Platform) in favor of distributed-process-extras, distributed-process-async, distributed-process-client-server, distributed-process-registry, distributed-process-supervisor, distributed-process-task and distributed-process-execution
http://haskell-distributed.github.com
BSD 3-Clause "New" or "Revised" License

log child shutdown errors in terminateChildren #91

Closed tavisrudd closed 10 years ago

tavisrudd commented 10 years ago

Fixes DPP-98 on JIRA.

hyperthunk commented 10 years ago

Thanks @tavisrudd - I'll try and get it merged this week.

hyperthunk commented 10 years ago

I've seen one intermittent failure (1 run out of 1000) here. It is on a branch test, and as the NOTICE points out, these rely on non-guaranteed ordering semantics, so it's possibly not a problem, but we should keep an eye on it.

t4@guest-10-190:distributed-process-platform $ ./dist/build/SupervisorTests/SupervisorTests +RTS -N
NOTICE: Branch Tests (Relying on Non-Guaranteed Message Order) Can Fail Intermittently
Supervisor Processes:
>>>>>>>>>>>>>> [snip]
    Restart Left:
      Restart Left, Left To Right (Sequential) Restarts: [OK]
      Restart Left, Leftmost Child Dies: [OK]
      Restart Left, Left To Right Stop, Left To Right Start: [Failed]

Expected: equalTo pid://127.0.0.1:8080:0:2625
     but: was pid://127.0.0.1:8080:0:2627
      Restart Left, Right To Left Stop, Right To Left Start: [OK]
      Restart Left, Left To Right Stop, Reverse Start: [OK]
      Restart Left, Right To Left Stop, Reverse Start: [OK]
    Restart Right:
      Restart Right, Left To Right (Sequential) Restarts: [OK]
      Restart Right, Rightmost Child Dies: [OK]
      Restart Right, Left To Right Stop, Left To Right Start: [OK]
      Restart Right, Right To Left Stop, Right To Left Start: [OK]
      Restart Right, Left To Right Stop, Reverse Start: [OK]
      Restart Right, Right To Left Stop, Reverse Start: [OK]
  Restart Intensity:
    Three Attempts Before Successful Restart: [OK]
    Permanent Child Exceeds Restart Limits: [OK]
  ToChildStart Link Setup:
    Both Local Process Instances Link Appropriately: [OK]

         Test Cases   Total       
 Passed  70           70          
 Failed  1            1           
 Total   71           71        
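
As a generic illustration of what "non-guaranteed message order" means in the NOTICE above (this is not the supervisor test code): messages from one sender arrive in the order they were sent, but messages from different senders can interleave differently on each run. The sketch below uses a plain `Chan` from `base` as a stand-in for a mailbox; `Msg`, `send`, `collect` and `perSenderOrdered` are hypothetical names made up for this example.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Monad (forM_, replicateM)

-- A message tagged with its (hypothetical) sender and a sequence number.
data Msg = Msg { sender :: Char, seqNo :: Int } deriving (Eq, Show)

-- Each sender writes three numbered messages, in order.
send :: Chan Msg -> Char -> IO ()
send ch who = forM_ [1 .. 3] $ \n -> writeChan ch (Msg who n)

-- Two concurrent senders share one channel; the reader takes all six
-- messages. The interleaving of 'A' and 'B' is up to the scheduler.
collect :: IO [Msg]
collect = do
  ch <- newChan
  _ <- forkIO (send ch 'A')
  _ <- forkIO (send ch 'B')
  replicateM 6 (readChan ch)

-- An order-tolerant check: only the per-sender order is asserted,
-- not the exact global sequence.
perSenderOrdered :: [Msg] -> Bool
perSenderOrdered msgs = all ok "AB"
  where
    ok who = [seqNo m | m <- msgs, sender m == who] == [1 .. 3]

main :: IO ()
main = do
  msgs <- collect
  mapM_ print msgs
  putStrLn ("per-sender order holds: " ++ show (perSenderOrdered msgs))
```

A test that compares `msgs` against one fixed interleaving would pass most of the time and fail occasionally, whereas a check like `perSenderOrdered` only depends on the ordering that is actually promised.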

tavisrudd commented 10 years ago

I've seen the following failures a few times:

Supervisor Processes:
  Stopping And Deleting Children:
    Sequential Shutdown Ordering: [Failed]
expected the shutdown order to hold

NOTICE: Branch Tests (Relying on Non-Guaranteed Message Order) Can Fail Intermittently

Mon Mar 17 01:36:56 UTC 2014 [trace] MxReceived pid://127.0.0.1:8080:0:10 "\NUL\NUL\NUL\NUL\NUL\NUL\NUL\EOTjob1\NUL" :: (2bef152fd819b3fd,9f700da2acc86729)
Mon Mar 17 01:36:56 UTC 2014 [trace] MxProcessDied pid://127.0.0.1:8080:0:10 (DiedException "exit-from=pid://127.0.0.1:8080:0:10,reason=timing is out - job1 isn't registered yet")
Mon Mar 17 01:36:58 UTC 2014 [trace] MxProcessDied pid://
Task Execution And Prioritisation:
  Each execution blocks the submitter: [OK]
  Only 'max' tasks can proceed at any time: [Failed]
ERROR: thread blocked indefinitely in an MVar operation
  Crashing Tasks are Reported Properly: [OK]

         Test Cases  Total
 Passed  2           2
 Failed  1           1
 Total   3           3
127.0.0.
Test suite TaskQueueTests: FAIL

Tue Mar 18 20:41:07 UTC 2014 [trace] MxReceived pid://127.0.0.1:8080:0:29 "\NUL\NUL\NUL\NUL\NUL\NUL\NUL\EOTjob2\SOH" :: (5187ee24bb3438de,9efee0a8a7e7c95)
Tue Mar 18 20:41:07 UTC 2014 [trace] MxProcessDied pid://127.0.0.1:8080:0:18 DiedNormal
Tue Mar 18 20:41:07 UTC 2014 [trace] MxSpawned pid://127.0.0.1:8080:0:30
Tue Mar 18 20:41:07 UTC 2014 [trace] MxProcessDied pid://127.0.0.1:8080:0:29 DiedNormal
Tue Mar 18 20:41:07 UTC 2014 [trace] MxSent pid://127.0.0.1:8080:0:21 pid://127.0.0.1:8080:0:16 [unencoded message] :: CallResponse (Either ExitReason [Char])
Tue Mar 18 20:41:07 UTC 2014 [trace] MxReceived pid://127.0.0.1:8080:0:21 [unencoded message] :: CallResponse (Either ExitReason [Char])
Tue Mar 18 20:41:07 UTC 2014 [trace] MxProcessDied pid://127.0.0.1:8080:0:28 DiedNormal
Tue Mar 18 20:41:07 UTC 2014 [trace] MxProcessDied pid://127.0.0.1:8080:0:21 DiedNormal
Tue Mar 18 20:41:07 UTC 2014 [trace] MxProcessDied pid://127.0.0.1:8080:0:30 DiedNormal
Tue Mar 18 20:41:07 UTC 2014 [trace] MxSpawned pid://127.0.0.1:8080:0:31
Tue Mar 18 20:41:07 UTC 2014 [trace] MxProcessDied pid://127.0.0.1:8080:0:31 DiedNormal
Tue Mar 18 20:41:07 UTC 2014 [trace] MxProcessDied pid://127.0.0.1:8080:0:20 DiedNormal
Tue Mar 18 20:41:07 UTC 2014 [trace] MxProcessDied pid://127.0.0.1:8080:0:13 DiedNormal
Tue Mar 18 20:41:
Task Execution And Prioritisation:
  Each execution blocks the submitter: [OK]
  Only 'max' tasks can proceed at any time: [OK]
  Crashing Tasks are Reported Properly: [Failed]
expected the server to report an error

On May 9, 2014, at 5:58 AM, Tim Watson notifications@github.com wrote:

I've seen one intermittent failure (1 run out of 1000) here. It is on a branch test, and as the NOTICE points out, these rely on non-guaranteed ordering semantics, so it's possibly not a problem, but we should keep an eye on it.


hyperthunk commented 10 years ago

ERROR: thread blocked indefinitely in an MVar operation

That looks to me like a bug in the test code. There is no code blocking on MVars in the task queues after all, so my assumption is that something has crashed or ceased communicating with the coordinating thread, leaving the test case unable to proceed (and thankfully, generating a runtime deadlock warning out of the RTS). That could be (is probably!?) indicative of a bug, but we need to track down the source of the failure. I'll try and look at it this week.
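
For illustration, here is a minimal sketch of how a crashed worker can leave a coordinating thread stuck on an MVar, which is the kind of situation guessed at above. It is plain `base` code, not the task queue or its test suite, and `crashingWorker`, `awaitFragile` and `awaitRobust` are hypothetical names. It assumes GHC's default deadlock detection, which raises `BlockedIndefinitelyOnMVar` in a thread that waits on an MVar no other live thread can reach.

```haskell
import Control.Concurrent (forkFinally, forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception
  (BlockedIndefinitelyOnMVar, SomeException, throwIO, try)
import Control.Monad (void)

-- Hypothetical worker that fails before it produces a result.
crashingWorker :: IO Int
crashingWorker = throwIO (userError "worker crashed")

-- Fragile: when the worker dies, putMVar never runs. Once the worker is
-- gone nothing else references the MVar, so the RTS raises
-- BlockedIndefinitelyOnMVar in the waiting thread. The worker's own
-- uncaught exception is printed to stderr by the forked thread.
awaitFragile :: IO (Either BlockedIndefinitelyOnMVar Int)
awaitFragile = do
  result <- newEmptyMVar
  void (forkIO (crashingWorker >>= putMVar result))
  try (takeMVar result)

-- More robust: forkFinally always delivers an outcome, success or
-- failure, so the coordinator sees the real error instead of a deadlock.
awaitRobust :: IO (Either SomeException Int)
awaitRobust = do
  result <- newEmptyMVar
  void (forkFinally crashingWorker (putMVar result))
  takeMVar result

main :: IO ()
main = do
  fragile <- awaitFragile
  putStrLn ("fragile wait: " ++ either show show fragile)
  robust <- awaitRobust
  putStrLn ("robust wait:  " ++ either show show robust)
```

In the fragile version the original failure is lost and only the deadlock report surfaces, which matches the symptom in the log: the MVar error points at the waiting side, while the actual cause is whatever stopped the other side from replying.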