oracle / truffleruby

A high performance implementation of the Ruby programming language, built on GraalVM.
https://www.graalvm.org/ruby/
Other
2.98k stars 180 forks source link

endless loop on test with celluloid #1833

Open HoneyryderChuck opened 4 years ago

HoneyryderChuck commented 4 years ago

https://travis-ci.org/swisscom/ruby-netsnmp/jobs/617962889?utm_medium=notification&utm_source=github_status

@eregon follow up of #1730

eregon commented 4 years ago

Thank you for filing an issue.

Relevant output:

$ spec/support/specs.sh run
...
with cellulloid
  behaves like an snmp client
    #get

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.
Check the details on how to adjust your build configuration on: https://docs.travis-ci.com/user/common-build-problems/#build-times-out-because-no-output-was-received
bjfish commented 4 years ago

I've tracked this down to this monitor.closed? which hangs and never calls the underlying C extension method: https://github.com/celluloid/celluloid-io/blob/49c1021b435073cd7bc9a394658d273488f4105e/lib/celluloid/io/reactor.rb#L52

I would suspect an issue relating to the mutex here: https://github.com/oracle/truffleruby/blob/3b756ceddba635edb4a5673ab999a26f427074de/lib/truffle/truffle/cext_ruby.rb#L37 The specs appear to pass when CEXT_LOCK is set to false.

eregon commented 4 years ago

@bjfish Interesting, could you show a Java thread dump (from Ctrl+) or Ruby thread dump (send SIGALRM to the process) when that happens? Probably something else is keeping the C-ext lock and never releasing it.

bjfish commented 4 years ago

@eregon Java thread dump: https://gist.github.com/bjfish/ab5b30e7a29c009d46a5db0d889d9df1 Ruby thread dump: https://gist.github.com/bjfish/8d4f1441f561038cef02c8fed9da876d

eregon commented 4 years ago

OK that shows:

"Ruby Fiber id=14 from /Users/brandonfish/Documents/oracle-workspace/gem_home/gems/celluloid-0.17.4/lib/celluloid/task/fibered.rb:9" #14 prio=5 os_prio=31 tid=0x00007f8a2c5a8800 nid=0x3f03 waiting on condition [0x000070000a0f0000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000006c07b3678> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireInterruptibly(AbstractQueuedSynchronizer.java:897)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1222)
    at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:335)
    at org.truffleruby.core.mutex.MutexOperations.lambda$lockInternal$0(MutexOperations.java:40)
    at org.truffleruby.core.mutex.MutexOperations$$Lambda$806/1719946447.block(Unknown Source)
    at org.truffleruby.core.thread.ThreadManager.runUntilResult(ThreadManager.java:482)
    at org.truffleruby.core.mutex.MutexOperations.lockInternal(MutexOperations.java:39)
    at org.truffleruby.cext.CExtNodes$CallCWithMutexNode.callCWithMutex(CExtNodes.java:135)

is waiting for the C-ext lock, but this fiber:

"Ruby Thread id=16 from /Users/brandonfish/Documents/oracle-workspace/gem_home/gems/celluloid-0.17.4/lib/celluloid/group/spawner.rb:47" #16 prio=5 os_prio=31 tid=0x00007f8a29cec800 nid=0x3d07 waiting on condition [0x0000700009ff2000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000007837cd1b8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at org.truffleruby.core.fiber.FiberManager.lambda$waitForResume$3(FiberManager.java:236)
    at org.truffleruby.core.fiber.FiberManager$$Lambda$809/345437432.block(Unknown Source)
    at org.truffleruby.core.thread.ThreadManager.runUntilResultKeepStatus(ThreadManager.java:450)
    at org.truffleruby.core.fiber.FiberManager.waitForResume(FiberManager.java:234)
...
   Locked ownable synchronizers:
    - <0x00000007838c19f0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    - <0x00000006c07b3678> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

holds the C-ext locks and is waiting for the next Fiber#resume.

We need to figure out why it's still holding the C-ext lock (is there some C-ext code ending up in Fiber.yield?) and if it should have released it before. If it's actually a C-ext calling (can be indirectly) Fiber.yield then we need to figure where is a good place to release the C-ext lock.