puniverse / pulsar

Fibers, Channels and Actors for Clojure
http://docs.paralleluniverse.co/pulsar/
Other
911 stars 53 forks source link

NullPointerException on channel receive #31

Closed mnussbaum closed 9 years ago

mnussbaum commented 9 years ago

I believe we've found a race condition that is triggered by invoking core/rcv on the channel returned by a Pulsar async/thread. When iterating through a large list of channels and invoking rcv on them we occasionally see NullPointerException raised by rcv. If we slow down the rate at which we map rcv on the list we don't see the exception. The channels is the list are the output of Pulsar's async/thread method, and as far we can tell totally uniform. We are executing rcv inside of a Pulsar actor.

I've included the stack trace:

18268 [ForkJoinPool-default-fiber-pool-worker-29] INFO co.paralleluniverse.actors.behaviors.ServerActor - Exception! 
java.lang.NullPointerException                                                                                       
        at co.paralleluniverse.strands.Strand.park(Strand.java:485)                                                  
        at co.paralleluniverse.strands.ConditionSynchronizer.await(ConditionSynchronizer.java:54)                    
        at co.paralleluniverse.strands.channels.QueueChannel.receive(QueueChannel.java:342)                          
        at co.paralleluniverse.pulsar.core$rcv.invoke(core.clj:584)                                                  
        at co.paralleluniverse.pulsar.InstrumentedIFn.invoke(InstrumentedIFn.java:36)                                
        at elasticsearch_loader.elasticsearch$check_for_bulk_response_error.invoke(elasticsearch.clj:207)            
        at co.paralleluniverse.pulsar.InstrumentedIFn.invoke(InstrumentedIFn.java:36)                                
        at clojure.core$map$fn__4245.invoke(core.clj:2557)                                                           
        at clojure.lang.LazySeq.sval(LazySeq.java:40)                                                                
        at clojure.lang.LazySeq.seq(LazySeq.java:49)                                                                 
        at clojure.lang.RT.seq(RT.java:484)                                                                          
        at clojure.core$seq.invoke(core.clj:133)                                                                     
        at clojure.core$apply.invoke(core.clj:624)                                                                   
        at clojure.core$mapcat.doInvoke(core.clj:2586)                                                               
        at clojure.lang.RestFn.invoke(RestFn.java:423)                                                               
        at elasticsearch_loader.elasticsearch$wait_for_bulk_responses.invoke(elasticsearch.clj:218)                  
        at co.paralleluniverse.pulsar.InstrumentedIFn.invoke(InstrumentedIFn.java:36)                                
        at elasticsearch_loader.merchant_actor$stop_reload_BANG_.invoke(merchant_actor.clj:88)                       
        at co.paralleluniverse.pulsar.InstrumentedIFn.invoke(InstrumentedIFn.java:36)                                
        at elasticsearch_loader.merchant_actor$merchant_actor$reify__8248.handle_cast(merchant_actor.clj:154)        
        at co.paralleluniverse.pulsar.actors$Server$reify__2666.handleCast(actors.clj:618)                           
        at co.paralleluniverse.actors.behaviors.ServerActor.handleCast(ServerActor.java:325)                         
        at co.paralleluniverse.actors.behaviors.ServerActor.handleMessage(ServerActor.java:220)                      
        at co.paralleluniverse.actors.behaviors.ServerActor.behavior(ServerActor.java:194)                           
        at co.paralleluniverse.actors.behaviors.BehaviorActor.doRun(BehaviorActor.java:286)                          
        at co.paralleluniverse.actors.behaviors.BehaviorActor.doRun(BehaviorActor.java:34)                           
        at co.paralleluniverse.actors.Actor.run0(Actor.java:667)                                                     
        at co.paralleluniverse.actors.ActorRunner.run(ActorRunner.java:51)                                           
        at co.paralleluniverse.actors.Actor.run(Actor.java:236)                                                      
        at co.paralleluniverse.fibers.Fiber.run(Fiber.java:1003)                                                     
        at co.paralleluniverse.fibers.Fiber.run1(Fiber.java:998)                                                     
        at co.paralleluniverse.fibers.Fiber.exec(Fiber.java:710)                                                     
        at co.paralleluniverse.fibers.FiberForkJoinScheduler$FiberForkJoinTask.exec1(FiberForkJoinScheduler.java:248)
        at co.paralleluniverse.concurrent.forkjoin.ParkableForkJoinTask.doExec(ParkableForkJoinTask.java:116)        
        at co.paralleluniverse.concurrent.forkjoin.ParkableForkJoinTask.exec(ParkableForkJoinTask.java:73)           
        at jsr166e.ForkJoinTask.doExec(ForkJoinTask.java:261)                                                        
        at jsr166e.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:988)                                             
        at jsr166e.ForkJoinPool.runWorker(ForkJoinPool.java:1628)                                                    
        at jsr166e.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Unfortunately we've yet to be able to reproduce the issue on anything less then our full project. Any suggestions welcome.

Thanks!

circlespainter commented 9 years ago

Turning on instrumentation verification by setting the system property -Dco.paralleluniverse.fibers.verifyInstrumentation=true doesn't highlight anything abnormal? Could you try with the 0.6.3-SNAPSHOT version of Quasar as well?

mnussbaum commented 9 years ago

Turning on instrumentation highlighted a stack of lazy functions we were using, starting from a for loop. Upon removing the lazy sequence iteration in favor of a loop/recur the issue appears to go away. Thanks for the pointer in the right direction!