puniverse / quasar

Fibers, Channels and Actors for the JVM
http://docs.paralleluniverse.co/quasar/

Messages to actors are lost under heavy load #204

Closed scameronde closed 8 years ago

scameronde commented 8 years ago

Sometimes, under heavy load and with cascading messages, some messages seem to get lost. When monitoring the actors, the number of received messages for some actors is too low.

I have attached a small sample project. I have to run the sample 3-5 times for the problem to occur.

Please keep in mind, that I am new to Quasar and might be doing something completely wrong.

Test environment MacBook Pro:

Java:

bugreport.zip

circlespainter commented 8 years ago

Hi, I have tried a lot but I've been unable to reproduce it. My setup is: MBP5,1 (2.4G Core 2 Duo, 2 cores, HT is disabled in HW), 8GB RAM, OSX 10.11.5, Oracle JDK 1.8.0_92-b14.

I made only a few minor changes; the most relevant ones are logging all actions, explicitly allocating a 4G heap (with your constants it crashes here with an OOM), and removing Guava 15 from the dependencies: Quasar depends on 19.0 and, since I think Maven uses a nearest-wins (tree-wise) conflict-resolution strategy, this could perhaps cause problems.

Could you try again with my version?

quasar-204.zip

scameronde commented 8 years ago

Hi, I changed my setup step by step to eliminate potential culprits.

1) removed Guava 15 from dependencies: problem still exists
2) set max memory to 4 GB: problem still exists
3) set min memory to 4 GB: problem still exists
4) upgraded to JDK 1.8.0_92-b14: problem still exists
5) used your System.err.println statements: problem gone.

Looks like a timing issue, because the logging slows everything down. In a second test, I redirected the output to a file on an SSD, with the same result: the problem is gone. Then I removed most of your logging statements and voilà: the problem is back.

It smells like a race condition where you have to have bad luck with timing.

circlespainter commented 8 years ago

Yes, and it must happen in a very small time window, because I can't reproduce it even with in-memory logging (attached).

Thanks for investing the time; if you can spend a bit more, it would help to try with a mailbox size of 0. Less interesting, but it would also help to try this new version, just to get some empirical info on the best race-finding setups (more cores could increase the chances).
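For reference, here is a minimal sketch of the suggested configuration, assuming Quasar's MailboxConfig and OverflowPolicy API; the actor class and message string are illustrative placeholders, not code from the attached project.

    import co.paralleluniverse.actors.BasicActor;
    import co.paralleluniverse.actors.MailboxConfig;
    import co.paralleluniverse.fibers.SuspendExecution;
    import co.paralleluniverse.strands.channels.Channels.OverflowPolicy;

    // Placeholder actor configured as suggested above: mailbox size 0 with a
    // BLOCK overflow policy, so sends block when the mailbox is full rather
    // than growing it unboundedly.
    public class BoundedMailboxActor extends BasicActor<String, Void> {
        public BoundedMailboxActor() {
            super(new MailboxConfig(0, OverflowPolicy.BLOCK));
        }

        @Override
        protected Void doRun() throws InterruptedException, SuspendExecution {
            while (!"shutdown".equals(receive()))
                ;   // regular messages would be processed here
            return null;
        }
    }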

quasar-204.zip

scameronde commented 8 years ago

I did two tests:

1) with your in-memory logging: still 1 out of 5 runs does not finish
2) with a mailbox size of 0 and BLOCK as the overflow policy: 1 in 10 runs did not finish.

With 1) I had to increase the heap to 8 GB, because I got OOMs with 4 GB:

    Exception in Fiber "fiber-10343926"
    Exception in Fiber "fiber-10250341"
    Exception in Fiber "fiber-10516198" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
        at java.lang.StringBuilder.append(StringBuilder.java:136)
        at eu.lindenbaum.lab.racetest.Util.log(Util.java:16)
        at eu.lindenbaum.lab.racetest.ChildActor.doRun(ChildActor.java:22)
        at eu.lindenbaum.lab.racetest.ChildActor.doRun(ChildActor.java:8)
        at co.paralleluniverse.actors.Actor.run0(Actor.java:691)
        at co.paralleluniverse.actors.ActorRunner.run(ActorRunner.java:51)
        at co.paralleluniverse.fibers.Fiber.run(Fiber.java:1027)

If you need further help identifying the problem, just tell me what I should do.

circlespainter commented 8 years ago

Thanks, are you using 10 parent actors and 100k children each?

circlespainter commented 8 years ago

Also, can you confirm from the logs that the issue is due to lost messages and not something else? If so, is it possible to tell how many are lost?

scameronde commented 8 years ago

I tried it with different numbers: 10 parents with 100k children each, and 1k parents with 1k children each, with the same result. It is the total number of children (actors) that seems to be the problem; both cases have 1 million actors. When I go lower in numbers, the problem happens less often; if I go higher, it happens more often.

I did not use log messages to monitor the behavior, but the JMX console of JVisualVM / Mission Control. The number of received messages the parent actors report via JMX is off. When everything is OK and all actors terminate, the reported number of received messages is correct. When some of the actors do not terminate, sometimes one actor reports fewer than the expected number of received messages, sometimes two or three actors do. Sometimes only one message for a parent actor is missing, sometimes 80%.

circlespainter commented 8 years ago

Thanks, we'll look into it.

scameronde commented 8 years ago

I investigated the situation a bit more by monitoring sent and received messages. The messages that get lost are of type LifecycleMessage (ExitMessage).

The root actor watches 10 parent actors. Each parent actor watches 100,000 child actors. All child actors receive the shutdown message sent by my code. All child actors quit their doRun loop. I am using JMX counters to ensure that the numbers are correct. When the problem occurs, most of the time 9 out of 10 parent actors receive the expected number of ExitMessages and only one parent actor has problems. The number of missing messages for this one parent actor is always a multiple of 10. I have seen 10, 20 or 30 messages missing.
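For reference, a minimal sketch of the watch topology described above, assuming Quasar's BasicActor API; the class names, message string, and counts are illustrative placeholders, not the reporter's actual code.

    import co.paralleluniverse.actors.ActorRef;
    import co.paralleluniverse.actors.BasicActor;
    import co.paralleluniverse.fibers.SuspendExecution;

    // Placeholder child: quits its doRun loop when it receives "shutdown".
    class ChildActor extends BasicActor<String, Void> {
        @Override
        protected Void doRun() throws InterruptedException, SuspendExecution {
            while (!"shutdown".equals(receive()))
                ;   // regular messages would be processed here
            return null;    // normal termination; watchers are sent an ExitMessage
        }
    }

    // Placeholder parent: spawns and watches its children, so each child's
    // termination is delivered to this actor as an ExitMessage.
    class ParentActor extends BasicActor<String, Void> {
        private final int numChildren;

        ParentActor(int numChildren) {
            this.numChildren = numChildren;
        }

        @Override
        protected Void doRun() throws InterruptedException, SuspendExecution {
            for (int i = 0; i < numChildren; i++) {
                ActorRef<String> child = new ChildActor().spawn();
                watch(child);              // register a watch on the child
                child.send("shutdown");    // placeholder trigger for the child to exit
            }
            // ... then wait for numChildren ExitMessages, which are delivered
            // through handleLifecycleMessage rather than returned by receive().
            return null;
        }
    }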

Maybe this helps finding the problem.

circlespainter commented 8 years ago

Thanks, I think it does. Lifecycle messages are pretty normal messages, except that they're produced and consumed by watch- and link-related logic (and filterMessage/handleLifecycleMessage). The problem could be there, because if no normal message is ever lost, then we can probably exclude a mailbox/queue issue.
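For illustration, a minimal sketch of that path, assuming Quasar's BasicActor API: a watched actor's ExitMessage is not returned by receive() but is routed through filterMessage/handleLifecycleMessage, so a parent typically overrides handleLifecycleMessage to observe it. The class name, counter, and timed receive below are illustrative assumptions, not Quasar internals or the reporter's code.

    import java.util.concurrent.TimeUnit;

    import co.paralleluniverse.actors.BasicActor;
    import co.paralleluniverse.actors.ExitMessage;
    import co.paralleluniverse.actors.LifecycleMessage;
    import co.paralleluniverse.fibers.SuspendExecution;

    // Placeholder actor that watches some children elsewhere and waits until
    // all of their ExitMessages have been observed.
    public class ExitCountingActor extends BasicActor<String, Void> {
        private final int expectedExits;
        private int seenExits;

        public ExitCountingActor(int expectedExits) {
            this.expectedExits = expectedExits;
        }

        @Override
        protected Void doRun() throws InterruptedException, SuspendExecution {
            // ExitMessages are consumed by handleLifecycleMessage below, not
            // returned here; the timeout lets the loop re-check the counter
            // even when no ordinary message arrives.
            while (seenExits < expectedExits)
                receive(100, TimeUnit.MILLISECONDS);
            return null;
        }

        @Override
        protected String handleLifecycleMessage(LifecycleMessage m) {
            if (m instanceof ExitMessage)
                seenExits++;    // a watched actor terminated
            return null;        // swallow the lifecycle message
        }
    }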

circlespainter commented 8 years ago

Could you run again with Quasar's "flight recorder" turned on and share the output file? It basically records many internal events in memory (including actor-related ones) and then dumps them upon exit. You can enable it with -Dco.paralleluniverse.debugMode=true; have a look at more options here.

scameronde commented 8 years ago

I tried it with the following settings:

        <argument>-Dco.paralleluniverse.fibers.detectRunawayFibers=false</argument>
        <argument>-Dco.paralleluniverse.fibers.verifyInstrumentation=false</argument>
        <argument>-Dco.paralleluniverse.debugMode=true</argument>
        <argument>-Dco.paralleluniverse.flightRecorderDumpFile=quasar.log</argument>
        <argument>-Dco.paralleluniverse.monitoring.flightRecorderLevel=5</argument>
        <argument>-Dco.paralleluniverse.globalFlightRecorder=true</argument>
        <argument>-Dco.paralleluniverse.monitoring.flightRecorderSize=200000</argument>

The flight recorder slowed everything down from a few seconds of runtime to more than an hour with heavy system load. The problem did not occur in two runs (I have not had the time for more runs yet).

How can I tweak the parameters to get a better runtime, to increase the chance of the problem occurring, and to squeeze more runs into one day?

circlespainter commented 8 years ago

Sorry, I didn't expect such long runtimes. Level 1 should already produce interesting information but if it doesn't help we can try with a custom Quasar build. Thanks for your support.

scameronde commented 8 years ago

When logging at level 1 and increasing the size of the flight recorder, the problem occurs again. The problem is that the program does not terminate, because there are still actors running due to missing lifecycle messages. It seems that the flight recorder log is only written to disk when the program ends.

My idea is to install a shutdown hook that catches a kill signal and writes the log via an API call. What API call do I have to make to force writing of the log?
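A minimal sketch of that shutdown-hook idea; dumpFlightRecorder() below is a hypothetical placeholder, not an actual Quasar API call, since the thread leaves the exact call open.

    // Sketch of the proposed shutdown hook. dumpFlightRecorder() is a hypothetical
    // placeholder for whatever Quasar call actually forces the flight-recorder dump.
    public final class FlightRecorderDumpHook {
        public static void install() {
            Runtime.getRuntime().addShutdownHook(new Thread(() -> {
                // Runs on normal exit or SIGTERM (kill), but not on SIGKILL (kill -9).
                dumpFlightRecorder();
            }, "quasar-flight-recorder-dump"));
        }

        private static void dumpFlightRecorder() {
            // Placeholder: replace with the actual Quasar flight-recorder dump call.
        }
    }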

pron commented 8 years ago

Hi @scameronde. We've been able to reproduce the bug on a different machine and are looking into it.

scameronde commented 8 years ago

Hi. How is the fix coming along? Did you manage to identify the cause of the problem?

pron commented 8 years ago

I haven't had the time to work on the issue lately, but I'll devote next week to fixing it, as well as a couple of other important Quasar bugs. I hope and expect you'll have a fix by the end of next week.

linkerlin commented 8 years ago

@pron ExecutorService will drop some messages if the queue of Callables is full.

pron commented 8 years ago

@scameronde The issue should be fixed now in 0.7.6-SNAPSHOT. It was a subtle race in the implementation of the unbounded queue used by the actor mailbox.

scameronde commented 8 years ago

Thanks! I will test it this weekend.