Imported From: https://issues.scala-lang.org/browse/SI-5955?orig=1 Reporter: @magarciaEPFL Attachments:
@axel22 said: On snapshot 1, the bump looks like the point where the pos test group stops and the neg test group starts. (maybe)
@axel22 said: It would be nice if the graphs showed I/O activity too.
However, I don't think the underutilized CPU time is due to I/O. My hypothesis is that when a test group ends, a few of its tests take longer than the others, leaving most workers idle.
Besides, aren't the tests of one test group in partest divided equally among several actors, with each actor then running all of the tests in its share?
If that is the case, then the workload is not evenly divided between processors; there should instead be a central work item queue from which the actors take tests one by one (see the sketch after the Worker.scala snippet below). If uneven workload really is the cause, this should improve utilization and testing time.
@axel22 said: There you go:
DirectRunner.scala:
// the test files of one kind are split up front into fixed-size groups,
// and each group is handed to its own Worker actor
val workers = kindFiles.grouped(groupSize).toList map { toTest =>
  val worker = new Worker(fileManager, TestRunParams(scalaCheckParentClassLoader))
  worker.start()
  worker ! RunTests(kind, toTest)
  worker
}
Worker.scala:
// a Worker runs every test in the group it was handed, then reports all results back to the master
def act() {
  react {
    case RunTests(testKind, files) =>
      val master = sender
      kind = testKind
      runTests(files) { results =>
        master ! Results(results.toMap)
        resetAll()
      }
  }
}
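A rough sketch of the single-queue idea suggested above, for comparison with the grouping code just quoted. This is not partest code; RunOne, PullingWorker, runTest, runSingleTest and numWorkers are hypothetical names. Each worker pulls one test at a time from a shared queue, so a straggler only delays its own worker:

import java.io.File
import java.util.concurrent.ConcurrentLinkedQueue
import scala.actors.Actor

// message carrying the shared queue of test files
case class RunOne(queue: ConcurrentLinkedQueue[File])

// hypothetical worker that pulls tests one at a time instead of receiving a fixed group
class PullingWorker(runTest: File => (String, Int)) extends Actor {
  def act() {
    react {
      case RunOne(queue) =>
        val master = sender
        val results = collection.mutable.Map[String, Int]()
        var next = queue.poll()
        while (next != null) {     // keep taking tests until the shared queue is drained
          results += runTest(next)
          next = queue.poll()
        }
        master ! results.toMap     // report this worker's share of the results
    }
  }
}

// usage sketch: one shared queue, one worker per core
//   val queue = new ConcurrentLinkedQueue[File](); kindFiles foreach queue.add
//   val workers = List.fill(numWorkers)(new PullingWorker(runSingleTest))
//   workers foreach { w => w.start(); w ! RunOne(queue) }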
@magarciaEPFL said (edited on Jun 21, 2012 1:07:04 PM UTC): Yes, I/O can't be the main explanation for the low CPU utilization; uneven workload division is more likely, although classloading (unpickling, reflection) and classwriting (GenJVM, less so for GenASM) do take time.
Regarding even division of the workload: jvisualvm shows threads ForkjoinPool-1-worker-1 to ForkjoinPool-1-worker-12. Are they partest's actors?
Another observation: run partest --run and jvisualvm. Do you see those Scala logos on the top left? Those are tests running in parallel. Select "Show live threads only" to keep the timeline uncluttered.
Except at the beginning, there are about 5 logos at any given time. That's too low, even on an 8-core machine.
!jvisualvm.png!
@axel22 said:
Yes, those are worker threads created by the forkjoin pool, which is in turn used by the actors.
Not sure, but if the --verbose option printed out when a Worker actor ends its work, we'd know from the output exactly when each actor finishes and whether some actors take noticeably longer. We can try this out easily: add NestUI.verbose("actor at: " + Thread.currentThread + " done with test group") at the end of the act method in Worker.scala and run test/partest with --verbose.
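For illustration, the act() method with that call added might look as follows; placing the call inside the results callback (rather than literally after react) is one way to make it fire once the group is done:

def act() {
  react {
    case RunTests(testKind, files) =>
      val master = sender
      kind = testKind
      runTests(files) { results =>
        master ! Results(results.toMap)
        resetAll()
        // with --verbose, record which thread finished its test group, and when
        NestUI.verbose("actor at: " + Thread.currentThread + " done with test group")
      }
  }
}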
Running partest and visualvm now.
@axel22 said: I usually have 3-7 tests in parallel, on average 4 tests at a time. I have a 4-core machine with hyperthreading, so I think this makes sense on my workstation.
Some tests take much longer than the others.
@axel22 said:
There is one difference between run and pos: the pos test group never starts a new JVM instance. Perhaps starting processes is what causes such CPU underutilization.
@magarciaEPFL said:
Yes, that explains why pos and neg stay at 100% CPU utilization.
Perhaps the granularity of the remaining test groups (other than pos and neg) is such that their tests could be taken one at a time from a queue (a single queue, actually). Together with the above we would then get optimal throughput.
@axel22 said:
That could work for pos and neg too - I think each file here is big enough to be taken as a task from a single queue without even noticing the synchronization impact.
@magarciaEPFL said: I'm trying to increase the number of worker threads, like this:
./partest -Dactors.corePoolSize=32 -Dpartest.actors=32 -Dactors.maxPoolSize=64 -Dactors.enableForkJoin=true --run --debug
Output:
Info: actors.corePoolSize not defined
Testing interpreter and backend
Info: initializing scala.actors.Scheduler$@10d09ad3...
Info: scala.actors.scheduler.ThreadPoolConfig$@250d593e: java.version = 1.6.0_30
Info: scala.actors.scheduler.ThreadPoolConfig$@250d593e: java.vm.vendor = Sun Microsystems Inc.
Info: scala.actors.scheduler.ForkJoinScheduler@4026e9f9: parallelism 12
Info: scala.actors.Scheduler$@10d09ad3:
starting new scala.actors.scheduler.ForkJoinScheduler@4026e9f9 [class scala.actors.scheduler.ForkJoinScheduler]
Highlights:
- Where should actors.corePoolSize be set? (despite the -D flag above, --debug reports it as not defined)
- Can the ForkJoinScheduler's parallelism be raised? (in the snippet above it is 12)
@magarciaEPFL said:
It's strange. When running NestRunner all by itself, CPU usage stays close to 100% the whole time, unlike when running ant test.suite.
A few data points (8-core machine, no warm-up, no averaging over multiple runs, relative times are what counts):
partest --run results in
All of 986 tests were successful (elapsed time: 00:13:13)
with 4 GB peak memory use.
scala.tools.partest.nest.NestRunner --run with -Dactors.corePoolSize=32 -Dpartest.actors=32 -Dactors.maxPoolSize=64 -Dactors.enableForkJoin=true results in
All of 986 tests were successful (elapsed time: 00:15:56)
with 5 GB peak memory use.
NestRunner --run with only -Dactors.enableForkJoin=false results in
All of 986 tests were successful (elapsed time: 00:13:09)
with 6 GB peak memory use.
Regarding pool size, --debug informs us that:
Info: actors.corePoolSize not defined
Testing interpreter and backend
Info: initializing scala.actors.Scheduler$@351563ff...
Info: Thread[Thread-1,5,main]: corePoolSize = 12, maxPoolSize = 256
Info: scala.actors.Scheduler$@351563ff: starting new Thread[Thread-1,5,main] [class scala.actors.scheduler.ResizableThreadPoolScheduler]
Background info, quoting from `scala.actors.Actor`: if actors use blocking operations (for example, methods for blocking I/O), there are several options:
- The run-time system can be configured to use a larger thread pool size (for example, by setting the actors.corePoolSize JVM property).
- The scheduler method of the Actor trait can be overridden to return a ResizableThreadPoolScheduler, which resizes its thread pool to avoid starvation when actors invoke blocking operations.
- The actors.enableForkJoin JVM property can be set to false, in which case a ResizableThreadPoolScheduler is used by default to execute actors.
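As a concrete illustration of the second option in that list: an actor subclass can provide its own scheduler. A minimal sketch, assuming the one-argument ResizableThreadPoolScheduler constructor takes a daemon flag; ResizablePool and PoolBackedWorker are made-up names:

import scala.actors.{Actor, IScheduler}
import scala.actors.scheduler.ResizableThreadPoolScheduler

object ResizablePool {
  // one shared resizable pool; it grows when its actors block instead of starving them
  val scheduler: IScheduler = {
    val s = new ResizableThreadPoolScheduler(false) // argument assumed to be the daemon flag
    s.start()
    s
  }
}

// hypothetical actor that opts out of the default fork/join scheduler
class PoolBackedWorker extends Actor {
  override def scheduler: IScheduler = ResizablePool.scheduler
  def act() {
    react {
      case files: List[_] =>
        // ... run the tests in `files`, possibly blocking on I/O or child JVMs ...
    }
  }
}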
Summing up:
|| NestRunner --run || elapsed time ||
| ForkJoinScheduler, parallelism == 12 | 13 minutes |
| ForkJoinScheduler, parallelism == 32 | 16 minutes |
| ResizableThreadPoolScheduler | 13 minutes |
@magarciaEPFL said: The following bash script keeps CPU utilization close to 100% (except at the very end), shaving 5 minutes off the test suite on an 8-core machine:
./partest --pos --neg --jvm --res --buildmanager --scalacheck --script --shootout --presentation &
./partest --run
wait # for subshells to finish.
With 16 cores, a viable partitioning would be the following three groups (sketched as a script below):
--pos --neg
--jvm --res --buildmanager --scalacheck --script --shootout --presentation
--run
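A sketch of that three-way split as a script, following the same pattern as above (how well the groups balance on a given machine is a guess):

./partest --pos --neg &
./partest --jvm --res --buildmanager --scalacheck --script --shootout --presentation &
./partest --run
wait # for the backgrounded partest runs to finish.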
@paulp said: I hadn't seen this ticket. You guys might want to know that at some point I ripped the actors out of partest and deleted the pointless layers of indirection. Like miguel says,
Test suite time on my 8-core machine goes from 28:21 to 22:34.
I don't remember the condition of this branch but I pushed it in case anyone is interested.
@paulp said: Now I've refreshed my memory by building that branch. All the tests pass, and the test suite ran 7 minutes faster than the current test suite ever runs for me. The only issue is a large output spew from the scalacheck tests. I got sidetracked trying to change partest to just use scalacheck like a regular person rather than playing the ridiculous reflection dance, and when that was rejected because somebody in east timor uses partest outside the compiler and doesn't want an actual dependency on scalacheck, I promptly forgot the matter entirely.
So alex, since you're participating in this discussion, and I think the scalacheck tests are mostly yours, maybe you would like to take a peek at it...
@paulp said: Maybe THIS will fill you with excitement and new motivation:
BUILD SUCCESSFUL
Total time: 23 minutes 36 seconds
@axel22 said (edited on Jul 10, 2012 12:59:15 PM UTC): Saw your changes in that branch; they definitely make sense. Will build and run the thing now to figure out what this scalacheck output is all about.
Apparently, Som Snytt is volunteering to take a look at this.
@axel22 said: I see what you mean, I'll check this out.
@magarciaEPFL said: PR discussion at https://github.com/scala/scala/pull/870
Merged in e5c3d78b5d25a4d5356bf58cd5d3f1d41b55ee12
The snapshots below depict system load while running ant test.suite. Although coarse, they suggest that, for the number of tasks running in parallel, over half of the test.suite run is I/O bound, unlike the first quarter or so, which is CPU bound.
The data:
- CPU utilization stays high during the files/pos and files/neg tests.
- While the files/run tests are being run, the CPUs have little to do during the second half of that interval.
- CPU utilization drops once the files/jvm tests start, then stays at values below 20%. With the buildmanager tests it recovers again, followed by another decline (the decline lasts longer than the recovery). Same pattern with the scalacheck tests.
- The files/scalap, files/specialized, and files/presentation tests run at 50% utilization.
Perhaps we can get shorter wall-clock time by running two partests in parallel (don't laugh), mixing I/O-bound and CPU-bound loads. Additionally, spawning 8 parallel tasks underutilizes some cores on some machines.
In any case, any solution which doesn't require meddling with partest internals seems better than the alternatives.