Unexpected Parallelism Causing Degraded Performance #249

jcmartin commented 6 years ago

I am running into an issue where calling linearSolve or any other linear solve function causes more than one core to be used, even though the program is not compiled with -threaded. I have been able to reproduce this issue on multiple machines.

The following is a minimal program that demonstrates the issue.

module Main where

import Numeric.LinearAlgebra

vLength :: Double
vLength = 4096

m1 :: Matrix Double
m1 = fromList [1..vLength] `outer` fromList [1..vLength]

main :: IO ()
main = print $ m1 <\> fromList [1..vLength]

The following cabal file was used with the LTS 10.4 build plan from Stackage.

name:                main
author:              jcmartin
build-type:          Simple
cabal-version:       >=1.10

executable main
  main-is:             Main.hs
  build-depends:       base >=4.9 && <5, hmatrix
  ghc-options:         -Wall -O2 -rtsopts
  default-language:    Haskell2010

The effect is most noticeable with a large matrix, but smaller matrices cause unexpected behavior as well. The following is a run on my local computer with vLength set to 2048. Note that the elapsed time is shorter than the total CPU time, which indicates that more than one thread was running simultaneously.

> stack exec -- main +RTS -s -RTS > /dev/null 
      98,134,536 bytes allocated in the heap
       1,061,792 bytes copied during GC
      33,599,568 bytes maximum residency (3 sample(s))
       1,056,688 bytes maximum slop
              68 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0        28 colls,     0 par    0.002s   0.011s     0.0004s    0.0068s
  Gen  1         3 colls,     0 par    0.000s   0.009s     0.0031s    0.0092s

  INIT    time    0.000s  (  0.015s elapsed)
  MUT     time   35.078s  ( 31.411s elapsed)
  GC      time    0.002s  (  0.020s elapsed)
  EXIT    time    0.000s  (  0.003s elapsed)
  Total   time   35.080s  ( 31.449s elapsed)

  %GC     time       0.0%  (0.1% elapsed)

  Alloc rate    2,797,600 bytes per MUT second

  Productivity 100.0% of total user, 99.9% of total elapsed
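The comparison between CPU time and elapsed time in the run above can be turned into a rough "average cores busy" figure by dividing one by the other. The following is only a sketch of that check: `coresBusy` and `timed` are hypothetical helper names, the workload in `main` is a placeholder, and the code assumes the `time` package is added to build-depends (it is not in the cabal file above). It also assumes that `getCPUTime` reports CPU time for the whole process, including threads started by foreign libraries.

```haskell
module Main where

import Control.Exception (evaluate)
import Data.Time.Clock (diffUTCTime, getCurrentTime)
import System.CPUTime (getCPUTime)

-- | Average number of cores kept busy: CPU seconds divided by
-- wall-clock seconds. A value above 1 suggests parallel execution.
coresBusy :: Double -> Double -> Double
coresBusy cpuSeconds wallSeconds = cpuSeconds / wallSeconds

-- | Hypothetical helper: run an action and return its result together
-- with the CPU seconds and wall-clock seconds it consumed.
timed :: IO a -> IO (a, Double, Double)
timed action = do
  wall0  <- getCurrentTime
  cpu0   <- getCPUTime          -- picoseconds of CPU time so far
  result <- action
  cpu1   <- getCPUTime
  wall1  <- getCurrentTime
  let cpuSeconds  = fromIntegral (cpu1 - cpu0) / 1e12
      wallSeconds = realToFrac (diffUTCTime wall1 wall0)
  return (result, cpuSeconds, wallSeconds)

main :: IO ()
main = do
  -- Figures taken from the MUT line of the run above.
  putStrLn ("cores busy (MUT): " ++ show (coresBusy 35.078 31.411))
  -- Placeholder workload; swap in the call under investigation.
  (_, cpu, wall) <- timed (evaluate (sum [1 .. 5000000 :: Int]))
  putStrLn ("cpu " ++ show cpu ++ "s, wall " ++ show wall ++ "s")
```

For the MUT figures above the ratio comes out at roughly 1.12, so on average slightly more than one core was busy during the run.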

This behavior is undesirable because, when the program runs on a machine without enough cores, its overall performance is severely hurt. The desired behavior is that the number of cores used is either configurable (at runtime or compile time) or, at a minimum, fixed to one core.

I have been unable to duplicate this behavior with other libraries or code, so I am led to believe that this is an issue specific to hmatrix.

albertoruiz commented 6 years ago

Perhaps your external BLAS/LAPACK libraries automatically use multiple cores. On my machine, with non-optimized BLAS/LAPACK, your program runs much slower, but the elapsed time is almost equal to the total time.
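One way to follow up on this suggestion is to check whether a thread limit is already set for the external library. The sketch below assumes the installed BLAS is OpenBLAS, Intel MKL, or an OpenMP-based build, which the thread does not confirm; those libraries consult `OPENBLAS_NUM_THREADS`, `MKL_NUM_THREADS`, and `OMP_NUM_THREADS` respectively. `threadVars` and `describe` are hypothetical names used only for illustration.

```haskell
module Main where

import System.Environment (lookupEnv)

-- | Environment variables that common multithreaded BLAS/LAPACK builds
-- consult for their thread count (assumption: one of these is in use).
threadVars :: [String]
threadVars = ["OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS", "OMP_NUM_THREADS"]

-- | Render one variable and its current value for display.
describe :: String -> Maybe String -> String
describe name Nothing      = name ++ " is unset"
describe name (Just value) = name ++ " = " ++ value

main :: IO ()
main = mapM_ report threadVars
  where
    -- Look up each variable and print its status on its own line.
    report name = lookupEnv name >>= putStrLn . describe name
```

If the relevant variable is unset, exporting it with the value 1 in the shell before launching the program should restrict such a library to a single thread, which would make it easy to confirm or rule out this explanation.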