SkorikGG opened this issue 4 years ago
Pull request #71 has been added.
Could you explain why this "degrades performance"? In your example it doesn't matter which order we execute things in, so long as we keep all the cores busy.
A performance drop of a few percent can be seen in this example:
```haskell
{-# LANGUAGE BangPatterns #-}

import Control.Monad.Par
import System.Clock

data IVList a = Nil | Cons a (IVar (IVList a))
type Stream a = IVar (IVList a)

parFoldMapBuffer :: Int -> (c -> b -> Par c) -> (a -> Par b) -> c -> [a] -> Par c
parFoldMapBuffer n f g z0 xs0 = do
    sr <- new
    mswxs <- start n sr xs0
    folder sr z0 mswxs
  where
    start _ sw [] = put_ sw Nil >> return Nothing
    start 0 sw xs = return $ Just (sw, xs)
    start !m sw (x:xs) = do
        sw' <- new
        ivy <- new
        put_ sw (Cons ivy sw')
        fork $ g x >>= put_ ivy
        start (m - 1) sw' xs
    folder sr !z mswxs = do
        se <- get sr
        case se of
            Nil -> return z
            Cons ivy sr' -> do
                y  <- get ivy
                z' <- f z y
                mswxs' <- case mswxs of
                    Nothing -> return Nothing
                    Just (sw, []) -> put_ sw Nil >> return Nothing
                    Just (sw, x:xs) -> do
                        sw' <- new
                        ivy' <- new
                        put_ sw (Cons ivy' sw')
                        fork $ g x >>= put_ ivy'
                        return $ Just (sw', xs)
                folder sr' z' mswxs'

fn :: Int -> Double -> Double
fn 0 !x = x
fn !n !x = fn (n - 1) $ sin $ 3 * x

g :: Double -> Double
g x = fn 10000 x

test :: Int -> Int -> Double
test m n = runPar $ parFoldMapBuffer 128 (\x y -> return $ fn 2500 (x + y)) (return . g . fromIntegral) 0.0 [m .. n]

main :: IO ()
main = do
    t1 <- getTime Realtime
    print $ test 0 50000
    t2 <- getTime Realtime
    print $ fromIntegral (sec t2 - sec t1) + 1e-9 * fromIntegral (nsec t2 - nsec t1)
```
Some uneven activity of the processor cores is noticeable in the eventlog: Eventlog.zip
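(For reference, an eventlog like this is typically produced by compiling with `-threaded -eventlog -rtsopts` and running with `+RTS -N -l`; the exact options used for the attached log are not stated here, so take these flags as an assumption.)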
In addition, the new commit in pull request #71 makes the scheduler behave as one would intuitively expect. With this change, the following function works properly (with full parallelism):
```haskell
data IVList a = Nil | Cons a (IVar (IVList a))
type Stream a = IVar (IVList a)

parFoldMap :: (c -> b -> Par c) -> (a -> Par b) -> c -> [a] -> Par c
parFoldMap f g z0 xs0 = do
    sr <- new
    fork $ producer sr xs0
    consumer sr z0
  where
    producer sw [] = put_ sw Nil
    producer sw (x:xs) = do
        sw' <- new
        ivy <- new
        put_ sw (Cons ivy sw')
        fork $ g x >>= put_ ivy
        producer sw' xs
    consumer sr !z = do
        se <- get sr
        case se of
            Nil -> return z
            Cons ivy sr' -> do
                y  <- get ivy
                z' <- f z y
                consumer sr' z'
```
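For completeness, a driver in the same style as the `test` function from the first snippet might look like this; `fn` and `g` are the workload functions defined above, and the name `test2` is only illustrative:

```haskell
-- Same fold/map workload as `test`, but using the unbuffered parFoldMap.
test2 :: Int -> Int -> Double
test2 m n = runPar $ parFoldMap (\x y -> return $ fn 2500 (x + y)) (return . g . fromIntegral) 0.0 [m .. n]
```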
For comparison, here is the same job done with the spark pool:
```haskell
import Control.Parallel (par, pseq)
import Data.List (foldl')

parFoldMap2 :: (c -> b -> c) -> (a -> b) -> c -> [a] -> c
parFoldMap2 f g z xs = ys `pseq` foldl' f z ys
  where
    ys  = start ys' ys'
    ys' = map g xs
    start []     ys = ys
    start (x:xs) ys = x `par` start xs ys
```
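This version uses `par`/`pseq` from Control.Parallel instead of the Par monad. A driver analogous to the one above (again reusing the workload from the first snippet, with an illustrative name) could be:

```haskell
-- Spark-pool variant of the same computation; run with -threaded and +RTS -N.
test3 :: Int -> Int -> Double
test3 m n = parFoldMap2 (\x y -> fn 2500 (x + y)) (g . fromIntegral) 0.0 [m .. n]
```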
It's been a while since I looked at monad-par in detail, so apologies if I'm being a bit slow here. Could you explain why we get more parallelism with your change?
My intuition is that there isn't any clearly "better" scheduling strategy without knowing something about the workload, which the scheduler can't have any prior knowledge of. So demonstrating an improvement on one example isn't sufficient evidence to change the scheduling strategy - there will be other examples that get slower. You have to make the case either that
For example:
On a quad-core processor, the forks will be executed in the order f1, f2, f3, f4, f16, f15, f14, ..., f5. Thus f5 only gets executed once the remaining forks have been processed. This behavior can degrade performance and reduce parallelism because of the long wait for f5's result.
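As a rough sketch of the situation described above (the sixteen forked computations f1..f16, the workload inside them, and the quad-core setting are all illustrative, not taken from this issue):

```haskell
import Control.Monad (forM)
import Control.Monad.Par

-- Fork sixteen independent tasks f1..f16 and read their results in order.
-- With the scheduling order described above (f1..f4 first, then f16,
-- f15, ..., f5), the ordered reads have to wait for f5 even though it
-- was one of the earliest forks: the "long wait for f5" mentioned above.
demo :: Par [Double]
demo = do
    ivs <- forM [1 .. 16 :: Int] $ \i -> do
        iv <- new
        fork $ put_ iv (work i)
        return iv
    mapM get ivs
  where
    work i = sum [sin (fromIntegral (i * k)) | k <- [1 .. 100000 :: Int]]
```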