jepst / CloudHaskell

A distributed computing framework for Haskell
http://hackage.haskell.org/package/remote
BSD 3-Clause "New" or "Revised" License

[MacOS] Promises lost when there are more promises than worker nodes? #14

Closed mgajda closed 12 years ago

mgajda commented 12 years ago

I have an embarrassingly parallel computing task implemented in CloudHaskell:

    runTask $ do
      pmapResult <- mapM (newPromise . mapper) inputs
      mapResult <- mapM readPromise pmapResult

This works OK as long as there are as many WORKER processes as (length inputs - 1), but if I have fewer, I get:

    2012-08-23 10:57:06.030153 CEST 2 pid://localhost:57840/11/ SYS Process got unhandled exception TaskException "Failed promise redemption"
    2012-08-23 10:57:06.030586 CEST 2 pid://localhost:57840/8/ SYS Process got unhandled exception ProcessMonitorException: pid://localhost:57840/11/ has terminated because SrException "TaskException \"Failed promise redemption\""

As tutorials are rather sparse, I wonder how I can fix it, and why it happens.

Interestingly enough, the exception doesn't appear on Linux, where everything completes okay.

Mac OS X also shows another exception for each node:

    2012-08-23 10:56:57.86235 CEST 2 pid://localhost:57837/5/ SYS Process got unhandled exception bind: resource busy (Address already in use)

jepst commented 12 years ago

I should mention that I developed Cloud Haskell on Linux, so I don't know how well it works on MacOS. Having said that, I suspect that these two errors are related. It certainly should work, regardless of the number of worker nodes.

The second error shows that CH can't bind a network address, which could prevent it from redeeming promises. That may be a MacOS-specific issue, and I don't know if I can help with it. Can you send me the complete code of your program?

I would suggest trying the new CH implementation, in the distributed-process package, but I haven't yet ported the Task layer to it. I hope to do so in the next few weeks.

mgajda commented 12 years ago

I will try to send you a simplified test example. The whole program is probably too big to work with comfortably just now ;-).

And when you want a naive user to test the new Task layer on Mac OS X and on the 48-processor nodes here, please let me know!

PS: I checked that it does indeed work on Linux.

mgajda commented 12 years ago

Here it is:

    {-# LANGUAGE FlexibleInstances, BangPatterns, OverloadedStrings, ScopedTypeVariables #-}
    {-# LANGUAGE TemplateHaskell, KindSignatures, ImpredicativeTypes #-}
    module Main where

    import Prelude hiding(String)
    import System.IO(stderr, hPutStrLn, hPutStr)
    import System.Environment(getArgs, getProgName)
    import Control.Monad(when, forM_)

    import Control.Monad.IO.Class(liftIO)

    import Remote -- CloudHaskell
    import Remote.Task(liftTaskIO, newPromise, readPromise, newPromiseAt, Locality(..))

    import Control.DeepSeq
    import Data.DeriveTH
    import Data.Digest.Pure.MD5(MD5Digest)
    import Crypto.Classes
    import qualified Data.ByteString.Char8 as BS

    -- | Print usage on the command line
    usage = do
      prog <- getProgName
      hPutStrLn stderr $ "Usage: " ++ prog ++ " ..."

    -- Here comes CloudHaskell stuff...
    hashFromFileTask :: [Char] -> TaskM BS.ByteString
    hashFromFileTask fname = liftTaskIO $ do
        contents <- BS.readFile fname
        return $! BS.pack $ fname ++ md5 contents
      where
        md5 = show . (hash' :: BS.ByteString -> MD5Digest)

    mergeResults inputs = return $! BS.concat inputs

    $( remotable [ 'hashFromFileTask ] )

    -- Here is a hand-made modification of mapReduce to do map in parallel, but fold on MASTER
    mapFold mapper reducer inputs = -- TODO: check if chunkify packages anything?
      do pmapResult <- mapM (newPromiseAt workerNodes . mapper) inputs
         mapResult <- mapM readPromise pmapResult
         reduced <- reducer mapResult
         return reduced
      where
        workerNodes = LcByRole ["WORKER"] -- all, including MASTER

    -- | Get arguments, and run makeDB on them, and write
    -- resulting database into a single file.

    initialProcess "MASTER" = do
      inputfiles <- getCfgArgs
      result <- runTask $! mapFold hashFromFileTask__closure mergeResults inputfiles
      liftIO $ BS.putStrLn result

    initialProcess "WORKER" = receiveWait []
    initialProcess _ = say "You need to start this program as either a MASTER or a WORKER. Set the appropiate value of cfgRole on the command line or in the config file."

    main = remoteInit (Just "config") [Main.__remoteCallMetaData] initialProcess

    {- SHELL SCRIPT FOR TESTING:
    ghc --make -rtsopts -with-rtsopts=-H64M\ -A2M -threaded TestCH.hs

    cat >config <<EOF
    cfgRole MASTER
    cfgHostName localhost
    cfgKnownHosts localhost
    EOF

    for i in $(seq 4); do ./TestCH -cfgRole=WORKER & done; sleep 1; time ./TestCH ../*
    -}

    {- CABAL script for dependencies:
    name:                TestCH
    version:             0.1
    synopsis:            Crashes CloudHaskell on Mac OS X
    description:         Shows errors that do not usually occur on Linux.
    category:            Data
    license:             BSD3
    --license-file:      LICENSE
    author:              Michal J. Gajda
    maintainer:          mgajda@gwdg.de
    build-type:          Simple
    cabal-version:       >=1.12

    executable TestCH
      default-language: Haskell2010
      main-is:          TestCH.hs
      ghc-options:      -threaded -with-rtsopts=-H3G -rtsopts
      build-depends:    base, binary, derive, filepath, bytestring, deepseq,
                        crypto-api, pureMD5, remote, transformers
    -}

jepst commented 12 years ago

I suspect that the problem is in your config file. cfgHostName and cfgKnownHosts must be actual names or addresses, not aliases (like localhost). cfgHostName can be omitted; the framework will simply ask the OS. cfgKnownHosts should contain the name or address of the local system, obtained from the hostname command (as well as any other hosts where workers might be running).

For example, look at the kmeans script in examples/kmeans.

If changing the config file doesn't work, can you tell me if the KMeans3 example program works for you on MacOS?
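The config-file advice above can be sketched as a small helper that writes a MASTER config using the name reported by the hostname command rather than localhost. This is only an illustration, not part of Cloud Haskell: `renderConfig` and `trim` are hypothetical names, the "key value" line format is copied from the test script earlier in this thread, and separating several known hosts with spaces is an assumption. cfgHostName is left out so that the framework asks the OS.

```haskell
module Main where

import Data.Char (isSpace)
import System.Process (readProcess)

-- | Render a config in the "key value" line format used in the test script.
-- Hypothetical helper; space-separated known hosts are an assumption.
renderConfig :: String -> [String] -> String
renderConfig role knownHosts = unlines
  [ "cfgRole " ++ role
  , "cfgKnownHosts " ++ unwords knownHosts
  ]

-- | Strip surrounding whitespace (the hostname command's output ends with a newline).
trim :: String -> String
trim = dropWhile isSpace . reverse . dropWhile isSpace . reverse

main :: IO ()
main = do
  -- Ask the OS for the real host name instead of hard-coding an alias.
  host <- trim <$> readProcess "hostname" [] ""
  let cfg = renderConfig "MASTER" [host]
  writeFile "config" cfg
  putStr cfg
```

It needs only the base and process packages. Hosts of any additional workers would be appended to the list passed to `renderConfig`.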

mgajda commented 12 years ago

Thanks a lot for the prompt hints!

Unfortunately, changing localhost to the result of the hostname command didn't help.

I promise to test the kmeans script after my vacation, in the second half of September, since the current version seems to enter a shell respawning loop and thus made me lose control over the remote Mac system I used for testing ;->.

jepst commented 12 years ago

What is a shell respawning loop?

mgajda commented 12 years ago

Hundreds of shell processes appear, possibly because the default filesystem is case-insensitive.

jepst commented 12 years ago

Yes, that's the problem. The script, named kmeans (lower case), expects to call a binary named KMeans (initial capitals), but on a case-insensitive filesystem both names resolve to the same file, so the script in fact calls itself.
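The name clash described above can be illustrated with a small check. This is a sketch only: `collides` is a hypothetical helper, and it approximates a case-insensitive filesystem with simple ASCII case folding, which is an assumption about how the names are compared.

```haskell
module Main where

import Data.Char (toLower)

-- | True when two distinct names would refer to the same file
-- on a case-insensitive filesystem (approximated by lower-casing both).
collides :: FilePath -> FilePath -> Bool
collides a b = a /= b && map toLower a == map toLower b

main :: IO ()
main = do
  print (collides "kmeans" "KMeans")   -- script name vs. binary name
  print (collides "kmeans" "KMeans3")  -- a differently named binary
```

The first pair clashes and the second does not, which is why giving the script a name that differs from the binary by more than letter case avoids the loop.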