ocaml-multicore / domainslib

Parallel Programming over Domains
ISC License
173 stars 30 forks source link

possible bug in parallel_for_reduce #33

Closed nilsbecker closed 3 years ago

nilsbecker commented 3 years ago

i might just be dense here:

#require "domainslib"
open Domainslib
let pool = Task.setup_pool ~num_additional_domains:3
let ar = [|1;2;3;4;|]
let res init = Task.parallel_for_reduce pool ~chunk_size:2 ~start:0 ~finish:3
    ~body:(fun i -> ar.(i)) (+) init
[res 0; res 10]

i get [10;30] but i believe it should be [10;20] . i'm guessing it's in line 79 in task.ml where the global init is used to initialize chunks that start somewhere in the middle? https://github.com/ocaml-multicore/domainslib/blob/0a3b4f3bb64bd2f14c53518c31282ef421b7858c/lib/task.ml#L79

domainslib 0.3.0 on ocaml 4.12.0+domains

nilsbecker commented 3 years ago

setting chunk_size to 1 further shows the issue, giving the result 50.

nilsbecker commented 3 years ago

maybe the semantics should be such that the initializing value is always body start where start is the beginning index of the chunk? right now the signature seems to be more like a fold operation, but then it's unclear why init must be of the same type.

something like :

let parallel_for_reduce' ?(chunk_size=0) ~start ~finish ~body pool reduce_fun =
  let chunk_size = if chunk_size > 0 then chunk_size
    else begin
      let n_domains = (Array.length pool.domains) + 1 in
      let n_tasks = finish - start + 1 in
      if n_domains = 1 then n_tasks
      else max 1 (n_tasks / (8 * n_domains))
    end
  in
  let rec work s e =
    if e - s < chunk_size then
      let rec loop i acc =
        if i >= e then acc
        else loop (i+1) (reduce_fun acc (body (i+1)))
      in
      loop s (body s)
    else begin
      let d = s + (e - s) / 2 in
      let p = async pool (fun _ -> work s d) in
      let right = work (d+1) e in
      let left = await pool p in
      reduce_fun left right
    end
  in
  work start finish

gives me expected results

nilsbecker commented 3 years ago

also, it seems that the reduce function is required to be associative, as chunks are reduced and combined in undefined order; indeed using the above parallel_for_reduce' with the non-associative let arithmetic_mean a b = (a + b) / 2 instead of (+) again gives results that depend on chunk size. this requirement should be documented clearly.

nilsbecker commented 3 years ago

for parallel_scan i am assuming that it should behave like a cumulative Array.fold_left initialized with the first array element. so i tried this:

let ar = Array.init 10 Fun.id

let sequential op =
  let open Array in
  init (length ar) (fun i ->
      fold_left op ar2.(0) (sub ar2 1 i)) ;;

let scanned op = Task.parallel_scan pool op ar2
[sequential arithmetic_mean; scanned arithmetic_mean; sequential (+); scanned (+)]

the result are the same only for (+).

it also seems that to be able to do the fold-like operation in parallel, parallel_scan needs to essentially do all operations twice; the second time for adding the correct offset to chunks; i assume this is the best one could do to parallelize the sequential fold-list but is a performance concern that should be documented?

it seems a parallel_map over arrays would a simpler, quite natural abstraction?

Sudha247 commented 3 years ago

Hi @nilsbecker, thanks for the bug report! I have submitted a patch #34 attempting to fix the issue in parallel_for_reduce.

also, it seems that the reduce function is required to be associative, as chunks are reduced and combined in undefined order; indeed using the above parallel_for_reduce' with the non-associative let arithmetic_mean a b = (a + b) / 2 instead of (+) again gives results that depend on chunk size. this requirement should be documented clearly.

That's fair. #34 includes a doc update to mention this is the case for parallel_for_reduce and parallel_scan.

it seems a parallel_map over arrays would a simpler, quite natural abstraction?

Indeed, and I believe it would be fairly simple to implement a parallel_map with the parallel_for abstraction. In my head, a primitive version would look something like this:

let parallel_map f l pool =
  let arr = Array.of_list l in
  let res = Array.make (Array.length arr) (f arr.(0)) in
  Domainslib.Task.parallel_for ~start:0 ~finish:(Array.length arr - 1) 
    ~body:(fun i -> res.(i) <- f arr.(i)) pool;
  Array.to_list res
nilsbecker commented 3 years ago

hi, the fix looks reasonable. one could still consider if it should be stated that parallel_scan does twice the work that a sequential fold_left would do.

some thoughts about the proposed parallel_map:

Sudha247 commented 3 years ago

The bug in parallel_for_reduce has been fixed by #34, so closing this issue.

Agree with your points about parallel map, an interface like the one below could be more suitable:

val parallel_map : Domainslib.Task.pool -> ('a -> 'b) -> 'a array -> 'b array

If you think addition of parallel map will be useful on domainslib, please feel free to open a new issue for the same.