Running BlackBoxOptim on a Cluster

pedm commented 6 years ago

Hello,

I am working with BlackBoxOptim on a Slurm cluster, using ClusterManagers to manage the additional processes. Unfortunately, BlackBoxOptim does not run the distance function on the additional nodes.

The distance function is only run on the processes on the same node as the master. Do you know if there is any way to fix this? A small code sample is below, containing everything but the distance function. Thank you!


using ClusterManagers
nnodes = 3
ncores = 16            # cores per node
np = nnodes * ncores   # total number of worker processes requested
@time addprocs(SlurmManager(np), t="00:30:00")

@everywhere using BlackBoxOptim
# distance_fcn and EST_variables are defined elsewhere (omitted here)
opt2 = bbsetup(distance_fcn; Method = :dxnes, SearchRange = collect(values(EST_variables)),
               NumDimensions = 2, MaxFuncEvals = 3000, Workers = workers())
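
Since the distance function itself is omitted above, one thing that may be worth ruling out is whether it is actually defined on every worker, including those on the other nodes. The following is a minimal sketch using only the Distributed standard library. The placeholder `distance_fcn` body and the two local workers are assumptions made so the snippet runs on its own; on the cluster the workers would come from `SlurmManager` instead.

```julia
using Distributed

# Assumption: two local workers stand in for the Slurm-managed ones.
nprocs() == 1 && addprocs(2)

# Hypothetical placeholder for the real distance function.
@everywhere distance_fcn(x) = sum(abs2, x)

# Ask each worker which host it runs on and whether it knows distance_fcn.
status = map(workers()) do w
    host = remotecall_fetch(gethostname, w)
    defined = remotecall_fetch(() -> isdefined(Main, :distance_fcn), w)
    (worker = w, host = host, defined = defined)
end

for s in status
    println("worker $(s.worker) on $(s.host): distance_fcn defined = $(s.defined)")
end
```

If every worker reports `true` and the hosts cover all allocated nodes, the cluster setup itself is probably fine and the question narrows to how evaluations are dispatched.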

robertfeldt commented 6 years ago

Hmm, unfortunately I have very little experience running this with ClusterManagers, but I think you might need to use the preliminary code in one of our PRs to get this working. Maybe @alyst knows more about the current status? We definitely want to get this up and running as soon as we have made the move to 0.7/1.0, so it would be great to have more detailed test cases and feedback from you at that point, @pedm.

robertfeldt commented 6 years ago

Ah, sorry for being terse. PR = pull request. We have two different ones related to parallel evaluation during optimization, see:

https://github.com/robertfeldt/BlackBoxOptim.jl/pull/46

and

https://github.com/robertfeldt/BlackBoxOptim.jl/pull/25

The former (46) is closer to a final stage but still needs some work. We will probably prioritize getting a stable version up for 0.7/1.0 as soon as they are out, but @alyst knows more about the status of these PRs.

JulienPascal commented 4 years ago

Hi,

I also find that BlackBoxOptim does not use all available workers when run on a cluster with ClusterManagers. With the example below, only workers 1 - 14 are used. Any idea why this is the case? Thank you.

using Distributed
using ClusterManagers

addWorkers = true
OnCluster = true

n_nodes = 2
n_cores_per_node = 10
maxNumberWorkers = n_nodes * n_cores_per_node

if addWorkers
    if OnCluster && n_nodes > 1
        println("Multiple nodes: using SlurmManager")
        addprocs(SlurmManager(maxNumberWorkers))
    else
        println("Single node")
        addprocs(maxNumberWorkers)
    end
end

@everywhere using Distributed

# Check the way workers are spread on nodes
# (relevant if on a cluster)
#------------------------------------------
hosts = []
pids = []
for i in workers()
    host, pid = fetch(@spawnat i (gethostname(), getpid()))
    println("Hello I am worker $(i), my host is $(host)")
    push!(hosts, host)
    push!(pids, pid)
end

# check the number of workers:
#----------------------------
currentWorkers = nworkers()
println("Number of workers = $(currentWorkers)")

@everywhere using BlackBoxOptim
@everywhere function slow_rosenbrock(x)
    sleep(0.001) # Fake a slower function to be optimized
    println("I am worker $(myid())")
    println("My host is $(gethostname())")
    return BlackBoxOptim.rosenbrock(x)
end
opt = bboptimize(slow_rosenbrock; Method = :dxnes, SearchRange = (-5.0, 5.0),
                 NumDimensions = 50, MaxFuncEvals = 100000, Workers = workers())

res = best_candidate(opt)
println("Minimizer: $(res)")

println("Best fitness: $(best_fitness(opt))")

julia> versioninfo()
Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU           E5504  @ 2.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, nehalem)
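
To separate a cluster-setup problem from a BlackBoxOptim one, it may help to push the same kind of objective through plain `pmap` and count how many evaluations each worker handles. This is only a diagnostic sketch built on the Distributed standard library, not a description of how BlackBoxOptim schedules work. The function `tagged_objective` and the two local workers are hypothetical stand-ins; on the cluster the workers would be the ones added through `SlurmManager`.

```julia
using Distributed

# Assumption: two local workers stand in for the Slurm-managed ones.
nprocs() == 1 && addprocs(2)
@everywhere using Distributed

# Hypothetical stand-in objective that also reports who evaluated it.
@everywhere function tagged_objective(x)
    sleep(0.01)  # fake a slower function
    return (worker = myid(), host = gethostname(), value = sum(abs2, x))
end

# pmap hands the candidate points out to the available workers.
candidates = [rand(2) for _ in 1:200]
results = pmap(tagged_objective, candidates)

# Count evaluations per worker ID.
counts = Dict{Int,Int}()
for r in results
    counts[r.worker] = get(counts, r.worker, 0) + 1
end

for w in sort(collect(keys(counts)))
    println("worker $w evaluated $(counts[w]) points")
end
println("idle workers: ", setdiff(workers(), keys(counts)))
```

If `pmap` reaches every worker while the optimizer run does not, that points at the evaluation dispatch inside the optimizer rather than at the Slurm allocation.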

robertfeldt / BlackBoxOptim.jl

Running BlackBoxOptim on a Cluster #84