slimgroup / JUDI.jl

Julia Devito inversion.
https://slimgroup.github.io/JUDI.jl
MIT License

UNHANDLED TASK ERROR #105

Closed ziyiyin97 closed 2 years ago

ziyiyin97 commented 2 years ago

A minimal failing example (MFE) is below.


using JUDI, JLD2

# Set up model structure
n = (600, 600)   # (x,y,z) or (x,z)
d = (10., 10.)
o = (0., 0.)

# Velocity [km/s]
v = ones(Float32,n) .+ 0.5f0
v0 = ones(Float32,n) .+ 0.5f0
v[:,Int(round(end/2)):end] .= 3.5f0
rho = (v0 .+ .5f0) ./ 2

# Slowness squared [s^2/km^2]
m = (1f0 ./ v).^2
m0 = (1f0 ./ v0).^2
dm = vec(m - m0)

# Setup info and model structure
nsrc = 4    # number of sources
model = Model(n, d, o, m)
model0 = Model(n, d, o, m0)

# Set up receiver geometry
nxrec = 120
xrec = range(50f0, stop=1150f0, length=nxrec)
yrec = 0f0
zrec = range(50f0, stop=50f0, length=nxrec)

# receiver sampling and recording time
timeR = 1000f0   # receiver recording time [ms]
dtR = 2f0    # receiver sampling interval [ms]

# Set up receiver structure
recGeometry = Geometry(xrec, yrec, zrec; dt=dtR, t=timeR, nsrc=nsrc)

# Set up source geometry (cell array with source locations for each shot)
xsrc = convertToCell(range(400f0, stop=800f0, length=nsrc))
ysrc = convertToCell(range(0f0, stop=0f0, length=nsrc))
zsrc = convertToCell(range(200f0, stop=200f0, length=nsrc))

# source sampling and number of time steps
timeS = 1000f0  # ms
dtS = 2f0   # ms

# Set up source structure
srcGeometry = Geometry(xsrc, ysrc, zsrc; dt=dtS, t=timeS)

# setup wavelet
f0 = 0.01f0     # kHz
wavelet = ricker_wavelet(timeS, dtS, f0)
q = judiVector(srcGeometry, wavelet)

# Set up info structure for linear operators
ntComp = get_computational_nt(srcGeometry, recGeometry, model)
info = Info(prod(n), nsrc, ntComp)

###################################################################################################

# Modeling options: isic=true enables the linearized inverse scattering imaging condition
opt = Options(isic=true)

# Setup operators
Pr = judiProjection(info, recGeometry)
F = judiModeling(info, model; options=opt)
F0 = judiModeling(info, model0; options=opt)
Ps = judiProjection(info, srcGeometry)
J = judiJacobian(Pr*F0*adjoint(Ps), q)

# Nonlinear modeling
# dobs = Pr*F*adjoint(Ps)*q
JLD2.@load "dobs.jld2" dobs

# evaluate LSRTM objective function
fj, gj = lsrtm_objective(model0, [q, q], [dobs, dobs], [dm, dm]; nlind=false, options=opt)

I ran julia -p 2 -L ~/startup.jl, with the following startup file:

# System information

nthreads = parse(Int, ENV["OMP_NUM_THREADS"])
if nthreads*nworkers() > 8
    println("WARNING: allocating more resources than available ($(nthreads) threads and $(nworkers()) workers with only 8 CPUs)")
end

threadlist=[0, 4, 1, 5, 2, 6, 3, 7]

k = myid()%nworkers()
ft = k*nthreads+1
lt = ft+nthreads-1
threads = threadlist[ft:lt]

ENV["OMP_DISPLAY_AFFINITY"]="true"
ENV["GOMP_CPU_AFFINITY"]=join(threads, " ")
println("using threads $(threads)")
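
To make the index arithmetic in this startup file easier to follow, here is a small standalone sketch. The function name pinned_threads is hypothetical (it is not part of the startup file or of JUDI), and it assumes, as in the run reported in this issue, 2 workers and OMP_NUM_THREADS=4.

```julia
# Hypothetical helper mirroring the startup file's arithmetic:
# k = myid() % nworkers(), then a contiguous slice of threadlist.
function pinned_threads(id::Integer, nworkers::Integer, nthreads::Integer,
                        threadlist::AbstractVector{<:Integer})
    k = id % nworkers          # slot index for this process
    ft = k * nthreads + 1      # first index into threadlist
    lt = ft + nthreads - 1     # last index into threadlist
    return threadlist[ft:lt]
end

threadlist = [0, 4, 1, 5, 2, 6, 3, 7]

# Process 1 is the master; processes 2 and 3 are the workers started by -p 2.
for id in 1:3
    println("process $id -> threads ", pinned_threads(id, 2, 4, threadlist))
end
```

With these values the master process (id 1) and worker 3 both map to [2, 6, 3, 7], while worker 2 maps to [0, 4, 1, 5], which is consistent with the "using threads" lines in the terminal log.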

The terminal log is:

[zyin62@eas-coda-fherr08 scripts]$ julia -p 2 -L ~/startup.jl 
using threads [2, 6, 3, 7]
      From worker 2:    using threads [0, 4, 1, 5]
      From worker 3:    using threads [2, 6, 3, 7]
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.7.1 (2021-12-22)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

(@v1.7) pkg> status JUDI
      Status `~/.julia/environments/v1.7/Project.toml`
  [f3b833dc] JUDI v2.6.8 `~/.julia/dev/JUDI`

julia> include("MFE.jl")
      From worker 2:    ┌ Warning: `vendor()` is deprecated, use `BLAS.get_config()` and inspect the output instead
      From worker 2:    │   caller = npyinitialize() at numpy.jl:67
      From worker 2:    └ @ PyCall ~/.julia/packages/PyCall/L0fLP/src/numpy.jl:67
      From worker 3:    ┌ Warning: `vendor()` is deprecated, use `BLAS.get_config()` and inspect the output instead
      From worker 3:    │   caller = npyinitialize() at numpy.jl:67
      From worker 3:    └ @ PyCall ~/.julia/packages/PyCall/L0fLP/src/numpy.jl:67
      From worker 2:    level 1 thread 0x7fbf3975e740 affinity 0,4
      From worker 2:    level 1 thread 0x7fbeeb9ce700 affinity 1,5
      From worker 2:    level 1 thread 0x7fbef41cf700 affinity 2,6
      From worker 2:    level 1 thread 0x7fbef49d0700 affinity 3,7
      From worker 3:    level 1 thread 0x7f65a8946740 affinity 0,4
      From worker 3:    level 1 thread 0x7f656c98d700 affinity 1,5
      From worker 3:    level 1 thread 0x7f656d18e700 affinity 2,6
      From worker 3:    level 1 thread 0x7f656d98f700 affinity 3,7
      From worker 2:    Operator `born` ran in 0.64 s
      From worker 3:    Operator `born` ran in 0.63 s
      From worker 2:    Operator `gradient` ran in 0.31 s
      From worker 3:    Operator `gradient` ran in 0.31 s
      From worker 2:    Operator `born` ran in 0.70 s
      From worker 3:    Operator `born` ran in 0.70 s
      From worker 2:    Operator `gradient` ran in 0.39 s
      From worker 3:    Operator `gradient` ran in 0.40 s
      From worker 2:    Operator `born` ran in 0.37 s
      From worker 2:    Operator `gradient` ran in 0.31 s
      From worker 3:    Operator `born` ran in 0.49 s
      From worker 3:    Operator `gradient` ran in 0.26 s
      From worker 2:    Operator `born` ran in 0.37 s
      From worker 2:    Operator `gradient` ran in 0.53 s
      From worker 3:    Operator `born` ran in 0.65 s
      From worker 3:    Operator `gradient` ran in 0.23 s
      From worker 2:    UNHANDLED TASK ERROR: On worker 1:
      From worker 2:    On worker 3:
      From worker 2:    peer 2 has not connected to 3
      From worker 2:    Stacktrace:
      From worker 2:     [1] error
      From worker 2:       @ ./error.jl:33
      From worker 2:     [2] wait_for_conn
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:192
      From worker 2:     [3] #23
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:180
      From worker 2:     [4] exec_conn_func
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:181
      From worker 2:     [5] exec_conn_func
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:175
      From worker 2:     [6] #106
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:278
      From worker 2:     [7] run_work_thunk
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:63
      From worker 2:     [8] macro expansion
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:278 [inlined]
      From worker 2:     [9] #105
      From worker 2:       @ ./task.jl:423
      From worker 2:    Stacktrace:
      From worker 2:     [1] #remotecall_fetch#155
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:469
      From worker 2:     [2] remotecall_fetch
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:461 [inlined]
      From worker 2:     [3] #remotecall_fetch#158
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:496 [inlined]
      From worker 2:     [4] remotecall_fetch
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:496 [inlined]
      From worker 2:     [5] #19
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:167
      From worker 2:     [6] #106
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:278
      From worker 2:     [7] run_work_thunk
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:63
      From worker 2:     [8] macro expansion
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:278 [inlined]
      From worker 2:     [9] #105
      From worker 2:       @ ./task.jl:423
      From worker 2:    Stacktrace:
      From worker 2:     [1] remotecall_fetch(::Function, ::Distributed.Worker, ::Int64, ::Vararg{Int64}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
      From worker 2:       @ Distributed ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:469
      From worker 2:     [2] remotecall_fetch
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:461 [inlined]
      From worker 2:     [3] #remotecall_fetch#158
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:496 [inlined]
      From worker 2:     [4] remotecall_fetch
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:496 [inlined]
      From worker 2:     [5] (::Distributed.var"#18#21"{Distributed.Worker})()
      From worker 2:       @ Distributed ./task.jl:423
      From worker 2:    UNHANDLED TASK ERROR: On worker 1:
      From worker 2:    On worker 3:
      From worker 2:    peer 2 has not connected to 3
      From worker 2:    Stacktrace:
      From worker 2:     [1] #24
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:183
      From worker 2:     [2] exec_conn_func
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:181
      From worker 2:     [3] exec_conn_func
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:175
      From worker 2:     [4] #106
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:278
      From worker 2:     [5] run_work_thunk
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:63
      From worker 2:     [6] macro expansion
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:278 [inlined]
      From worker 2:     [7] #105
      From worker 2:       @ ./task.jl:423
      From worker 2:    Stacktrace:
      From worker 2:     [1] #remotecall_fetch#155
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:469
      From worker 2:     [2] remotecall_fetch
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:461 [inlined]
      From worker 2:     [3] #remotecall_fetch#158
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:496 [inlined]
      From worker 2:     [4] remotecall_fetch
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:496 [inlined]
      From worker 2:     [5] #19
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:167
      From worker 2:     [6] #106
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:278
      From worker 2:     [7] run_work_thunk
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:63
      From worker 2:     [8] macro expansion
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:278 [inlined]
      From worker 2:     [9] #105
      From worker 2:       @ ./task.jl:423
      From worker 2:    Stacktrace:
      From worker 2:     [1] remotecall_fetch(::Function, ::Distributed.Worker, ::Int64, ::Vararg{Int64}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
      From worker 2:       @ Distributed ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:469
      From worker 2:     [2] remotecall_fetch
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:461 [inlined]
      From worker 2:     [3] #remotecall_fetch#158
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:496 [inlined]
      From worker 2:     [4] remotecall_fetch
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:496 [inlined]
      From worker 2:     [5] (::Distributed.var"#18#21"{Distributed.Worker})()
      From worker 2:       @ Distributed ./task.jl:423
      From worker 2:    UNHANDLED TASK ERROR: On worker 1:
      From worker 2:    On worker 3:
      From worker 2:    peer 2 has not connected to 3
      From worker 2:    Stacktrace:
      From worker 2:     [1] #24
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:183
      From worker 2:     [2] exec_conn_func
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:181
      From worker 2:     [3] exec_conn_func
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:175
      From worker 2:     [4] #106
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:278
      From worker 2:     [5] run_work_thunk
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:63
      From worker 2:     [6] macro expansion
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:278 [inlined]
      From worker 2:     [7] #105
      From worker 2:       @ ./task.jl:423
      From worker 2:    Stacktrace:
      From worker 2:     [1] #remotecall_fetch#155
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:469
      From worker 2:     [2] remotecall_fetch
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:461 [inlined]
      From worker 2:     [3] #remotecall_fetch#158
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:496 [inlined]
      From worker 2:     [4] remotecall_fetch
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:496 [inlined]
      From worker 2:     [5] #19
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/cluster.jl:167
      From worker 2:     [6] #106
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:278
      From worker 2:     [7] run_work_thunk
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:63
      From worker 2:     [8] macro expansion
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:278 [inlined]
      From worker 2:     [9] #105
      From worker 2:       @ ./task.jl:423
      From worker 2:    Stacktrace:
      From worker 2:     [1] remotecall_fetch(::Function, ::Distributed.Worker, ::Int64, ::Vararg{Int64}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
      From worker 2:       @ Distributed ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:469
      From worker 2:     [2] remotecall_fetch
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:461 [inlined]
      From worker 2:     [3] #remotecall_fetch#158
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:496 [inlined]
      From worker 2:     [4] remotecall_fetch
      From worker 2:       @ ~/GATechBundle/Julia/julia-1.7.1/share/julia/stdlib/v1.7/Distributed/src/remotecall.jl:496 [inlined]
      From worker 2:     [5] (::Distributed.var"#18#21"{Distributed.Worker})()
      From worker 2:       @ Distributed ./task.jl:423
(576634.4f0, PhysicalParameter{Float32}[[-14.895155, -34.88797, -52.451332, -54.490524, -35.469727, -1.2343807, 37.44786, 72.84346, 102.36653, 126.52875  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [-14.895113, -34.887924, -52.451294, -54.490498, -35.46977, -1.2344112, 37.447815, 72.8434, 102.36653, 126.528915  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])

Note that the error does not occur when the model size n is small, nor when I first generate dobs by forward modelling and then call lsrtm_objective.

Any idea what causes the error?

mloubout commented 2 years ago

Can you try not using -p, and instead use addprocs with the lazy communication setup disabled?

ziyiyin97 commented 2 years ago

Adding

using Distributed
addprocs(2; lazy=false)

solves the problem indeed. But ... why?

mloubout commented 2 years ago

This is not actually an error; you still get your result properly. It is just telling you, and the log is quite explicit, that the communication between two workers hasn't been initialized. So it complains about it, then initializes it and continues doing what it's supposed to. I also mentioned this in the Julia issue you commented on, so please avoid duplicating issues when one has already been answered.

ziyiyin97 commented 2 years ago

But can I still do thread pinning after addprocs(2; lazy=true)?

mloubout commented 2 years ago

Yes, check the FWI tutorial https://github.com/slimgroup/ConstrainedFWIExamples/blob/master/notebooks/01_constr_fwi_judi.ipynb
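
For readers who do not want to open the notebook, one possible way to combine addprocs(...; lazy=false) with pinning is sketched below. This is not necessarily what the tutorial does. It assumes that locally launched workers inherit the environment of the launching process at the moment addprocs is called; the helper names affinity_string and add_pinned_workers are hypothetical.

```julia
using Distributed

# Hypothetical helper: GOMP_CPU_AFFINITY string for the i-th added worker (1-based).
function affinity_string(i::Integer, nthreads::Integer,
                         threadlist::AbstractVector{<:Integer})
    first_idx = (i - 1) * nthreads + 1
    return join(threadlist[first_idx:first_idx + nthreads - 1], " ")
end

# Hypothetical helper: add n local workers one at a time, each launched with its
# own OpenMP settings. Assumption: the worker process inherits ENV as set inside
# the withenv block. lazy=false sets up worker-to-worker connections eagerly.
function add_pinned_workers(n::Integer, nthreads::Integer,
                            threadlist::AbstractVector{<:Integer})
    pids = Int[]
    for i in 1:n
        withenv("OMP_NUM_THREADS" => string(nthreads),
                "GOMP_CPU_AFFINITY" => affinity_string(i, nthreads, threadlist)) do
            append!(pids, addprocs(1; lazy=false))
        end
    end
    return pids
end
```

Calling add_pinned_workers(2, 4, [0, 4, 1, 5, 2, 6, 3, 7]) in a session started without -p would then launch two workers with affinity strings "0 4 1 5" and "2 6 3 7". The lazy setting has to be the same in every addprocs call of a session, which is why lazy=false is passed each time.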

ziyiyin97 commented 2 years ago

Thanks. Closing, since there is an existing issue here: https://github.com/JuliaLang/julia/issues/43627