shashi / FileTrees.jl

Parallel computing with a tree of files metaphor
http://shashi.biz/FileTrees.jl

Uncommunicative worker processes? #37

Closed ym-han closed 3 years ago

ym-han commented 3 years ago

I got the following error while running FileTrees with multiple workers on an embarrassingly parallel problem. The code I was running is at https://github.com/ym-han/gigaword_64k/blob/main/src/gigaword_64k.jl and https://github.com/ym-han/gigaword_64k/blob/main/src/afp_mapper.jl. It basically just filters a tree, loads it, processes it, then saves it.

I did not experience any issues when running it with just one node (i.e., without parallelism). One thing I haven't tried yet is running it with fewer workers (I had 5 to 6); the amount of data I was processing wasn't actually that large, so maybe having that many workers was the problem.

module: unloading 'julia/1.5.0'
module: loading 'julia/1.5.0'
 Activating environment at `/gpfs/scratch/yh31/projects/gigaword_64k/Project.toml`
ERROR: LoadError: On worker 2:
TaskFailedException:
peer 7 didn't connect to 2 within 59.99998998641968 seconds
error at ./error.jl:33
wait_for_conn at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:194
check_worker_state at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:168
send_msg_ at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/messages.jl:176
send_msg at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/messages.jl:134 [inlined]
#remotecall_fetch#143 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:389 [inlined]
remotecall_fetch at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:386
#remotecall_fetch#146 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:421
remotecall_fetch at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:421 [inlined]
#52 at /users/yh31/.julia/packages/MemPool/zSuoT/src/datastore.jl:333 [inlined]
forwardkeyerror at /users/yh31/.julia/packages/MemPool/zSuoT/src/datastore.jl:236
poolget at /users/yh31/.julia/packages/MemPool/zSuoT/src/datastore.jl:332
move at /users/yh31/.julia/packages/Dagger/k7zru/src/chunks.jl:88
move at /users/yh31/.julia/packages/Dagger/k7zru/src/chunks.jl:86 [inlined]
move at /users/yh31/.julia/packages/Dagger/k7zru/src/chunks.jl:92
macro expansion at /users/yh31/.julia/packages/Dagger/k7zru/src/sch/Sch.jl:510 [inlined]
#71 at ./task.jl:356
wait at ./task.jl:267 [inlined]
fetch at ./task.jl:282 [inlined]
_broadcast_getindex_evalf at ./broadcast.jl:648 [inlined]
_broadcast_getindex at ./broadcast.jl:621 [inlined]
getindex at ./broadcast.jl:575 [inlined]
copy at ./broadcast.jl:876
materialize at ./broadcast.jl:837 [inlined]
do_task at /users/yh31/.julia/packages/Dagger/k7zru/src/sch/Sch.jl:507
#106 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:294
run_work_thunk at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:79
macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:294 [inlined]
#105 at ./task.jl:356
remotecall_fetch(::Function, ::Distributed.Worker, ::Int64, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:394
remotecall_fetch(::Function, ::Distributed.Worker, ::Int64, ::Vararg{Any,N} where N) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:386
remotecall_fetch(::Function, ::Int64, ::Int64, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:421
remotecall_fetch at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/remotecall.jl:421 [inlined]
macro expansion at /users/yh31/.julia/packages/Dagger/k7zru/src/sch/Sch.jl:541 [inlined]
(::Dagger.Sch.var"#75#76"{Dagger.OSProc,Int64,FileTrees.var"#87#90",Tuple{Dagger.Chunk{JDF.JDFFile{String},MemPool.DRef,Dagger.ThreadProc},Dagger.Chunk{JDF.JDFFile{String},MemPool.DRef,Dagger.ThreadProc}},Channel{Any},Bool,Bool,Bool,Dagger.Sch.ThunkOptions,Array{Int64,1},Dagger.NoOpLog,Dagger.Sch.SchedulerHandle})() at ./task.jl:356
Stacktrace:
 [1] compute_dag(::Dagger.Context, ::Dagger.Thunk; options::Nothing) at /users/yh31/.julia/packages/Dagger/k7zru/src/sch/Sch.jl:208
 [2] compute(::Dagger.Context, ::Dagger.Thunk; options::Nothing) at /users/yh31/.julia/packages/Dagger/k7zru/src/compute.jl:31
 [3] compute at /users/yh31/.julia/packages/Dagger/k7zru/src/compute.jl:28 [inlined]
 [4] exec(::Dagger.Context, ::Dagger.Thunk) at /users/yh31/.julia/packages/FileTrees/ZfGJB/src/parallelism.jl:75
 [5] exec(::Dagger.Thunk) at /users/yh31/.julia/packages/FileTrees/ZfGJB/src/parallelism.jl:64
 [6] save(::gigaword_64k.var"#22#34", ::FileTrees.FileTree; lazy::Nothing, exec::Bool) at /users/yh31/.julia/packages/FileTrees/ZfGJB/src/values.jl:128
 [7] save at /users/yh31/.julia/packages/FileTrees/ZfGJB/src/values.jl:111 [inlined]
 [8] process_part_of_tree(::String, ::String, ::Int64) at /gpfs/scratch/yh31/projects/gigaword_64k/src/gigaword_64k.jl:138
 [9] top-level scope at /users/yh31/scratch/projects/gigaword_64k/src/afp_mapper.jl:29
 [10] include(::Function, ::Module, ::String) at ./Base.jl:380
 [11] include(::Module, ::String) at ./Base.jl:368
 [12] exec_options(::Base.JLOptions) at ./client.jl:296
 [13] _start() at ./client.jl:506
in expression starting at /users/yh31/scratch/projects/gigaword_64k/src/afp_mapper.jl:29
┌ Warning: Forcibly interrupting busy workers
│   exception = rmprocs: pids [6] not terminated after 5.0 seconds.
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:1234
┌ Warning: rmprocs: process 1 not removed
└ @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.5/Distributed/src/cluster.jl:1030

DrChainsaw commented 3 years ago

Workers not connecting does not sound like a FileTrees issue.

Have you tried running something on the workers without using FileTrees or Dagger, for example pmap(i -> (sleep(1); myid()), 1:20), just to check that the workers are really set up correctly?
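
A slightly fuller version of that check might look like the sketch below. It uses only the Distributed standard library. Two things in it are assumptions rather than anything from your setup: the `addprocs(2)` call (on a cluster you would start workers with whatever launcher you normally use) and the shorter sleep. Since the error says one peer could not connect to another, the second half also asks every worker to contact every other worker directly, which a plain pmap from the master process may not exercise.

```julia
using Distributed

# Assumption: start two local workers if none exist yet. On a cluster,
# replace this with the launcher you normally use.
nprocs() == 1 && addprocs(2)

# Make sure the Distributed names are available on every process.
@everywhere using Distributed

# Master-to-worker check: each task sleeps briefly and reports the id of
# the worker that ran it.
ids = pmap(i -> (sleep(0.1); myid()), 1:20)
println("workers that ran tasks: ", sort(unique(ids)))

# Worker-to-worker check: ask each worker to fetch myid() from every
# worker. This forces direct connections between peers.
peer_results = Dict{Int,Vector{Int}}()
for w in workers()
    peer_results[w] = remotecall_fetch(w, workers()) do ws
        [remotecall_fetch(myid, p) for p in ws]
    end
    println("worker ", w, " reached peers ", peer_results[w])
end
```

If the first part works but the second part hangs or times out, that would point at the connections between workers (network or firewall configuration, for instance) rather than at FileTrees or Dagger.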

ym-han commented 3 years ago

Thanks for pointing that out. I hadn't done that (I'm not very familiar with the Julia distributed computing stack). I'll close this for now.