slimgroup / JUDI.jl

Julia Devito inversion.
https://slimgroup.github.io/JUDI.jl
MIT License
Sudden `segfault` when calculating FWI on the cloud #214

Closed kerim371 closed 8 months ago

kerim371 commented 8 months ago

Hi,

I do calculations on the cloud (master node and 4 computational nodes, standard SSH cluster manager, CentOS 7).

Since yesterday I have been getting a segmentation fault. For about a week before that, I had not encountered this error:

      From worker 3:    Operator `forward` ran in 8.50 s
      From worker 5:    Operator `forward` ran in 8.55 s
      From worker 2:    Operator `forward` ran in 8.02 s
      From worker 4:    Operator `forward` ran in 8.30 s
      From worker 4:
      From worker 4:    [9258] signal (11.1): Segmentation fault
      From worker 4:    in expression starting at none:0
      From worker 4:    sgemm_itcopy_SKYLAKEX at /home/kerim/shared_app/julia/julia-1.9.3/bin/../lib/julia/libopenblas64_.so (unknown line)
      From worker 4:    sgemm_nn at /home/kerim/shared_app/julia/julia-1.9.3/bin/../lib/julia/libopenblas64_.so (unknown line)
      From worker 4:    sgemm_64_ at /home/kerim/shared_app/julia/julia-1.9.3/bin/../lib/julia/libopenblas64_.so (unknown line)
      From worker 4:    gemm! at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/usr/share/julia/stdlib/v1.9/LinearAlgebra/src/blas.jl:1524
      From worker 4:    gemm_wrapper! at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/usr/share/julia/stdlib/v1.9/LinearAlgebra/src/matmul.jl:674
      From worker 4:    mul! at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/usr/share/julia/stdlib/v1.9/LinearAlgebra/src/matmul.jl:161 [inlined]
      From worker 4:    mul! at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/usr/share/julia/stdlib/v1.9/LinearAlgebra/src/matmul.jl:276 [inlined]
      From worker 4:    * at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/usr/share/julia/stdlib/v1.9/LinearAlgebra/src/matmul.jl:148 [inlined]
      From worker 4:    SincInterpolation at /home/kerim/.julia/packages/JUDI/JEsVr/src/TimeModeling/Utils/auxiliaryFunctions.jl:553
      From worker 4:    macro expansion at /home/kerim/.julia/packages/JUDI/JEsVr/src/TimeModeling/Utils/auxiliaryFunctions.jl:527 [inlined]
      From worker 4:    macro expansion at ./timing.jl:393 [inlined]
      From worker 4:    macro expansion at /home/kerim/.julia/packages/JUDI/JEsVr/src/JUDI.jl:141 [inlined]
      From worker 4:    time_resample at /home/kerim/.julia/packages/JUDI/JEsVr/src/TimeModeling/Utils/auxiliaryFunctions.jl:523
      From worker 4:    time_resample at /home/kerim/.julia/packages/JUDI/JEsVr/src/TimeModeling/Utils/auxiliaryFunctions.jl:547 [inlined]
      From worker 4:    post_process at /home/kerim/.julia/packages/JUDI/JEsVr/src/TimeModeling/Modeling/time_modeling_serial.jl:61
      From worker 4:    unknown function (ip: 0x7f6a7d452b42)
      From worker 4:    _jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
      From worker 4:    ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
      From worker 4:    time_modeling at /home/kerim/.julia/packages/JUDI/JEsVr/src/TimeModeling/Modeling/time_modeling_serial.jl:52
      From worker 4:    unknown function (ip: 0x7f6adaf3c1d8)
      From worker 4:    _jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
      From worker 4:    ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
      From worker 4:    propagate at /home/kerim/.julia/packages/JUDI/JEsVr/src/TimeModeling/Modeling/propagation.jl:9
      From worker 4:    unknown function (ip: 0x7f6adaf306d6)
      From worker 4:    _jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
      From worker 4:    ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
      From worker 4:    jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
      From worker 4:    jl_f__call_latest at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/builtins.c:774
      From worker 4:    _jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
      From worker 4:    ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
      From worker 4:    jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
      From worker 4:    do_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/builtins.c:730
      From worker 4:    #invokelatest#2 at ./essentials.jl:819
      From worker 4:    _jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
      From worker 4:    ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
      From worker 4:    jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
      From worker 4:    do_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/builtins.c:730
      From worker 4:    invokelatest at ./essentials.jl:816
      From worker 4:    _jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
      From worker 4:    ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
      From worker 4:    jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
      From worker 4:    do_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/builtins.c:730
      From worker 4:    #107 at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/usr/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:281
      From worker 4:    run_work_thunk at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/usr/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:70
      From worker 4:    unknown function (ip: 0x7f6adaf2e8b9)
      From worker 4:    _jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
      From worker 4:    ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
      From worker 4:    run_work_thunk at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/usr/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:79
      From worker 4:    #100 at ./task.jl:514
      From worker 4:    unknown function (ip: 0x7f6adaf2e47f)
      From worker 4:    _jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
      From worker 4:    ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
      From worker 4:    jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
      From worker 4:    start_task at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/task.c:1092
      From worker 4:    Allocations: 83909332 (Pool: 83868309; Big: 41023); GC: 241
[ Info: Line search failed
         2          5          5          9     0.00000e+00     3.05662e-01     2.81569e+06     1.25000e-01
Step size: 0.00e+00 below progTol: 1.00e-10
Worker 4 terminated.
      From worker 3:    Operator `forward` ran in 8.10 s
Unhandled Task ERROR: IOError: read: connection reset by peer (ECONNRESET)
Stacktrace:
  [1] wait_readnb(x::Sockets.TCPSocket, nb::Int64)
    @ Base ./stream.jl:410
  [2] (::Base.var"#wait_locked#715")(s::Sockets.TCPSocket, buf::IOBuffer, nb::Int64)
    @ Base ./stream.jl:949
  [3] unsafe_read(s::Sockets.TCPSocket, p::Ptr{UInt8}, nb::UInt64)
    @ Base ./stream.jl:955
  [4] unsafe_read
    @ ./io.jl:761 [inlined]
  [5] unsafe_read(s::Sockets.TCPSocket, p::Base.RefValue{NTuple{4, Int64}}, n::Int64)
    @ Base ./io.jl:760
  [6] read!
    @ ./io.jl:762 [inlined]
  [7] deserialize_hdr_raw
    @ ~/shared_app/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/messages.jl:167 [inlined]
  [8] message_handler_loop(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed ~/shared_app/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:172
  [9] process_tcp_streams(r_stream::Sockets.TCPSocket, w_stream::Sockets.TCPSocket, incoming::Bool)
    @ Distributed ~/shared_app/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:133
 [10] (::Distributed.var"#103#104"{Sockets.TCPSocket, Sockets.TCPSocket, Bool})()
    @ Distributed ./task.jl:514
      From worker 5:    Operator `forward` ran in 8.33 s

I used to have Julia LTS 1.6.7, but today, after this problem started annoying me, I updated Julia using the following commands:

using Pkg
Pkg.add("UpdateJulia")  
using UpdateJulia
update_julia()

and now the Julia version is 1.9.3, but the problem still appears. I can get this error at iteration 9 or 4, or probably at any other iteration.

I understand that the problem is unlikely to be related to JUDI itself, but maybe you have already seen it?

mloubout commented 8 months ago

It seems to happen randomly because of BLAS multithreading when there is more than one Julia worker on the same node.
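
To check whether several workers really share a node, and how many BLAS threads each one asks for, a small diagnostic along these lines may help. It is only a sketch using the standard `Distributed` and `LinearAlgebra` libraries: the two local workers from `addprocs(2)` are an assumption standing in for the SSH cluster, and the names `report` and `blas_per_host` are made up for illustration.

```julia
using Distributed
using LinearAlgebra

# Two local workers, only for illustration; the real setup uses SSH workers.
addprocs(2)
@everywhere using LinearAlgebra

# Ask every worker for its host name and its current BLAS thread count.
report = [(id = p,
           host = remotecall_fetch(gethostname, p),
           blas = remotecall_fetch(BLAS.get_num_threads, p)) for p in workers()]

# Sum the BLAS threads requested on each host.
blas_per_host = Dict{String,Int}()
for r in report
    blas_per_host[r.host] = get(blas_per_host, r.host, 0) + r.blas
end

for (host, n) in blas_per_host
    println(host, ": ", n, " BLAS threads requested in total by its workers")
end
```

If a host shows more requested threads than it has cores, the workers on that host are competing for the same cores.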

mloubout commented 8 months ago

We should maybe disable it, but without multithreading the data time interpolation is a bit slow sometimes.

kerim371 commented 8 months ago

Thank you! Disabling it with BLAS.set_num_threads(1) should probably work. I will try it.

kerim371 commented 8 months ago

I just tried to run FWI with preliminary settings in startup.jl:

@info "STARTUP SCRIPT: $(@__FILE__ )"

using LinearAlgebra
BLAS.set_num_threads(1) 

ENV["DEVITO_LANGUAGE"]="openmp"
ENV["OMP_NUM_THREADS"]=length(Sys.cpu_info())
ENV["DEVITO_LOGGING"]="INFO"

@info "Number of BLAS threads: $(BLAS.get_num_threads())"
@info "DEVITO_LANGUAGE: $(ENV["DEVITO_LANGUAGE"])"
@info "OMP_NUM_THREADS: $(ENV["OMP_NUM_THREADS"])"
@info "DEVITO_LOGGING: $(ENV["DEVITO_LOGGING"])"

and this didn't help: I got the same segfault error at the 9th FWI iteration.

mloubout commented 8 months ago

This won't help: JUDI sets the number of BLAS threads in its init, so you need to set it to 1 after `using JUDI`.

https://github.com/slimgroup/JUDI.jl/blob/83752f147e175608d98b2ffe2d4778d58281c1f1/src/JUDI.jl#L203
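
The ordering matters on every process, not only on the master. A minimal sketch of that ordering, assuming two local workers in place of the SSH cluster and leaving `using JUDI` as a comment so the snippet runs with the standard library alone (`blas_threads` is an illustrative name):

```julia
using Distributed
using LinearAlgebra

# Two local workers, only for illustration.
addprocs(2)

# Load packages on every process first. In the real setup,
# `@everywhere using JUDI` would go here, because JUDI sets the
# BLAS thread count when it is loaded.
@everywhere using LinearAlgebra

# Only now force single-threaded BLAS on the master and on every worker.
@everywhere BLAS.set_num_threads(1)

# Read the value back from each process to confirm it took effect.
blas_threads = Dict(p => remotecall_fetch(BLAS.get_num_threads, p) for p in procs())
@info "BLAS threads per process" blas_threads
```

Reading the value back is the useful part: if any process reports more than 1, something reset the thread count after the call.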

kerim371 commented 8 months ago

Didn't know that! Thank you!

kerim371 commented 8 months ago

@mloubout it is strange, but sometimes @everywhere BLAS.set_num_threads(1) after using JUDI works and sometimes it does not. Now I'm trying to use only 3 of the 4 computational cores on each node:

addprocs(["user@10.128.0.33",
          "user@10.128.0.23",
          "user@10.128.0.20",
          "user@10.128.0.22"], 
          env=["DEVITO_LANGUAGE"=>"openmp", "OMP_NUM_THREADS"=>"3", "DEVITO_LOGGING"=>"INFO"])

I hope this helps.
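
Whether the `env` settings actually reached the workers can be checked by reading them back. This is a sketch under the assumption that two local workers behave like the SSH ones for this purpose; `omp_settings` is an illustrative name.

```julia
using Distributed

# Hypothetical local stand-in for the SSH machine list above.
addprocs(2; env = ["DEVITO_LANGUAGE" => "openmp", "OMP_NUM_THREADS" => "3"])

# Read the variable back on each worker; "unset" marks a worker that never received it.
omp_settings = Dict(p => remotecall_fetch(() -> get(ENV, "OMP_NUM_THREADS", "unset"), p)
                    for p in workers())
@info "OMP_NUM_THREADS per worker" omp_settings
```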

kerim371 commented 8 months ago

It helped at first, but in the end it didn't help...

kerim371 commented 8 months ago

The Julia community's thoughts on this (for future reference): https://github.com/JuliaLang/julia/issues/52154

mloubout commented 8 months ago

Thanks for raising it there, and for the update.