libprima / PRIMA.jl

a Julia interface to PRIMA, a Reference Implementation for Powell's methods with Modernization and Amelioration
MIT License

Problems with Linux (Arch) #25

Open cvalencia09 opened 4 months ago

cvalencia09 commented 4 months ago

It seems that the Julia package has a problem when running on a Linux machine. I get a StackOverflowError after running the newuoa algorithm; this problem doesn't occur on my Windows partition. On Windows I have the Intel Fortran compiler, on Linux just the gcc compiler. Which libraries are required to run the package on Linux?

Regards,

amontoison commented 4 months ago

When you use PRIMA.jl, a precompiled version of PRIMA is provided with the artifact PRIMA_jll.jl so you don't need any compiler to use this Julia interface. Can you run the Julia tests with the following commands and provide the error(s) that you encounter?

julia> ]
pkg> test PRIMA

soldasim commented 3 months ago

Hello, I have also encountered the StackOverflowError on various Linux devices.

I have tested the issue on multiple devices; the information I could gather is below.

Unfortunately, the error message does not provide any information (not even a stack trace), so this may be difficult to debug. I will try to provide as much information as I can.

MWE

Consider the following MWE:

using PRIMA

function prima_serial(; tasks=1)
    obj = (x) -> abs(5. - x[1])
    start = [0.]

    results = [newuoa(obj, start)[1] for _ in 1:tasks]
end

function prima_parallel(; tasks=1)
    obj = (x) -> abs(5. - x[1])
    start = [0.]

    tasks = [Threads.@spawn newuoa(obj, start)[1] for _ in 1:tasks]
    results = fetch.(tasks)
end

Note that I spawn only a single task in the parallel version, so multiple PRIMA instances never actually run in parallel. Somehow it still causes errors on some devices.

Test Results

The following table summarizes the results of running the two functions prima_serial and prima_parallel from the MWE on various devices that I have access to:

| Device | prima_serial | prima_parallel |
| --- | --- | --- |
| PC-1 (Win) | :white_check_mark: | :white_check_mark: |
| PC-2 (Win) | :white_check_mark: | :white_check_mark: |
| PC-2 (Linux) | :white_check_mark: | StackOverflowError |
| PC-2 (Linux) [VSCode REPL] | StackOverflowError | StackOverflowError |
| Cluster-1 (Linux) | :white_check_mark: | StackOverflowError |
| Cluster-2 (Linux) | :white_check_mark: | StackOverflowError |

(Note that PC-2 (Win) and PC-2 (Linux) are the exact same computer with dual boot.)

Correct output:

julia> prima_serial()
1-element Vector{Vector{Float64}}:
 [5.0000000000000115]

julia> 

Error message for prima_serial:

julia> prima_serial()
ERROR: StackOverflowError:

julia> 

Error message for prima_parallel:

julia> prima_parallel()
ERROR: TaskFailedException
Stacktrace:
  [1] wait
    @ ./task.jl:352 [inlined]
  [2] fetch
    @ ./task.jl:372 [inlined]
  [3] _broadcast_getindex_evalf
    @ ./broadcast.jl:709 [inlined]
  [4] _broadcast_getindex
    @ ./broadcast.jl:682 [inlined]
  [5] getindex
    @ ./broadcast.jl:636 [inlined]
  [6] copy
    @ ./broadcast.jl:942 [inlined]
  [7] materialize
    @ ./broadcast.jl:903 [inlined]
  [8] prima_parallel(; tasks::Int64)
    @ Main ~/julia-sandbox/prima_parallel/test.jl:15
  [9] prima_parallel()
    @ Main ~/julia-sandbox/prima_parallel/test.jl:10
 [10] top-level scope
    @ REPL[3]:1

    nested task error: StackOverflowError:

julia> 

Device Specifications

PC-1 and PC-2 are my personal computers. PC-2 has both Windows and Linux on dualboot. Cluster-1 and Cluster-2 are academic clusters that I have access to. The information below contains specs of both the "login" and "work" nodes from the clusters. I've tested the MWE on both the login and work nodes and the behavior does not differ between them.

PC-1

OS: Microsoft Windows 10 Pro
Version: 10.0.19045 Build 19045
System Type: x64-based PC
Processor: Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz, 3401 Mhz, 4 Core(s), 4 Logical Processor(s)

Julia version: 1.10.2

PC-2 (Windows)

OS: Microsoft Windows 10 Home
Version: 10.0.19045 Build 19045
System Type: x64-based PC
Processor: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz, 2001 Mhz, 4 Core(s), 8 Logical Processor(s)

Julia version: 1.10.2

PC-2 (Linux)

OS: Ubuntu 20.04.6 LTS
Kernel: Linux 5.4.0-171-generic
Architecture: x86-64
Processor: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz

Julia version: 1.10.2

Cluster-1

Login Node:
OS: CentOS Linux 7 (Core)
Kernel: Linux 3.10.0-1127.13.1.el7.x86_64
Architecture: x86-64
Processor: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz

Work Node:
Kernel: Linux 4.18.0-425.13.1.el8_7.x86_64
Architecture: x86-64
Processor: Intel(R) Xeon(R) Gold 6136 CPU @ 3.00GHz

Julia version: 1.10.0

Cluster-2

Login Node:
OS: Ubuntu 20.04.6 LTS
Kernel: Linux 5.15.0-94-generic
Architecture: x86-64
Processor: Common KVM processor

Work Node:
OS: Ubuntu 22.04.3 LTS
Kernel: Linux 5.15.0-91-generic
Architecture: x86-64
Processor: Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz

Julia version: 1.10.2

Let me know if I can help with any additional information or testing. :)

EDIT: Added PC-2 (Linux) VSCode and "non-VSCode" versions to the test result table.

soldasim commented 3 months ago

I have run ] test PRIMA on all of the devices mentioned above and all tests succeed on all devices.

Test Summary: | Pass  Total   Time
PRIMA.jl      |   81     81  12.9s
     Testing PRIMA tests passed

emmt commented 3 months ago

Thank you for all these details. I have tested your examples on my Linux laptop (Ubuntu 23.10 with 6.0.0 kernel) with the following results:

julia> prima_serial()
1-element Vector{Vector{Float64}}:
 [5.0000000000000115]

julia> prima_parallel()
ERROR: TaskFailedException
Stacktrace:
  [1] wait
    @ ./task.jl:352 [inlined]
  [2] fetch
    @ ./task.jl:372 [inlined]
  [3] _broadcast_getindex_evalf
    @ ./broadcast.jl:709 [inlined]
  [4] _broadcast_getindex
    @ ./broadcast.jl:682 [inlined]
  [5] getindex
    @ ./broadcast.jl:636 [inlined]
  [6] copy
    @ ./broadcast.jl:942 [inlined]
  [7] materialize
    @ ./broadcast.jl:903 [inlined]
  [8] prima_parallel(; tasks::Int64)
    @ Main ./REPL[3]:6
  [9] prima_parallel()
    @ Main ./REPL[3]:1
 [10] top-level scope
    @ REPL[7]:1

    nested task error: StackOverflowError:

So the serial version worked but the parallel one did not. The serial version also worked for tasks=2 or more, whereas the parallel one always failed with the same stack overflow error.

Are you sure that the serial version failed on your PC-2 (Linux)?

For the parallel version, I can see some questions that need to be answered:

  1. Are the functions in libprima (the Fortran90 and the C versions) thread safe or not?
  2. The StackOverflowError seems to indicate a problem on the side of the Julia interface. So the same question arises for the Julia version. In principle, this interface allows for having an objective function that itself calls one of the PRIMA optimizers (hierarchical optimization). But it may not have been fully tested.

amontoison commented 3 months ago

In case it helps to isolate the issue: in Julia, the use of @ccall is thread-safe.

emmt commented 3 months ago

Yes, @ccall is used, but the problem I can see is that the Julia interface uses a per-thread stack of contexts (stored in the global variable _objfun_stack) in order to be thread-safe and, at the same time, to allow for hierarchical optimization, and this has not been thoroughly tested. With the new C API of libprima (see #28), this management would no longer be necessary and the problem may be solved (provided the C and Fortran code are thread-safe).
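For readers unfamiliar with the pattern, a per-thread stack of contexts can be sketched roughly as follows. This is only an illustration of the idea, not the code in PRIMA.jl: `Context`, `CONTEXT_STACKS`, `with_context`, and `current_objfun` are hypothetical names, and `Threads.maxthreadid()` assumes Julia 1.9 or later.

```julia
# Hypothetical sketch; none of these names come from PRIMA.jl.
struct Context
    objfun::Any   # objective function of the optimization currently running
end

# One stack per thread id, so nested (hierarchical) calls made on the same
# thread push and pop their contexts in LIFO order.
const CONTEXT_STACKS = [Context[] for _ in 1:Threads.maxthreadid()]

# Push a context, run `body`, and always pop the context afterwards.
function with_context(body, objfun)
    stack = CONTEXT_STACKS[Threads.threadid()]
    push!(stack, Context(objfun))
    try
        return body()
    finally
        pop!(stack)
    end
end

# What a callback would use to find the objective function it must evaluate.
current_objfun() = last(CONTEXT_STACKS[Threads.threadid()]).objfun

# Nested use: the inner call sees its own objective, the outer one is restored.
outer_result = with_context(x -> x + 1) do
    inner_result = with_context(x -> 2x) do
        current_objfun()(10)        # innermost context: 2 * 10
    end
    current_objfun()(inner_result)  # back in the outer context: 20 + 1
end
```

One known fragility of indexing state by `Threads.threadid()` is that, since Julia 1.7, a task may migrate to another thread when it yields. Whether that plays any role in the error reported here has not been established.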

amontoison commented 3 months ago

Ok, I see 👍 That raises the priority of doing an unofficial build of PRIMA_jll.jl v0.8.0 ASAP.

soldasim commented 3 months ago

> Are you sure that the serial version failed on your PC-2 (Linux)?

I have tested it again to be sure. The serial version really fails on my Linux PC but only in Julia started by VSCode.

When I run the two functions from a Julia REPL started by VSCode's Julia extension (Ctrl+Shift+P -> Julia: Start REPL), both the serial and the parallel version throw StackOverflowError.

When I start the Julia REPL from bash myself, only the parallel version fails and the serial version works fine, as on the other Linux devices.

I don't know what to make of this, but at least it is consistent when tried multiple times.

Version info

The only difference in versioninfo() is that the REPL started by VSCode has an additional line JULIA_EDITOR = code in the Environment.

julia> versioninfo()
Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
Threads: 8 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
  LD_LIBRARY_PATH = :/opt/gurobi10.0.0_linux64/gurobi1000/linux64/lib
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 8

julia> 
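
Since `versioninfo()` only prints part of the environment, a fuller comparison of the two REPLs may be worth trying. The following is a minimal sketch under assumptions: `env_snapshot` and `snapshot_diff` are hypothetical helper names, and the choice of the `JULIA_` and `LD_` prefixes is a guess at which variables could matter.

```julia
# Collect the environment variables whose names start with one of `prefixes`.
function env_snapshot(prefixes = ("JULIA_", "LD_"))
    return Dict(k => v for (k, v) in ENV if any(p -> startswith(k, p), prefixes))
end

# List the keys whose values differ between two snapshots as
# (key, value_in_a, value_in_b); `nothing` marks a key missing on that side.
function snapshot_diff(a::AbstractDict, b::AbstractDict)
    all_keys = sort!(collect(union(keys(a), keys(b))))
    return [(k, get(a, k, nothing), get(b, k, nothing))
            for k in all_keys if get(a, k, nothing) != get(b, k, nothing)]
end

# Example with the two environments described above, typed in by hand.
bash_env   = Dict("JULIA_NUM_THREADS" => "8")
vscode_env = Dict("JULIA_NUM_THREADS" => "8", "JULIA_EDITOR" => "code")
differences = snapshot_diff(bash_env, vscode_env)
```

In practice one would call `env_snapshot()` in each REPL and compare the two results; the hand-typed dictionaries only mirror the difference already reported above.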

emmt commented 3 months ago

Ok that's really puzzling...

We have started to figure out a way to deal with thread-safety (and hierarchical optimization) differently from how it is currently done in PRIMA.jl. I hope this will solve the issue. Your MWE should definitely be part of the test suite of PRIMA.jl.
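
A regression test along those lines could look roughly like the sketch below. It is written against a stand-in solver so that it runs without PRIMA installed: `solve_in_tasks` and `grid_solver` are hypothetical names, and the tolerance is an arbitrary choice.

```julia
using Test

# Run `solver(obj, x0)` inside `tasks` spawned tasks and collect the first
# element of each result, mirroring `prima_parallel` from the MWE.
function solve_in_tasks(solver, obj, x0; tasks = 1)
    handles = [Threads.@spawn solver(obj, copy(x0))[1] for _ in 1:tasks]
    return fetch.(handles)
end

# Stand-in solver (an assumption, NOT a PRIMA routine): a coarse 1-D grid
# search returning (solution, objective value).
function grid_solver(obj, x0)
    candidates = range(x0[1] - 10.0, x0[1] + 10.0; length = 2001)
    best = argmin(c -> obj([c]), candidates)
    return [best], obj([best])
end

@testset "optimizer inside a spawned task" begin
    obj = x -> abs(5.0 - x[1])
    for n in (1, 2)
        results = solve_in_tasks(grid_solver, obj, [0.0]; tasks = n)
        @test length(results) == n
        @test all(r -> isapprox(r[1], 5.0; atol = 1e-6), results)
    end
end
```

With PRIMA loaded, passing `newuoa` in place of `grid_solver` should reproduce the MWE, since the MWE takes the solution from `newuoa(obj, start)[1]`. The tolerance would probably need adjusting to what `newuoa` actually achieves on this non-smooth objective.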