Oh, in case you need the data: I used a publicly available data set from Qiita.
Sorry to hear this. Strangely, this all works without issues on my machine and OS (also, the fix passed all tests without error). Could you show me the output of versioninfo()? Also, it would be helpful if you could re-run the example without parallel workers so we can get a more interpretable error message.
One more thing: after adding workers and doing @everywhere using FlashWeave, could you run FlashWeave.workers_all_local() directly to see if this results in the same crash?
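For reference, the check described above would look roughly like this as a single script (a minimal sketch of the steps just named; running the example without the addprocs/@everywhere lines covers the "no parallel workers" case):

```julia
# Add one worker, load FlashWeave on all processes, then call the helper
# from the error message directly.
using Distributed
addprocs(1)
@everywhere using FlashWeave
FlashWeave.workers_all_local()
```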
I did. The same error I found earlier occurred after running FlashWeave.workers_all_local().

versioninfo()
Commit 788b2c77c1* (2020-11-09 13:37 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 5 3600 6-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, znver2)
Environment:
  JULIA_EDITOR = atom -a
  JULIA_NUM_THREADS = 6
FlashWeave.workers_all_local()
On worker 2:
UndefVarError: #55#56 not defined
deserialize_datatype at /opt/julia/usr/share/julia/stdlib/v1.5/Serialization/src/Serialization.jl:1252
handle_deserialize at /opt/julia/usr/share/julia/stdlib/v1.5/Serialization/src/Serialization.jl:826
deserialize at /opt/julia/usr/share/julia/stdlib/v1.5/Serialization/src/Serialization.jl:773
handle_deserialize at /opt/julia/usr/share/julia/stdlib/v1.5/Serialization/src/Serialization.jl:833
deserialize at /opt/julia/usr/share/julia/stdlib/v1.5/Serialization/src/Serialization.jl:773 [inlined]
deserialize_msg at /opt/julia/usr/share/julia/stdlib/v1.5/Distributed/src/messages.jl:99
#invokelatest#1 at ./essentials.jl:710 [inlined]
invokelatest at ./essentials.jl:709 [inlined]
message_handler_loop at /opt/julia/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:185
process_tcp_streams at /opt/julia/usr/share/julia/stdlib/v1.5/Distributed/src/process_messages.jl:142
#99 at ./task.jl:356
in top-level scope at Repos/Thesis/src/scripts/minimal_reproducible_example.jl:7
in workers_all_local at FlashWeave/9pt8o/src/misc.jl:96
in remotecall_fetch at stdlib/v1.5/Distributed/src/remotecall.jl:421
in #remotecall_fetch#146 at stdlib/v1.5/Distributed/src/remotecall.jl:421
in remotecall_fetch at stdlib/v1.5/Distributed/src/remotecall.jl:386
in #remotecall_fetch#143 at stdlib/v1.5/Distributed/src/remotecall.jl:394
How do I re-run the code without parallel workers? The bug only occurs when I run this:

using Distributed
addprocs(1)
@everywhere using FlashWeave

If I just do using FlashWeave, there is no problem.
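To illustrate the single-process case: without addprocs and @everywhere, the example reduces to something like the sketch below, assuming it ultimately calls FlashWeave's learn_network entry point (the file name and keyword settings are only placeholders):

```julia
# Single-process run: no Distributed, no addprocs, no @everywhere, so worker
# serialization is never involved. The data file name is a placeholder.
using FlashWeave
results = learn_network("otu_table.tsv", sensitive=true, heterogeneous=false)
```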
You may have forgotten to copy the first line of versioninfo(); which Julia version are you on? Anyway, I can't replicate this on 1.3, 1.5 or 1.6-beta. It's also odd that this only occurs on master, since the fix (or any other recent commit) hasn't touched workers_all_local() at all. Could you perhaps run

julia> using Distributed
julia> addprocs(1)
julia> @everywhere println(gethostname())

and then

remotecall_fetch(()->gethostname(), 2)

just to narrow the options down?
After restarting my IDE (Atom) I can't reproduce my error anymore.

using Distributed                        # needed for addprocs / remotecall_fetch

VERSION                                  # v"1.5.3"
addprocs(1)
@everywhere println(gethostname())
remotecall_fetch(()->gethostname(), 2)   # "LDTett"
However, I now find that I cannot reliably reproduce the error. I made separate Project.toml files for a minimal reproducible example, but the error only occurred after activating an environment:

pkg> activate env

Then I added FlashWeave#master to my default environment, and now I cannot reproduce the error within the environment anymore...
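For reference, the full sequence that triggered it looked roughly like this (the environment name is a placeholder):

```julia
# Rough reconstruction of the failing setup described above.
using Pkg
Pkg.activate("env")                                    # separate environment with its own Project.toml
Pkg.add(PackageSpec(name="FlashWeave", rev="master"))  # FlashWeave#master

using Distributed
addprocs(1)
@everywhere using FlashWeave
FlashWeave.workers_all_local()                         # UndefVarError: #55#56 not defined
```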
Maybe this is not a FlashWeave problem... I'm sorry.
I'll let you know what the problem was/is when/if I find the cause.
No problem, but good to know that this seems to be a more obscure (and perhaps rare) bug. In any case, if the line remotecall_fetch(()->gethostname(), 2) doesn't work, this really sounds like a bug in Distributed or Julia itself and would be worth reporting in their repositories.
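If it does resurface, the most useful thing to attach to an upstream report would be a FlashWeave-free check along these lines (just a sketch of the failing pattern, not a confirmed reproducer):

```julia
# If sending a simple closure to a worker already fails with an UndefVarError
# during deserialization, the problem lies outside FlashWeave.
using Distributed
addprocs(1)
println(remotecall_fetch(() -> gethostname(), 2))
```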
Yes, I will report it once I figure out how to reliably reproduce it.
Thank you for your time.
Thanks for your effort; feedback like this is very valuable!
Hey Janko,
I'm afraid your fix for #21 introduced a bug.