Bypassing ordering step

sumit-walia commented 1 year ago

I am building PanGraph for Sars-CoV2 sequences (around 50k). I observed that the ordering step is a bottleneck here. Is it possible to pass a tree as an argument and bypass the ordering step in PanGraph? Also, I have tried adding "-d mash" as an argument to check the speedup but it throws an error instead.

Any help would be appreciated.

Thanks

mmolari commented 1 year ago

Hi @sumit-walia,

at the moment we do not have an option to pass a custom ordering, but it's an interesting suggestion. It's something we might consider adding at a later iteration.

In the meantime I can try to help with the problem with mash, and see if this speeds things up. Would you mind sharing the error message you get? One possibility is that the mash command is missing from you path. A requirement for the -d mash option to work is that you need to have the mash command installed and available in your path. Can be also installed as a conda package.

Aside from that, I never tried building graphs with many (>1k) short (<1Mbp) genomes. I'd be curious to know how pangraph behaves in this regime. Depending on what your question is, it might be important to tune the -a,-b, -l parameters.

Best, Marco

sumit-walia commented 1 year ago

Hi @mmolari,

I have installed mash and it is available in my path. -d mash option gives the following error:

ERROR: TaskFailedException Stacktrace: [1] wait @ ./task.jl:334 [inlined] [2] fetch @ ./task.jl:349 [inlined] [3] (::PanGraph.Graphs.Shell.var"#7#12"{NTuple{3987, PanGraph.Graphs.Graph}, PanGraph.Graphs.Shell.var"#compare#10"})(path::String, io::IOStream) @ PanGraph.Graphs.Shell ~/tools/pangraph/src/cmd.jl:67 [4] mktemp(fn::PanGraph.Graphs.Shell.var"#7#12"{NTuple{3987, PanGraph.Graphs.Graph}, PanGraph.Graphs.Shell.var"#compare#10"}, parent::String) @ Base.Filesystem ./file.jl:722 [5] mktemp(fn::Function) @ Base.Filesystem ./file.jl:720 [6] mash(::PanGraph.Graphs.Graph, ::Vararg{PanGraph.Graphs.Graph}) @ PanGraph.Graphs.Shell ~/tools/pangraph/src/cmd.jl:57 [7] ordering(::Function, ::PanGraph.Graphs.Graph, ::Vararg{PanGraph.Graphs.Graph}) @ PanGraph.Graphs.Align ~/tools/pangraph/src/align.jl:313 [8] align(::PanGraph.Graphs.Graph, ::Vararg{PanGraph.Graphs.Graph}; compare::Function, energy::Function, minblock::Int64, reference::Nothing, sensitivity::String, maxiter::Int64) @ PanGraph.Graphs.Align ~/tools/pangraph/src/align.jl:634 [9] (::PanGraph.var"#1#4")(args::Vector{String}) @ PanGraph ~/tools/pangraph/src/build.jl:126 [10] run(cmd::PanGraph.Commands.Command, args::Vector{String}) @ PanGraph.Commands ~/tools/pangraph/src/args.jl:182 [11] main(args::Vector{String}) @ PanGraph ~/tools/pangraph/src/PanGraph.jl:87 [12] julia_main() @ PanGraph ~/tools/pangraph/src/PanGraph.jl:92 [13] top-level scope @ none:1 nested task error: ArgumentError: input string is empty or only contains whitespace Stacktrace: [1] tryparse_internal(#unused#::Type{Int64}, s::String, startpos::Int64, endpos::Int64, base_::Int64, raise::Bool) @ Base ./parse.jl:109 [2] parse(::Type{Int64}, s::String; base::Nothing) @ Base ./parse.jl:241 [3] parse @ ./parse.jl:241 [inlined] [4] (::PanGraph.Graphs.Shell.var"#compare#10")(path::String) @ PanGraph.Graphs.Shell ~/tools/pangraph/src/cmd.jl:43 [5] (::PanGraph.Graphs.Shell.var"#9#14"{String, PanGraph.Graphs.Shell.var"#compare#10"})() @ PanGraph.Graphs.Shell ./task.jl:423

Thanks, Sumit

mmolari commented 1 year ago

Hi @sumit-walia,

from the error message it looks like the code is failing on this line. This is where the output of mash triangle is parsed. In particular here the content of the first line of the output is captured. This should contain simply the total number of sequences, but from the error message it seems like the first line is empty.

I tried reproducing the issue on my end. I downloaded ~3000 sars-cov-2 sequences from NCBI and run pangraph with the command:

pangraph build -d mash sequences.fa

but I could not reproduce the error. Pangraph executed successfully. However when inspecting the output graph I noticed that a single genome was missing from the graph. After some debugging I found out that this was due to an input-output stream not being flushed between a write and a read operation. Now the problem should be fixed. Could you try pulling again the master branch and testing if this solves the problem in your case?

If not, here are some further checks that we could do to try to figure out where the error comes from:

could you check that the version of mash you have corresponds to >=2.3?
could you try running mash triangle sequences.fa on your set of sequences? In particular could you check that the first line of the output is an integer corresponding to the number of sequences in your dataset?
could you share the exact command you are running pangraph with?
could you try to run it with a subset of sequences (e.g. 1000) and see if the problem persists?

Thanks for your feedback, it's much appreciated!

Marco

sumit-walia commented 1 year ago

Hi, I realized that the error was caused by an outdated version of Mash that I was using. Although I can now use -d mash, it still runs slowly.

Thanks, Sumit

neherlab / pangraph

Bypassing ordering step #32