Open sumit-walia opened 1 year ago
Hi @sumit-walia,
at the moment we do not have an option to pass a custom ordering, but it's an interesting suggestion. It's something we might consider adding at a later iteration.
In the meantime I can try to help with the problem with mash, and see if this speeds things up. Would you mind sharing the error message you get? One possibility is that the mash
command is missing from you path. A requirement for the -d mash
option to work is that you need to have the mash command installed and available in your path. Can be also installed as a conda package.
Aside from that, I never tried building graphs with many (>1k) short (<1Mbp) genomes. I'd be curious to know how pangraph behaves in this regime. Depending on what your question is, it might be important to tune the -a
,-b
, -l
parameters.
Best, Marco
Hi @mmolari,
I have installed mash and it is available in my path. -d mash
option gives the following error:
ERROR: TaskFailedException Stacktrace: [1] wait @ ./task.jl:334 [inlined] [2] fetch @ ./task.jl:349 [inlined] [3] (::PanGraph.Graphs.Shell.var"#7#12"{NTuple{3987, PanGraph.Graphs.Graph}, PanGraph.Graphs.Shell.var"#compare#10"})(path::String, io::IOStream) @ PanGraph.Graphs.Shell ~/tools/pangraph/src/cmd.jl:67 [4] mktemp(fn::PanGraph.Graphs.Shell.var"#7#12"{NTuple{3987, PanGraph.Graphs.Graph}, PanGraph.Graphs.Shell.var"#compare#10"}, parent::String) @ Base.Filesystem ./file.jl:722 [5] mktemp(fn::Function) @ Base.Filesystem ./file.jl:720 [6] mash(::PanGraph.Graphs.Graph, ::Vararg{PanGraph.Graphs.Graph}) @ PanGraph.Graphs.Shell ~/tools/pangraph/src/cmd.jl:57 [7] ordering(::Function, ::PanGraph.Graphs.Graph, ::Vararg{PanGraph.Graphs.Graph}) @ PanGraph.Graphs.Align ~/tools/pangraph/src/align.jl:313 [8] align(::PanGraph.Graphs.Graph, ::Vararg{PanGraph.Graphs.Graph}; compare::Function, energy::Function, minblock::Int64, reference::Nothing, sensitivity::String, maxiter::Int64) @ PanGraph.Graphs.Align ~/tools/pangraph/src/align.jl:634 [9] (::PanGraph.var"#1#4")(args::Vector{String}) @ PanGraph ~/tools/pangraph/src/build.jl:126 [10] run(cmd::PanGraph.Commands.Command, args::Vector{String}) @ PanGraph.Commands ~/tools/pangraph/src/args.jl:182 [11] main(args::Vector{String}) @ PanGraph ~/tools/pangraph/src/PanGraph.jl:87 [12] julia_main() @ PanGraph ~/tools/pangraph/src/PanGraph.jl:92 [13] top-level scope @ none:1 nested task error: ArgumentError: input string is empty or only contains whitespace Stacktrace: [1] tryparse_internal(#unused#::Type{Int64}, s::String, startpos::Int64, endpos::Int64, base_::Int64, raise::Bool) @ Base ./parse.jl:109 [2] parse(::Type{Int64}, s::String; base::Nothing) @ Base ./parse.jl:241 [3] parse @ ./parse.jl:241 [inlined] [4] (::PanGraph.Graphs.Shell.var"#compare#10")(path::String) @ PanGraph.Graphs.Shell ~/tools/pangraph/src/cmd.jl:43 [5] (::PanGraph.Graphs.Shell.var"#9#14"{String, PanGraph.Graphs.Shell.var"#compare#10"})() @ PanGraph.Graphs.Shell ./task.jl:423
Thanks, Sumit
Hi @sumit-walia,
from the error message it looks like the code is failing on this line. This is where the output of mash triangle
is parsed. In particular here the content of the first line of the output is captured. This should contain simply the total number of sequences, but from the error message it seems like the first line is empty.
I tried reproducing the issue on my end. I downloaded ~3000 sars-cov-2 sequences from NCBI and run pangraph with the command:
pangraph build -d mash sequences.fa
but I could not reproduce the error. Pangraph executed successfully. However when inspecting the output graph I noticed that a single genome was missing from the graph. After some debugging I found out that this was due to an input-output stream not being flushed between a write and a read operation. Now the problem should be fixed. Could you try pulling again the master branch and testing if this solves the problem in your case?
If not, here are some further checks that we could do to try to figure out where the error comes from:
mash
you have corresponds to >=2.3?mash triangle sequences.fa
on your set of sequences? In particular could you check that the first line of the output is an integer corresponding to the number of sequences in your dataset?Thanks for your feedback, it's much appreciated!
Marco
Hi, I realized that the error was caused by an outdated version of Mash that I was using. Although I can now use -d mash
, it still runs slowly.
Thanks, Sumit
I am building PanGraph for Sars-CoV2 sequences (around 50k). I observed that the ordering step is a bottleneck here. Is it possible to pass a tree as an argument and bypass the ordering step in PanGraph? Also, I have tried adding "-d mash" as an argument to check the speedup but it throws an error instead.
Any help would be appreciated.
Thanks