baozg closed this issue 2 years ago
I haven't tried it on plants specifically, but usually higher k and w lead to faster runtimes but more fragmented results. The last step in your log is the multiplex DBG algorithm, so you can try using k=2000,w=500, which should be much faster and hopefully not much more fragmented. If that's still too slow, you can try k=2000,w=1950. Also, you can try using 8 threads, because more threads than that slow MBG down.
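To give a rough feel for why a larger w means less work, here is a small simulation of generic random-minimizer winnowing. This is only a sketch under assumptions: it is not MBG's code, the function name `minimizer_density` is made up for illustration, and random numbers stand in for k-mer hashes. It keeps the smallest hash in every window of w consecutive k-mers and reports what fraction of k-mers gets selected.

```python
import random


def minimizer_density(n_kmers, w, seed=0):
    """Fraction of k-mer positions selected by window-minimum winnowing.

    Random floats stand in for k-mer hash values (an assumption made
    for illustration; a real tool hashes the actual k-mer sequences).
    """
    rng = random.Random(seed)
    hashes = [rng.random() for _ in range(n_kmers)]
    selected = set()
    # Slide a window of w consecutive k-mers and keep the smallest hash in each.
    for start in range(n_kmers - w + 1):
        window = hashes[start:start + w]
        offset = min(range(w), key=window.__getitem__)
        selected.add(start + offset)
    return len(selected) / n_kmers


if __name__ == "__main__":
    for w in (10, 50, 200):
        print(w, round(minimizer_density(20000, w), 4), round(2 / (w + 1), 4))
```

For random hashes the selected fraction comes out close to 2/(w+1), so going from a small window to a large one leaves far fewer k-mers to build the graph from. How that translates into runtime and fragmentation in MBG itself still has to be checked on real data.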
Thanks. For high-coverage HiFi data (Arabidopsis thaliana, 140x HiFi, 150 Mb genome size), do I need to downsample? It also runs for several hours.
I tried -k 2001 -w 1950 and -k 2001 -w 500. The results seem very similar based on the assembly size and N50. How should I choose between the GFA files?
# -k 2001 -w 500 (8 threads, 11 hours)
selecting k-mers and building graph topology took 205,387 s
unitigifying took 6,406 s
filtering unitigs took 0,432 s
getting read paths took 199,877 s
resolving unitigs took 38619,672 s
building unitig sequences took 637,480 s
forcing edge consistency took 3,445 s
writing the graph and calculating stats took 15,219 s
writing sequence paths took 35,320 s
nodes: 13661
edges: 13245
assembly size 630958840 bp, N50 1022524
approximate number of k-mers ~ 603623179
# -k 2001 -w 1950 (8 threads, 40 minutes)
selecting k-mers and building graph topology took 198,559 s
unitigifying took 1,766 s
filtering unitigs took 0,332 s
getting read paths took 195,5 s
resolving unitigs took 1287,47 s
building unitig sequences took 504,983 s
forcing edge consistency took 1,389 s
writing the graph and calculating stats took 13,481 s
writing sequence paths took 37,578 s
nodes: 11502
edges: 9769
assembly size 610150597 bp, N50 1086878
approximate number of k-mers ~ 587135095
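To compare where the time goes in the two runs, the stage timings above can be parsed with a few lines of Python. This is a minimal sketch that assumes the log lines have the form "<stage> took <seconds> s" with a comma as the decimal separator, as in the logs pasted here. The helper names `parse_stage_times` and `slowest_stage` are hypothetical.

```python
import re

# Matches lines such as "resolving unitigs took 38619,672 s".
STAGE_RE = re.compile(r"^(?P<stage>.+?) took (?P<secs>\d+(?:[.,]\d+)?) s$")


def parse_stage_times(log_text):
    """Return {stage name: seconds} for every timing line in the log text."""
    times = {}
    for line in log_text.splitlines():
        match = STAGE_RE.match(line.strip())
        if match:
            # The logs use a decimal comma, so normalise it before converting.
            times[match.group("stage")] = float(match.group("secs").replace(",", "."))
    return times


def slowest_stage(times):
    """Return (stage, fraction of the summed stage time) for the slowest stage."""
    stage = max(times, key=times.get)
    return stage, times[stage] / sum(times.values())


if __name__ == "__main__":
    excerpt = """getting read paths took 199,877 s
resolving unitigs took 38619,672 s
building unitig sequences took 637,480 s"""
    print(slowest_stage(parse_stage_times(excerpt)))
```

Run on the full logs above, resolving unitigs is the slowest stage in both runs: roughly 97% of the summed stage time with w=500 and roughly 57% with w=1950.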
The k=2000,w=500 graph is probably slightly better, but you can confirm that by looking at both graphs with Bandage (https://github.com/rrwick/Bandage) to see how contiguous they are and what kinds of tangles they have left.
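Alongside the visual check, basic statistics can be recomputed from each GFA file for a side-by-side comparison. The sketch below assumes GFA1 input where S lines carry the node sequence in the third tab-separated column (or a length in an LN:i: tag when the sequence is "*") and L lines are edges. The function names `n50` and `gfa_stats` are hypothetical, and the numbers may not match MBG's own summary exactly, for example if MBG accounts for overlaps between nodes.

```python
def n50(lengths):
    """Smallest length L such that nodes of length >= L cover half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def gfa_stats(lines):
    """Count nodes and edges and summarise node lengths from GFA1 lines."""
    lengths = []
    n_edges = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            if fields[2] != "*":
                lengths.append(len(fields[2]))
            else:
                # Sequence omitted: fall back to the optional LN:i: length tag.
                tag_lengths = [int(t[5:]) for t in fields[3:] if t.startswith("LN:i:")]
                lengths.append(tag_lengths[0] if tag_lengths else 0)
        elif fields[0] == "L":
            n_edges += 1
    return {"nodes": len(lengths), "edges": n_edges,
            "size": sum(lengths), "n50": n50(lengths)}


# Usage with a hypothetical file name:
#   with open("graph_w500.gfa") as handle:
#       print(gfa_stats(handle))
```

Similar numbers for both graphs, as in the logs above, do not say much about tangles, so the Bandage view remains the more informative check.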
This seems to be resolved. If something else comes up then please open a new issue.
Hi @maickrau,
Do you have any experience selecting k/w for plant genomes? I tried to use MBG with k=1001,w=100 for a 30x plant genome (30G of HiFi data). It has been running for more than 24 hours with 64 cores and 256G of memory. Here is the latest log: