How to select k/w for a highly repetitive plant genome

baozg commented 2 years ago

Hi, @maickrau

Do you have any experience selecting k/w for a plant genome? I tried MBG with k=1001,w=100 on a 30x plant genome (30G of HiFi data). It has been running for more than 24 hours with 64 cores and 256G of memory.

Here is the latest log output:

try resolve k=1944, replaced 2 nodes with 6 nodes, unitigified 6 nodes to 3 nodes
try resolve k=1945, replaced 1 nodes with 3 nodes, unitigified 4 nodes to 2 nodes
try resolve k=1947, replaced 1 nodes with 2 nodes, unitigified 4 nodes to 2 nodes
try resolve k=1001, replaced 0 nodes with 0 nodes
try resolve k=1948, replaced 2 nodes with 7 nodes, unitigified 12 nodes to 4 nodes
try resolve k=1949, replaced 2 nodes with 7 nodes, unitigified 11 nodes to 4 nodes
try resolve k=1950, replaced 2 nodes with 7 nodes, unitigified 12 nodes to 4 nodes
try resolve k=1951, replaced 1 nodes with 4 nodes, unitigified 6 nodes to 3 nodes
try resolve k=1952, replaced 5 nodes with 14 nodes, unitigified 22 nodes to 11 nodes

maickrau commented 2 years ago

I haven't tried it on plants specifically, but higher k and w usually lead to faster runtimes and more fragmented results. The latest lines in your log come from the multiplex DBG algorithm, which has already reached k around 1950, so you can try k=2000,w=500, which should be much faster and hopefully not much more fragmented. If that's still too slow, you can try k=2000,w=1950. Also try using 8 threads, because more threads than that slow MBG down.
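A minimal sketch of how the two suggested settings could be turned into command lines. It only builds and prints the argument lists and runs nothing. The k and w values are the ones used in the follow-up runs in this thread (`-k 2001`); the `-i`, `-o` and `-t` flags and the file names are assumptions that do not appear in this thread, so check them against `MBG --help` before use.

```python
from typing import List


def mbg_command(reads: str, out_gfa: str, k: int, w: int, threads: int = 8) -> List[str]:
    """Build the argument list for one MBG run (nothing is executed here).

    -k and -w are the options discussed in this thread; -i, -o and -t are
    assumed from MBG's usage text and should be verified with `MBG --help`.
    """
    return [
        "MBG",
        "-i", reads,
        "-o", out_gfa,
        "-k", str(k),
        "-w", str(w),
        "-t", str(threads),
    ]


# (k, w) pairs taken from the follow-up runs in this thread.
settings = [(2001, 500), (2001, 1950)]

# "reads.fa" and the output names are hypothetical placeholders.
commands = [
    mbg_command("reads.fa", f"graph_k{k}_w{w}.gfa", k, w)
    for k, w in settings
]

for cmd in commands:
    print(" ".join(cmd))
```

Writing each setting to its own output file keeps the resulting graphs side by side for comparison.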

baozg commented 2 years ago

Thanks. For high-coverage HiFi data (Arabidopsis thaliana, 140x HiFi, 150 Mb genome size), do I need to downsample? It also runs for several hours.
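For reference, the arithmetic behind downsampling can be sketched as below. The 30x target is an assumption, chosen only because it is the depth of the other dataset in this thread; whether downsampling actually helps MBG is not established here.

```python
def downsample_fraction(current_coverage: float, target_coverage: float) -> float:
    """Fraction of reads to keep to go from current to target coverage."""
    if current_coverage <= 0 or target_coverage <= 0:
        raise ValueError("coverage values must be positive")
    # Never keep more than 100% of the reads.
    return min(1.0, target_coverage / current_coverage)


genome_size_bp = 150_000_000   # 150 Mb, from the comment above
current_coverage = 140         # 140x HiFi, from the comment above
target_coverage = 30           # assumed target, not a recommendation

# Total sequenced bases implied by coverage x genome size.
total_bases = genome_size_bp * current_coverage
fraction = downsample_fraction(current_coverage, target_coverage)

print(f"total bases: {total_bases / 1e9:.1f} Gb")
print(f"fraction of reads to keep: {fraction:.3f}")
```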

baozg commented 2 years ago

I tried -k 2001 -w 1950 and -k 2001 -w 500. The results look very similar in assembly size and N50. How should I choose between the two GFAs?

# -k 2001 -w 500 (8 threads, 11 hours)

selecting k-mers and building graph topology took 205,387 s
unitigifying took 6,406 s
filtering unitigs took 0,432 s
getting read paths took 199,877 s
resolving unitigs took 38619,672 s
building unitig sequences took 637,480 s
forcing edge consistency took 3,445 s
writing the graph and calculating stats took 15,219 s
writing sequence paths took 35,320 s
nodes: 13661
edges: 13245
assembly size 630958840 bp, N50 1022524
approximate number of k-mers ~ 603623179

# -k 2001 -w 1950 (8 threads, 40 minutes)

selecting k-mers and building graph topology took 198,559 s
unitigifying took 1,766 s
filtering unitigs took 0,332 s
getting read paths took 195,5 s
resolving unitigs took 1287,47 s
building unitig sequences took 504,983 s
forcing edge consistency took 1,389 s
writing the graph and calculating stats took 13,481 s
writing sequence paths took 37,578 s
nodes: 11502
edges: 9769
assembly size 610150597 bp, N50 1086878
approximate number of k-mers ~ 587135095
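The two timing logs above can be compared step by step with a short script. This is a sketch that assumes the comma in values such as `205,387` is a decimal separator; under that assumption the totals come to about 11 hours and about 37 minutes, which is consistent with the run times given in the headings, and the "resolving unitigs" step accounts for about 97% of the w=500 run.

```python
import re

# Matches lines such as "resolving unitigs took 38619,672 s".
STEP_RE = re.compile(r"^(?P<step>.+?) took (?P<secs>[\d.,]+) s$")


def parse_timings(log_text: str) -> dict:
    """Map each step name to its duration in seconds.

    Assumes the comma is a decimal separator (locale-formatted output).
    """
    timings = {}
    for line in log_text.splitlines():
        match = STEP_RE.match(line.strip())
        if match:
            timings[match.group("step")] = float(match.group("secs").replace(",", "."))
    return timings


log_w500 = """\
selecting k-mers and building graph topology took 205,387 s
unitigifying took 6,406 s
filtering unitigs took 0,432 s
getting read paths took 199,877 s
resolving unitigs took 38619,672 s
building unitig sequences took 637,480 s
forcing edge consistency took 3,445 s
writing the graph and calculating stats took 15,219 s
writing sequence paths took 35,320 s
"""

log_w1950 = """\
selecting k-mers and building graph topology took 198,559 s
unitigifying took 1,766 s
filtering unitigs took 0,332 s
getting read paths took 195,5 s
resolving unitigs took 1287,47 s
building unitig sequences took 504,983 s
forcing edge consistency took 1,389 s
writing the graph and calculating stats took 13,481 s
writing sequence paths took 37,578 s
"""

t500 = parse_timings(log_w500)
t1950 = parse_timings(log_w1950)

# Side-by-side per-step comparison, then totals.
for step in t500:
    print(f"{step:<46} {t500[step]:>10.1f} s {t1950[step]:>10.1f} s")

total_500 = sum(t500.values())
total_1950 = sum(t1950.values())
print(f"total w=500:  {total_500 / 3600:.2f} h")
print(f"total w=1950: {total_1950 / 60:.1f} min")
```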

maickrau commented 2 years ago

The k=2000,w=500 graph is probably slightly better, but you can confirm that by looking at both graphs with Bandage (https://github.com/rrwick/Bandage) to see how contiguous they are and what kinds of tangles are left.
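As a quick numeric complement to the visual check, node count, edge count and N50 can be recomputed directly from a GFA file. This is a minimal sketch that assumes standard GFA1 `S` and `L` lines and a toy in-memory graph; it does not show tangles, which still need a viewer such as Bandage.

```python
from typing import Iterable, List


def segment_lengths(gfa_lines: Iterable[str]) -> List[int]:
    """Collect the length of every segment (S line) in a GFA1 file."""
    lengths = []
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "S" or len(fields) < 3:
            continue
        sequence = fields[2]
        if sequence != "*":
            lengths.append(len(sequence))
            continue
        # Sequence omitted: fall back to the optional LN:i: tag.
        for tag in fields[3:]:
            if tag.startswith("LN:i:"):
                lengths.append(int(tag[len("LN:i:"):]))
                break
    return lengths


def count_links(gfa_lines: Iterable[str]) -> int:
    """Count edges (L lines)."""
    return sum(1 for line in gfa_lines if line.startswith("L\t"))


def n50(lengths: List[int]) -> int:
    """Smallest length such that segments at least this long cover half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Toy graph for illustration only; not taken from the runs in this thread.
toy_gfa = [
    "S\tn1\tACGTACGTAC",
    "S\tn2\tACGT",
    "S\tn3\t*\tLN:i:6",
    "L\tn1\t+\tn2\t+\t0M",
]

lengths = segment_lengths(toy_gfa)
links = count_links(toy_gfa)
toy_n50 = n50(lengths)
print(f"nodes: {len(lengths)}, edges: {links}, size: {sum(lengths)} bp, N50: {toy_n50}")
```

For a real graph, the lines can be read from the output file (for example with `open(path).readlines()`, where `path` is whatever name was passed to MBG) and the two runs compared on the same statistics.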

maickrau commented 2 years ago

This seems to be resolved. If something else comes up then please open a new issue.

maickrau / MBG

How to select k/w for a highly repetitive plant genome #7