voutcn / megahit

Ultra-fast and memory-efficient (meta-)genome assembler
http://www.ncbi.nlm.nih.gov/pubmed/25609793
GNU General Public License v3.0
585 stars 134 forks source link

[All datasets] Suspicious report of identical # unitigs disconnected in graph pruning #333

Open GabeAl opened 2 years ago

GabeAl commented 2 years ago

Hello, in experimenting with graph pruning options, I noticed after no more tips are removed, each subsequent cleaning round reports removing the identical number of unitigs. This is suspicious as it seems statistically unlikely that an arbitrarily large number of rounds in a row would decide to remove exactly the same number of unitigs, usually a number slighly less than the last round where >= 1 tip was pruned.

One example for k21 is shown below (although this behavior occurs at all k until much higher k > ~99 in most cases, after which it will early terminate before doing all rounds, as expected). Is this an error in reporting or is the program really disconnecting the exact same number of unitigs infinitely? Could it be trying to re-remove already-removed unitigs? Is there a risk of corrupting the DBG?

This happens in all metagenomes tested (25 of them), especially at lower k-mer sizes (< 99). In the log below, note where it says "221 - Number unitigs disconnected: 397" 15 times in a row (or even more if a higher number of cleaning rounds is specified). Sometimes this repetition occurs as early as round 3 or 4, so even with the default number of cleaning rounds (5), this suspicious behavior occurs, but is less easy to notice.

megahit_core assemble -s /tmp/megahit_tmp_xbrq8ica/k21/21 -o /mnt/d/work/metagenomes/hazda_infant/ERR7738660/intermediate_contigs/k21 -t 12 --min_standalone 300 --prune_level 2 --merge_len 20 --merge_similar 0.95 --cleaning_rounds 20 --disconnect_ratio 0.1 --low_local_ratio 0.2 --cleaning_rounds 20 --min_depth 2 --bubble_level 2 --max_tip_len -1 --careful_bubble
2022-06-30 02:49:37 - b'INFO  main_assemble.cpp             :  129 - Loading succinct de Bruijn graph: /tmp/megahit_tmp_xbrq8ica/k21/21Done. Time elapsed: 36.726976'
2022-06-30 02:49:37 - b'INFO  main_assemble.cpp             :  133 - Number of Edges: 3426239645; K value: 21'
2022-06-30 02:49:37 - b'INFO  main_assemble.cpp             :  140 - Number of CPU threads: 12'
2022-06-30 02:50:21 - b'INFO  assembly/sdbg_pruning.cpp     :  160 - Removing tips with length less than 2; Accumulated tips removed: 3616065; time elapsed: 3.3208'
2022-06-30 02:50:26 - b'INFO  assembly/sdbg_pruning.cpp     :  160 - Removing tips with length less than 4; Accumulated tips removed: 7809445; time elapsed: 5.7997'
2022-06-30 02:50:40 - b'INFO  assembly/sdbg_pruning.cpp     :  160 - Removing tips with length less than 8; Accumulated tips removed: 14000139; time elapsed: 13.2320'
2022-06-30 02:51:03 - b'INFO  assembly/sdbg_pruning.cpp     :  160 - Removing tips with length less than 16; Accumulated tips removed: 22451461; time elapsed: 23.7826'
2022-06-30 02:51:32 - b'INFO  assembly/sdbg_pruning.cpp     :  160 - Removing tips with length less than 32; Accumulated tips removed: 28140288; time elapsed: 28.4302'
2022-06-30 02:51:56 - b'INFO  assembly/sdbg_pruning.cpp     :  169 - Removing tips with length less than 42; Accumulated tips removed: 28904439; time elapsed: 24.0890'
2022-06-30 02:51:56 - b'INFO  main_assemble.cpp             :  158 - Tips removal done! Time elapsed(sec): 138.728'
2022-06-30 03:05:02 - b'INFO  assembly/unitig_graph.cpp     :   83 - Graph size without loops: 80760811, palindrome: 6296'
2022-06-30 03:05:30 - b'INFO  main_assemble.cpp             :  166 - unitig graph size: 80763544, time for building: 813.879'
2022-06-30 03:06:06 - b'INFO  assembly/contig_stat.h        :   40 - Max: 4483, Min: 22, N50: 43, number contigs: 80763544, number isolated: 518792, number looped: 2733, total size: 3243814033,'
2022-06-30 03:06:06 - b'INFO  main_assemble.cpp             :  184 - Graph cleaning round 1'
2022-06-30 03:06:52 - b'INFO  main_assemble.cpp             :  200 - Number of bubbles removed: 1746811, Time elapsed(sec): 46.318'
2022-06-30 03:07:30 - b'INFO  main_assemble.cpp             :  210 - Number of complex bubbles removed: 306965, Time elapsed(sec): 37.585308'
2022-06-30 03:08:06 - b'INFO  main_assemble.cpp             :  221 - Number unitigs disconnected: 28513317, time: 36.000'
2022-06-30 03:08:16 - b'INFO  main_assemble.cpp             :  245 - Unitigs removed in excessive pruning: 6756, time: 10.917'
2022-06-30 03:08:17 - b'INFO  main_assemble.cpp             :  184 - Graph cleaning round 2'
2022-06-30 03:09:24 - b'INFO  main_assemble.cpp             :  192 - Tips removed: 10381398, time: 67.238'
2022-06-30 03:09:40 - b'INFO  main_assemble.cpp             :  200 - Number of bubbles removed: 204720, Time elapsed(sec): 16.711'
2022-06-30 03:10:00 - b'INFO  main_assemble.cpp             :  210 - Number of complex bubbles removed: 52328, Time elapsed(sec): 19.295662'
2022-06-30 03:10:12 - b'INFO  main_assemble.cpp             :  221 - Number unitigs disconnected: 59571, time: 12.034'
2022-06-30 03:10:20 - b'INFO  main_assemble.cpp             :  245 - Unitigs removed in excessive pruning: 97, time: 8.702'
2022-06-30 03:10:20 - b'INFO  main_assemble.cpp             :  184 - Graph cleaning round 3'
2022-06-30 03:11:03 - b'INFO  main_assemble.cpp             :  192 - Tips removed: 9423, time: 42.508'
2022-06-30 03:11:17 - b'INFO  main_assemble.cpp             :  200 - Number of bubbles removed: 740, Time elapsed(sec): 14.244'
2022-06-30 03:11:35 - b'INFO  main_assemble.cpp             :  210 - Number of complex bubbles removed: 344, Time elapsed(sec): 18.167730'
2022-06-30 03:11:47 - b'INFO  main_assemble.cpp             :  221 - Number unitigs disconnected: 1246, time: 11.648'
2022-06-30 03:11:56 - b'INFO  main_assemble.cpp             :  245 - Unitigs removed in excessive pruning: 5, time: 8.569'
2022-06-30 03:11:56 - b'INFO  main_assemble.cpp             :  184 - Graph cleaning round 4'
2022-06-30 03:12:38 - b'INFO  main_assemble.cpp             :  192 - Tips removed: 142, time: 42.249'
2022-06-30 03:12:52 - b'INFO  main_assemble.cpp             :  200 - Number of bubbles removed: 5, Time elapsed(sec): 14.187'
2022-06-30 03:13:10 - b'INFO  main_assemble.cpp             :  210 - Number of complex bubbles removed: 1, Time elapsed(sec): 17.982608'
2022-06-30 03:13:22 - b'INFO  main_assemble.cpp             :  221 - Number unitigs disconnected: 421, time: 11.667'
2022-06-30 03:13:24 - b'INFO  main_assemble.cpp             :  245 - Unitigs removed in excessive pruning: 0, time: 2.788'
2022-06-30 03:13:24 - b'INFO  main_assemble.cpp             :  184 - Graph cleaning round 5'
2022-06-30 03:14:06 - b'INFO  main_assemble.cpp             :  192 - Tips removed: 8, time: 41.767'
2022-06-30 03:14:20 - b'INFO  main_assemble.cpp             :  200 - Number of bubbles removed: 0, Time elapsed(sec): 13.943'
2022-06-30 03:14:38 - b'INFO  main_assemble.cpp             :  210 - Number of complex bubbles removed: 0, Time elapsed(sec): 17.863159'
2022-06-30 03:14:50 - b'INFO  main_assemble.cpp             :  221 - Number unitigs disconnected: 398, time: 11.667'
2022-06-30 03:14:52 - b'INFO  main_assemble.cpp             :  245 - Unitigs removed in excessive pruning: 0, time: 2.758'
2022-06-30 03:14:52 - b'INFO  main_assemble.cpp             :  184 - Graph cleaning round 6'
2022-06-30 03:15:34 - b'INFO  main_assemble.cpp             :  192 - Tips removed: 0, time: 41.208'
2022-06-30 03:15:48 - b'INFO  main_assemble.cpp             :  200 - Number of bubbles removed: 0, Time elapsed(sec): 13.952'
2022-06-30 03:16:06 - b'INFO  main_assemble.cpp             :  210 - Number of complex bubbles removed: 0, Time elapsed(sec): 17.856943'
2022-06-30 03:16:17 - b'INFO  main_assemble.cpp             :  221 - Number unitigs disconnected: 397, time: 11.418'
2022-06-30 03:16:20 - b'INFO  main_assemble.cpp             :  245 - Unitigs removed in excessive pruning: 0, time: 2.785'
2022-06-30 03:16:20 - b'INFO  main_assemble.cpp             :  184 - Graph cleaning round 7'
2022-06-30 03:17:01 - b'INFO  main_assemble.cpp             :  192 - Tips removed: 0, time: 41.071'
2022-06-30 03:17:15 - b'INFO  main_assemble.cpp             :  200 - Number of bubbles removed: 0, Time elapsed(sec): 13.964'
2022-06-30 03:17:33 - b'INFO  main_assemble.cpp             :  210 - Number of complex bubbles removed: 0, Time elapsed(sec): 17.805315'
2022-06-30 03:17:44 - b'INFO  main_assemble.cpp             :  221 - Number unitigs disconnected: 397, time: 11.458'
2022-06-30 03:17:47 - b'INFO  main_assemble.cpp             :  245 - Unitigs removed in excessive pruning: 0, time: 2.791'
2022-06-30 03:17:47 - b'INFO  main_assemble.cpp             :  184 - Graph cleaning round 8'
2022-06-30 03:18:28 - b'INFO  main_assemble.cpp             :  192 - Tips removed: 0, time: 40.944'
2022-06-30 03:18:42 - b'INFO  main_assemble.cpp             :  200 - Number of bubbles removed: 0, Time elapsed(sec): 13.922'
2022-06-30 03:19:00 - b'INFO  main_assemble.cpp             :  210 - Number of complex bubbles removed: 0, Time elapsed(sec): 17.873091'
2022-06-30 03:19:11 - b'INFO  main_assemble.cpp             :  221 - Number unitigs disconnected: 397, time: 11.499'
2022-06-30 03:19:14 - b'INFO  main_assemble.cpp             :  245 - Unitigs removed in excessive pruning: 0, time: 2.772'
2022-06-30 03:19:14 - b'INFO  main_assemble.cpp             :  184 - Graph cleaning round 9'
2022-06-30 03:19:55 - b'INFO  main_assemble.cpp             :  192 - Tips removed: 0, time: 41.110'
2022-06-30 03:20:09 - b'INFO  main_assemble.cpp             :  200 - Number of bubbles removed: 0, Time elapsed(sec): 13.941'
2022-06-30 03:20:27 - b'INFO  main_assemble.cpp             :  210 - Number of complex bubbles removed: 0, Time elapsed(sec): 17.789024'
2022-06-30 03:20:38 - b'INFO  main_assemble.cpp             :  221 - Number unitigs disconnected: 397, time: 11.431'
2022-06-30 03:20:41 - b'INFO  main_assemble.cpp             :  245 - Unitigs removed in excessive pruning: 0, time: 2.761'
2022-06-30 03:20:41 - b'INFO  main_assemble.cpp             :  184 - Graph cleaning round 10'
2022-06-30 03:21:22 - b'INFO  main_assemble.cpp             :  192 - Tips removed: 0, time: 41.126'
2022-06-30 03:21:36 - b'INFO  main_assemble.cpp             :  200 - Number of bubbles removed: 0, Time elapsed(sec): 13.883'
2022-06-30 03:21:54 - b'INFO  main_assemble.cpp             :  210 - Number of complex bubbles removed: 0, Time elapsed(sec): 17.871787'
2022-06-30 03:22:05 - b'INFO  main_assemble.cpp             :  221 - Number unitigs disconnected: 397, time: 11.486'
2022-06-30 03:22:08 - b'INFO  main_assemble.cpp             :  245 - Unitigs removed in excessive pruning: 0, time: 2.771'
2022-06-30 03:22:08 - b'INFO  main_assemble.cpp             :  184 - Graph cleaning round 11'
2022-06-30 03:22:49 - b'INFO  main_assemble.cpp             :  192 - Tips removed: 0, time: 41.025'
2022-06-30 03:23:03 - b'INFO  main_assemble.cpp             :  200 - Number of bubbles removed: 0, Time elapsed(sec): 13.962'
2022-06-30 03:23:21 - b'INFO  main_assemble.cpp             :  210 - Number of complex bubbles removed: 0, Time elapsed(sec): 17.795757'
2022-06-30 03:23:32 - b'INFO  main_assemble.cpp             :  221 - Number unitigs disconnected: 397, time: 11.439'
2022-06-30 03:23:35 - b'INFO  main_assemble.cpp             :  245 - Unitigs removed in excessive pruning: 0, time: 2.778'
2022-06-30 03:23:35 - b'INFO  main_assemble.cpp             :  184 - Graph cleaning round 12'
2022-06-30 03:24:16 - b'INFO  main_assemble.cpp             :  192 - Tips removed: 0, time: 41.141'
2022-06-30 03:24:30 - b'INFO  main_assemble.cpp             :  200 - Number of bubbles removed: 0, Time elapsed(sec): 13.944'
2022-06-30 03:24:48 - b'INFO  main_assemble.cpp             :  210 - Number of complex bubbles removed: 0, Time elapsed(sec): 17.847338'
2022-06-30 03:24:59 - b'INFO  main_assemble.cpp             :  221 - Number unitigs disconnected: 397, time: 11.497'
2022-06-30 03:25:02 - b'INFO  main_assemble.cpp             :  245 - Unitigs removed in excessive pruning: 0, time: 2.766'
2022-06-30 03:25:02 - b'INFO  main_assemble.cpp             :  184 - Graph cleaning round 13'
2022-06-30 03:25:43 - b'INFO  main_assemble.cpp             :  192 - Tips removed: 0, time: 41.113'
2022-06-30 03:25:57 - b'INFO  main_assemble.cpp             :  200 - Number of bubbles removed: 0, Time elapsed(sec): 13.948'
2022-06-30 03:26:15 - b'INFO  main_assemble.cpp             :  210 - Number of complex bubbles removed: 0, Time elapsed(sec): 17.838713'
2022-06-30 03:26:27 - b'INFO  main_assemble.cpp             :  221 - Number unitigs disconnected: 397, time: 11.459'
2022-06-30 03:26:29 - b'INFO  main_assemble.cpp             :  245 - Unitigs removed in excessive pruning: 0, time: 2.780'
2022-06-30 03:26:29 - b'INFO  main_assemble.cpp             :  184 - Graph cleaning round 14'
2022-06-30 03:27:10 - b'INFO  main_assemble.cpp             :  192 - Tips removed: 0, time: 41.058'
2022-06-30 03:27:24 - b'INFO  main_assemble.cpp             :  200 - Number of bubbles removed: 0, Time elapsed(sec): 13.921'
2022-06-30 03:27:42 - b'INFO  main_assemble.cpp             :  210 - Number of complex bubbles removed: 0, Time elapsed(sec): 17.745705'
2022-06-30 03:27:53 - b'INFO  main_assemble.cpp             :  221 - Number unitigs disconnected: 397, time: 11.458'
2022-06-30 03:27:56 - b'INFO  main_assemble.cpp             :  245 - Unitigs removed in excessive pruning: 0, time: 2.762'
2022-06-30 03:27:56 - b'INFO  main_assemble.cpp             :  184 - Graph cleaning round 15'
2022-06-30 03:28:37 - b'INFO  main_assemble.cpp             :  192 - Tips removed: 0, time: 41.033'
2022-06-30 03:28:51 - b'INFO  main_assemble.cpp             :  200 - Number of bubbles removed: 0, Time elapsed(sec): 13.968'
2022-06-30 03:29:09 - b'INFO  main_assemble.cpp             :  210 - Number of complex bubbles removed: 0, Time elapsed(sec): 17.867751'
2022-06-30 03:29:20 - b'INFO  main_assemble.cpp             :  221 - Number unitigs disconnected: 397, time: 11.336'
2022-06-30 03:29:23 - b'INFO  main_assemble.cpp             :  245 - Unitigs removed in excessive pruning: 0, time: 2.783'
2022-06-30 03:29:23 - b'INFO  main_assemble.cpp             :  184 - Graph cleaning round 16'
2022-06-30 03:30:04 - b'INFO  main_assemble.cpp             :  192 - Tips removed: 0, time: 41.030'
2022-06-30 03:30:18 - b'INFO  main_assemble.cpp             :  200 - Number of bubbles removed: 0, Time elapsed(sec): 14.048'
2022-06-30 03:30:36 - b'INFO  main_assemble.cpp             :  210 - Number of complex bubbles removed: 0, Time elapsed(sec): 17.704088'
2022-06-30 03:30:47 - b'INFO  main_assemble.cpp             :  221 - Number unitigs disconnected: 397, time: 11.443'
2022-06-30 03:30:50 - b'INFO  main_assemble.cpp             :  245 - Unitigs removed in excessive pruning: 0, time: 2.775'
2022-06-30 03:30:50 - b'INFO  main_assemble.cpp             :  184 - Graph cleaning round 17'
2022-06-30 03:31:31 - b'INFO  main_assemble.cpp             :  192 - Tips removed: 0, time: 41.054'
2022-06-30 03:31:45 - b'INFO  main_assemble.cpp             :  200 - Number of bubbles removed: 0, Time elapsed(sec): 14.050'
2022-06-30 03:32:03 - b'INFO  main_assemble.cpp             :  210 - Number of complex bubbles removed: 0, Time elapsed(sec): 17.867730'
2022-06-30 03:32:15 - b'INFO  main_assemble.cpp             :  221 - Number unitigs disconnected: 397, time: 11.337'
2022-06-30 03:32:17 - b'INFO  main_assemble.cpp             :  245 - Unitigs removed in excessive pruning: 0, time: 2.774'
2022-06-30 03:32:17 - b'INFO  main_assemble.cpp             :  184 - Graph cleaning round 18'
2022-06-30 03:32:58 - b'INFO  main_assemble.cpp             :  192 - Tips removed: 0, time: 40.963'
2022-06-30 03:33:12 - b'INFO  main_assemble.cpp             :  200 - Number of bubbles removed: 0, Time elapsed(sec): 13.894'
2022-06-30 03:33:30 - b'INFO  main_assemble.cpp             :  210 - Number of complex bubbles removed: 0, Time elapsed(sec): 17.836378'
2022-06-30 03:33:41 - b'INFO  main_assemble.cpp             :  221 - Number unitigs disconnected: 397, time: 11.335'
2022-06-30 03:33:44 - b'INFO  main_assemble.cpp             :  245 - Unitigs removed in excessive pruning: 0, time: 2.785'
2022-06-30 03:33:44 - b'INFO  main_assemble.cpp             :  184 - Graph cleaning round 19'
2022-06-30 03:34:25 - b'INFO  main_assemble.cpp             :  192 - Tips removed: 0, time: 41.007'
2022-06-30 03:34:39 - b'INFO  main_assemble.cpp             :  200 - Number of bubbles removed: 0, Time elapsed(sec): 13.892'
2022-06-30 03:34:57 - b'INFO  main_assemble.cpp             :  210 - Number of complex bubbles removed: 0, Time elapsed(sec): 17.798166'
2022-06-30 03:35:08 - b'INFO  main_assemble.cpp             :  221 - Number unitigs disconnected: 397, time: 11.483'
2022-06-30 03:35:11 - b'INFO  main_assemble.cpp             :  245 - Unitigs removed in excessive pruning: 0, time: 2.779'
2022-06-30 03:35:11 - b'INFO  main_assemble.cpp             :  184 - Graph cleaning round 20'
2022-06-30 03:35:52 - b'INFO  main_assemble.cpp             :  192 - Tips removed: 0, time: 40.954'
2022-06-30 03:36:06 - b'INFO  main_assemble.cpp             :  200 - Number of bubbles removed: 0, Time elapsed(sec): 13.951'
2022-06-30 03:36:24 - b'INFO  main_assemble.cpp             :  210 - Number of complex bubbles removed: 0, Time elapsed(sec): 17.844196'
2022-06-30 03:36:35 - b'INFO  main_assemble.cpp             :  221 - Number unitigs disconnected: 397, time: 11.454'
2022-06-30 03:36:38 - b'INFO  main_assemble.cpp             :  245 - Unitigs removed in excessive pruning: 0, time: 2.766'

This occurs with multiple datasets and at multiple k-mer lengths.