voutcn / megahit

Ultra-fast and memory-efficient (meta-)genome assembler
http://www.ncbi.nlm.nih.gov/pubmed/25609793
GNU General Public License v3.0

Additional assembly step for metagenome assembled genomes (MAGs) #241

Closed by franciscozorrilla 4 years ago

franciscozorrilla commented 4 years ago

Hi,

I have noticed that many bacterial MAGs generated seem to have highly fragmented genomes (~200 contigs on average), so I am trying to implement an additional assembly step as shown here:

[image: diagram of the additional assembly step]

I have tested on one particular MAG I generated, but the additional assembly seems to reduce the total base pair content and increase the number of contigs!

The original MAG has 54 contigs and 2 146 230 bp, while the re-assembled MAG has 62 contigs and 1 330 953 bp.

megahit -r ERR260260_110.fa . --verbose -t 16

2019-09-21 16:10:59 - MEGAHIT v1.2.8
2019-09-21 16:10:59 - Using megahit_core with POPCNT and BMI2 support
2019-09-21 16:10:59 - Convert reads to binary library
2019-09-21 16:11:00 - INFO  sequence/io/sequence_lib.cpp  :   77 - Lib 0 (/c3se/NOBACKUP/groups/c3-c3se605-19-1/grid/test/ERR260260_110.fa): se, 54 reads, 189839 max length
2019-09-21 16:11:00 - INFO  utils/utils.h                 :  152 - Real: 0.1658 user: 0.0074    sys: 0.0147 maxrss: 5468
2019-09-21 16:11:00 - k-max reset to: 141 
2019-09-21 16:11:00 - Start assembly. Number of CPU threads 32 
2019-09-21 16:11:00 - k list: 21,29,39,59,79,99,119,141 
2019-09-21 16:11:00 - Memory used: 181106694144
2019-09-21 16:11:00 - Extract solid (k+1)-mers for k = 21 
2019-09-21 16:11:00 - Build graph for k = 21 
2019-09-21 16:11:03 - Assemble contigs from SdBG for k = 21
2019-09-21 16:11:04 - Local assembly for k = 21
2019-09-21 16:11:04 - Extract iterative edges from k = 21 to 29 
2019-09-21 16:11:05 - Build graph for k = 29 
2019-09-21 16:11:05 - Assemble contigs from SdBG for k = 29
2019-09-21 16:11:11 - Local assembly for k = 29
2019-09-21 16:11:11 - Extract iterative edges from k = 29 to 39 
2019-09-21 16:11:11 - Build graph for k = 39 
2019-09-21 16:11:12 - Assemble contigs from SdBG for k = 39
2019-09-21 16:11:16 - Local assembly for k = 39
2019-09-21 16:11:17 - Extract iterative edges from k = 39 to 59 
2019-09-21 16:11:17 - Build graph for k = 59 
2019-09-21 16:11:18 - Assemble contigs from SdBG for k = 59
2019-09-21 16:11:21 - Local assembly for k = 59
2019-09-21 16:11:22 - Extract iterative edges from k = 59 to 79 
2019-09-21 16:11:22 - Build graph for k = 79 
2019-09-21 16:11:22 - Assemble contigs from SdBG for k = 79
2019-09-21 16:11:27 - Local assembly for k = 79
2019-09-21 16:11:27 - Extract iterative edges from k = 79 to 99 
2019-09-21 16:11:27 - Build graph for k = 99 
2019-09-21 16:11:28 - Assemble contigs from SdBG for k = 99
2019-09-21 16:11:32 - Local assembly for k = 99
2019-09-21 16:11:33 - Extract iterative edges from k = 99 to 119 
2019-09-21 16:11:33 - Build graph for k = 119 
2019-09-21 16:11:33 - Assemble contigs from SdBG for k = 119
2019-09-21 16:11:37 - Local assembly for k = 119
2019-09-21 16:11:37 - Extract iterative edges from k = 119 to 141 
2019-09-21 16:11:37 - Merging to output final contigs 
2019-09-21 16:11:38 - 62 contigs, total 1330953 bp, min 449 bp, max 89977 bp, avg 21466 bp, N50 39877 bp
2019-09-21 16:11:39 - ALL DONE. Time elapsed: 40.007807 seconds 

Original MAG: ERR260260_110.fa.txt
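To check the contig counts and base-pair totals independently of the assembler's summary line, the same statistics can be computed directly from each FASTA file. The following is a minimal sketch in Python, not part of MEGAHIT; the function names `contig_lengths` and `assembly_stats` and the toy records are made up for illustration.

```python
from typing import Dict, Iterable, List


def contig_lengths(fasta_lines: Iterable[str]) -> List[int]:
    """Return the sequence length of every record in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None  # stays None until the first '>' header is seen
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def assembly_stats(lengths: List[int]) -> Dict[str, int]:
    """Contig count, total bp, min/max contig length and N50."""
    if not lengths:
        return {"contigs": 0, "total_bp": 0, "min": 0, "max": 0, "n50": 0}
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    n50 = 0
    # N50: length of the contig at which the running sum of the
    # longest-first lengths first reaches half of the total.
    for length in ordered:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "contigs": len(ordered),
        "total_bp": total,
        "min": ordered[-1],
        "max": ordered[0],
        "n50": n50,
    }


# Toy input (hypothetical records, not the real MAG).
toy_fasta = [">a", "ACGT", "AC", ">b", "ACG"]
print(assembly_stats(contig_lengths(toy_fasta)))
```

For a real file, the lines could come from `open("ERR260260_110.fa")` and from the assembler's output FASTA, so that the two runs are compared with the same definition of each statistic.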

I understand that this is perhaps not the intended use of megahit, but I figure it should still be possible. Do you think there is some combination of parameters that would allow me to achieve my goal?

Thanks, Francisco

voutcn commented 4 years ago

Feeding contigs into a second round of assembly is not a good idea, because the sequencing depth information (useful for error pruning) is missing.
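A toy illustration of why depth matters, as a sketch rather than a description of MEGAHIT's internals: when contigs are the input, almost every k-mer is seen exactly once, so any filter that keeps only k-mers seen at least some minimum number of times discards most of the sequence. The threshold of 2 below is an assumption (as far as I know it matches the default of MEGAHIT's `--min-count` option), and the 10 bp "genome" and helper names are made up.

```python
from collections import Counter
from typing import Iterable

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    revcomp = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, revcomp)


def kmer_counts(sequences: Iterable[str], k: int) -> Counter:
    """Count canonical k-mers across all input sequences."""
    counts: Counter = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers containing N or other symbols
                counts[canonical(kmer)] += 1
    return counts


def fraction_surviving(counts: Counter, min_count: int) -> float:
    """Fraction of distinct k-mers seen at least min_count times."""
    if not counts:
        return 0.0
    kept = sum(1 for c in counts.values() if c >= min_count)
    return kept / len(counts)


# Toy data (made up for illustration): a 10 bp "genome".
genome = "ACGGTCATTG"
reads = [genome] * 3   # idealised 3x depth: every k-mer is seen three times
contigs = [genome]     # the same sequence fed back as a single contig

K = 5
MIN_COUNT = 2  # assumed multiplicity threshold

print(fraction_surviving(kmer_counts(reads, K), MIN_COUNT))    # 1.0
print(fraction_surviving(kmer_counts(contigs, K), MIN_COUNT))  # 0.0
```

With read data the true k-mers survive the filter while rare, error-derived ones are dropped; with contigs as input there is no multiplicity left to separate the two, which is in line with the loss of base pairs reported above.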

franciscozorrilla commented 4 years ago

I see, that would explain the output of my second round of binning. Would you instead recommend using each bin as a reference and performing a second round of assembly with a reference-guided assembly tool?

voutcn commented 4 years ago

I have never tried such a method. It may or may not help, depending on the data and the tools you use.