voutcn / megahit

Ultra-fast and memory-efficient (meta-)genome assembler
http://www.ncbi.nlm.nih.gov/pubmed/25609793
GNU General Public License v3.0

Metagenome assembly: counter-intuitive results assembling unmerged vs. merged reads #316


rcmueller commented 2 years ago

Hi All,

For a large soil metagenome project, we trial-assemble subsets to find the best strategy and settings for megahit.

We assembled 93 million PE reads derived from one of the soil samples (Illumina NovaSeq, 2 x 150 bp, fragment size < 300 bp) after QC, dereplication, etc.

We also merged read pairs with sufficient overlap (using vsearch), since the fragment size is small enough for some reads 1 and 2 to overlap. The assemblies of the unmerged and merged read datasets differ significantly, and the results are (in our opinion) counter-intuitive:

  1. Assembly of the unmerged read dataset. Input: PE, 2 x 93 million reads (options -1 and -2). Output: 5.8 million contigs, N50 = 484 bp, max length = 96 Mb.

  2. Assembly of the merged read dataset. Input: 60 million merged reads plus 2 x 33 million unmerged reads (the remainder that could not be merged; options -r, -1 and -2). Output: 7.6 million contigs, N50 = 470 bp, max length = 68 Mb.

Assembler settings on both runs: preset = meta-sensitive, 40 cores, 450 GB memory
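
For reference, the two megahit calls looked roughly like this (a sketch from memory; file names are placeholders and the memory option is omitted):

```bash
# Run 1: unmerged PE reads only (placeholder file names)
megahit --presets meta-sensitive \
    -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz \
    -t 40 -o assembly_unmerged

# Run 2: merged reads (-r) plus the read pairs that could not be merged
megahit --presets meta-sensitive \
    -r sample_merged.fastq.gz \
    -1 sample_notmerged_R1.fastq.gz -2 sample_notmerged_R2.fastq.gz \
    -t 40 -o assembly_merged
```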

We would have expected the assembly of the merged read dataset to produce fewer and longer contigs, since around 65% of the reads were already merged, but the opposite was true. Also, the walltime was longer for the merged read dataset (72 h vs. 60 h for unmerged).

We are now wondering whether (i) megahit tries to merge reads from a PE library if they overlap at all, and (ii) the assembly of the unmerged read dataset possibly produced chimeric sequences and thus artificially inflated the results (compared to the assembly of the merged read dataset).

Is there anything obvious we are missing? Any help/insight appreciated.

Cheers, Ralf

jvollme commented 2 years ago

Hi Ralf, this is probably too late, and I am not involved in the development of megahit in any way, but I think I may have an idea of what could have happened here.

Did you set the --min-contig-len option, or leave it at the default value? Because in the latter case, only contigs above 200 bp are returned.

Especially during the assembly of a complex dataset such as soil, it is likely that many fragments cannot be reconstructed/assembled beyond the original read length, because the coverage is too low to find fully overlapping k-mers between different reads. In these cases the assembled contig basically represents the original read. In the case of your reads (single read length 150 bp, fragment length < 300 bp) these would be below 200 bp and therefore not returned when using the default setting for --min-contig-len. However, unless you massively over-sheared your sample, your merged reads will likely be in the range of 200-300 bp, right? So these WOULD be returned under default settings, artificially increasing your contig count and lowering your N50 (because the number of unassembled reads is usually frustratingly massive for complex samples such as soil).

So in order to get actually comparable numbers, you would first have to filter both final assemblies to the same minimum contig size (remove all contigs below a certain contig length). I would recommend at least 500 bp, but in case you are planning to do binning, even 1 kb may be more appropriate.

Also, I would recommend always using a --min-contig-len setting that is above your longest merged/unassembled read. I usually use 500 bp.
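
To make your two existing runs comparable after the fact, you could filter both final assemblies to the same cutoff and recompute the stats. Something like this should work (just a sketch, assuming seqkit is available; the file paths are placeholders):

```bash
# Keep only contigs >= 500 bp from each assembly (placeholder paths)
seqkit seq -m 500 assembly_unmerged/final.contigs.fa > unmerged_min500.fa
seqkit seq -m 500 assembly_merged/final.contigs.fa > merged_min500.fa

# Recompute contig count, N50, max length, etc. on the filtered sets
seqkit stats -a unmerged_min500.fa merged_min500.fa
```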

Cheers, John

rcmueller commented 2 years ago

Thanks, @jvollme for your insights! And yes, most merged reads are around 260 bp.

Indeed, we ended up using megahit with default settings (except for meta-sensitive) on the unmerged data. However, I am still wondering whether megahit looks for overlapping fragments first in a PE library, or whether it simply applies its k-mer-based algorithm. Looking at the results and reading your idea of what was going on, I assume the latter. For the current project, we are interested in functional genes (gene prediction -> annotation -> comparison between treatments), but in a potential follow-up study, we might bin MAGs and therefore filter for larger contigs only (or redo the assembly with --min-contig-len as you suggested).

For the sake of completeness: I am also currently looking into Plass as an alternative to megahit/prodigal, which at first glance looks promising (both processing time and results).

Cheers, Ralf

jvollme commented 2 years ago

Hmm, from what I can gather, Plass should be suitable for functional gene analysis. But for your potential follow-up (reconstructing MAGs) you will probably have to fall back on megahit after all... I haven't tracked the most recent assemblers for some time now, but as far as I understand, metaSPAdes tracks more information from the paired reads (using k-mer pairs from both reads and tracking the exact distance between them) and also often gives superior assemblies, but it is tricky to use for co-assembly of multiple datasets and has much higher memory consumption and run times compared to megahit. For very large and complex datasets that I want to bin, I personally still stick with megahit.

rcmueller commented 2 years ago

Yeah, unfortunately, metaSPAdes could not handle our large datasets. Looking at the current literature (complex metagenomes and assembler comparisons), megahit still seems to be the assembler of choice.

jvollme commented 2 years ago

In that case I would still recommend merging the reads first (it should significantly increase the likelihood of assembling very low-abundance genome fragments). But of course this would only affect the low-abundance genomes. For high-abundance organisms, with already enough read coverage to find fully overlapping k-mers between overlapping reads, it should basically have no effect. So the question of whether to merge or not is basically: how sensitive do you want to be for low-abundance species as well?

rcmueller commented 2 years ago

Good point! This (and the numbers from the trial above) almost suggests, however, that megahit does not even consider the fraction of merged reads (-r) for further extension.

jvollme commented 2 years ago

Well, the numbers from the trial above are skewed and not really comparable. For the run without read merging, you only output contigs that were assembled beyond the single-read level (remember: de Bruijn graph assemblers don't actually align/assemble the reads themselves. They extract the overlapping k-mers from the reads and assemble those. For reads that do not overlap by at least k bases, that means only the original reads can be (re-)assembled, individually). For the run with merged reads, you also output the unassembled (although merged) reads, giving you an artificially higher number of "contigs" and an artificially lower N50.

You can only compare these numbers (contig count & N50) if you exclude the unassembled reads in both cases, meaning you remove all contigs below 500 bp (or at the very least 300 bp)...

I am pretty sure you will find better assembly metrics for the merged-read assembly if you consider only contigs above 500 bp for both assemblies (at least for the N50; obviously the maximum contig length will remain the same).

EDIT: Oh, and of course the merging can only actually help the assembly if it combines reads overlapping by less than k. So in case you start with minimum k=31, then only reads merged with an overlap of less than 31 will contribute novel overlapping k-mers between the paired reads that would otherwise be missed. If your minimum overlap during the read-merging step is higher than 31, then it contributes nothing new to the assembly process (as all overlaps of more than 31 bp are already represented by the extracted k-mers in the de Bruijn graph). [I basically see it as a kind of pre-assembly of only the most trusted overlaps: those between read pairs of the same fragment, which can therefore go below the otherwise minimum k-mer overlap.]

...And, of course, if you allow too many mismatches between the overlapping read parts, you do risk creating merged chimeras. So I like to require at least 15 bp of overlap and allow no more than 5% mismatches between the merged reads (so obviously 0 mismatches at a 15 bp overlap, but up to 5 mismatches at overlaps of 100 bp).
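
With vsearch that would look roughly like this (just a sketch with placeholder file names; --fastq_maxdiffpct needs a reasonably recent vsearch, otherwise approximate it with --fastq_maxdiffs):

```bash
# Merge overlapping read pairs: require >= 15 bp overlap, allow <= 5% mismatches
# in the overlap, and keep the unmerged pairs for the -1/-2 input to megahit.
vsearch --fastq_mergepairs sample_R1.fastq.gz --reverse sample_R2.fastq.gz \
    --fastq_minovlen 15 \
    --fastq_maxdiffpct 5 \
    --fastqout sample_merged.fastq \
    --fastqout_notmerged_fwd sample_notmerged_R1.fastq \
    --fastqout_notmerged_rev sample_notmerged_R2.fastq
```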

rcmueller commented 2 years ago

Thank you for elaborating on this (I almost missed your edit)! During the merge step (via vsearch; default minimum overlap = 10) we only allowed 1 expected error over at least 100 bp of merged length to (hopefully) avoid chimaeras. The idea was indeed to treat the merged reads as "trusted overlaps" which could then potentially be extended by megahit (meta-sensitive, which starts at k=21). Anyway, we ended up not using the merged reads in the current pipeline (which is already well underway). But I will come back to read merging, and will also keep only longer contigs, if we end up binning MAGs.
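
For completeness, our merge step was roughly equivalent to the following (from memory, so the exact flags may differ slightly; file names are placeholders and the remaining options were left at their defaults):

```bash
# Merge read pairs with vsearch defaults (minimum overlap 10), allowing at most
# 1 expected error in the merged read and requiring a merged length of >= 100 bp.
vsearch --fastq_mergepairs sample_R1.fastq.gz --reverse sample_R2.fastq.gz \
    --fastq_maxee 1 \
    --fastq_minmergelen 100 \
    --fastqout sample_merged.fastq \
    --fastqout_notmerged_fwd sample_notmerged_R1.fastq \
    --fastqout_notmerged_rev sample_notmerged_R2.fastq
```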