Question for genomesize generated by meta-Flye

mikolmogorov / Flye

De novo assembler for single molecule sequencing reads using repeat graphs

Question for genomesize generated by meta-Flye #227

Closed. BigNianNGS closed this issue 4 years ago

BigNianNGS commented 4 years ago

Hello,

Thank you for developing the metaFlye assembly module. I have two questions about it.

Q1: Why is the difference in genome size between Zymo Log and Zymo Even so large?

I have tested metaFlye on a very complex gut content sample, which is probably closer to Zymo Log. The genome size assembled from ONT reads was 2.5 Gbp with a 2 kb min-overlap, while the genome size assembled from NGS reads was 5 Gbp using MEGAHIT (--presets meta-large). Both ONT and NGS libraries were sequenced from the same DNA and generated 60 Gbp of raw data.

Q2: Why is the difference in genome size between the ONT and NGS assemblies so large for a complex metagenomic sample? Can you give me some advice?

Thank you in advance !

mikolmogorov commented 4 years ago

Hi,

Q1: These datasets have very different coverages. Some bacteria in ZymoLog had coverage below 1x and were thus impossible to assemble.
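
To illustrate the coverage point, here is a rough back-of-the-envelope sketch of how expected per-species coverage follows from the total sequencing yield, the species' share of the DNA, and its genome size. The function name and all numbers are hypothetical and are not taken from the Zymo datasets.

```python
def expected_coverage(total_bases, dna_fraction, genome_size):
    """Approximate read coverage of one species in a metagenome.

    total_bases  -- total sequenced bases in the dataset
    dna_fraction -- fraction of the DNA belonging to this species (0..1)
    genome_size  -- genome size of this species in bases
    """
    return total_bases * dna_fraction / genome_size


# Hypothetical example: 10 Gbp of reads, a species making up 0.01% of the
# DNA, and a 5 Mbp genome. The result is well below 1x.
cov = expected_coverage(10e9, 0.0001, 5e6)
print(f"{cov:.2f}x")
```

In a log-distributed community the rarest members can easily fall into this range, even when the total yield looks generous.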

Q2: Not surprising at all for real metagenomes. Different technologies might be picking up different parts of the metagenome. Plus, Illumina assemblers typically output contigs with very low read coverage (like 1x or 2x), which metaFlye does not. It makes sense to set reasonable length and coverage cutoffs for contigs before making the comparison.
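
A minimal sketch of applying such cutoffs before comparing total assembly sizes. It assumes a tab-separated table with contig name, length, and coverage in the first three columns (loosely modelled on the first columns of Flye's assembly_info.txt; check the layout for your version and for your NGS assembler). The function names and the cutoff values (1000 bp, 3x) are illustrative assumptions, not recommendations from this thread.

```python
import csv
import io


def parse_contig_table(text):
    """Parse tab-separated rows of (name, length, coverage).

    Lines whose first field starts with '#' are treated as headers.
    """
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue
        rows.append((fields[0], int(fields[1]), float(fields[2])))
    return rows


def filtered_total_length(contigs, min_length=1000, min_coverage=3.0):
    """Sum the lengths of contigs that pass both cutoffs."""
    return sum(
        length
        for _name, length, cov in contigs
        if length >= min_length and cov >= min_coverage
    )


# Made-up example table, for illustration only.
example = (
    "#seq_name\tlength\tcov.\n"
    "contig_1\t50000\t25\n"
    "contig_2\t800\t40\n"
    "contig_3\t12000\t1.5\n"
)
contigs = parse_contig_table(example)
print(filtered_total_length(contigs))        # total after cutoffs
print(filtered_total_length(contigs, 0, 0))  # total without cutoffs
```

Running the same filter with identical cutoffs on both the ONT and the NGS contig tables gives totals that are more directly comparable than the raw assembly sizes.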

BigNianNGS commented 4 years ago

Thank you for the advice. Can I use 'consensus.fasta' instead of the final 'assembly.fasta' for downstream analysis? The genome size of 'assembly.fasta' turned out to be much smaller than that of 'consensus.fasta', for example 2.3 Gbp vs. 3.0 Gbp. I have also checked that most of the lost regions were not repeats, which is strange.

BigNianNGS commented 4 years ago

Another question: can I take the assemblies from NGS and the '10-consensus/consensus.fasta' from ONT (polished with racon and medaka, and even pilon) and combine them using the '--subassemblies' parameter? I think this would take advantage of both technologies and give better results.

mikolmogorov commented 4 years ago

@BigNianNGS no, the intermediate sequences should not be used for the analysis, as they might contain duplications and misassemblies.