ruanjue / wtdbg2

Redbean: A fuzzy Bruijn graph approach to long noisy reads assembly
GNU General Public License v3.0

Metagenome assembly recommendations? #78

Closed by mikolmogorov 4 years ago

mikolmogorov commented 5 years ago

Hi,

Are there any parameter recommendations for running wtdbg2 on metagenomes? Currently with the 2.3 release, I am getting good representation of all bacteria in PacBio HMP mock assembly (https://github.com/PacificBiosciences/DevNet/wiki/Human_Microbiome_Project_MockB_Shotgun). On the other hand, ONT Zymo assembly (https://github.com/LomanLab/mockcommunity) seems to be missing a few species with coverage above the median dataset coverage. I am setting genome size to the total size of all organisms in the mixture - is that right?
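For reference, the "total size of all organisms" figure described above can be computed by summing the sequence lengths of the per-organism reference FASTA files. The following is a minimal Python sketch under that assumption; the function names and the toy files are hypothetical and are not part of wtdbg2. It only shows how to obtain the sum and does not settle whether that sum is the best genome-size setting for a metagenome.

```python
import gzip
import os
import tempfile


def fasta_total_length(path):
    """Count sequence characters in one FASTA file (plain text or .gz)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    total = 0
    with opener(path, "rt") as handle:
        for line in handle:
            if not line.startswith(">"):  # skip header lines
                total += len(line.strip())
    return total


def combined_genome_size(paths):
    """Sum the sequence lengths of several reference FASTA files."""
    return sum(fasta_total_length(p) for p in paths)


# Toy demonstration with two tiny, made-up references.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, seq in [("a.fa", "ACGTACGT"), ("b.fa", "ACGT\nAC")]:
        p = os.path.join(tmp, name)
        with open(p, "w") as fh:
            fh.write(f">{name}\n{seq}\n")
        paths.append(p)
    print(combined_genome_size(paths))  # 8 + 6 = 14
```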

Best, Mikhail

ruanjue commented 5 years ago

Hi,

I previously assembled the ONT Zymo dataset with wtdbg2 -t 64 -i Zymo-GridION-LOG-BB-SN.fq.gz -fo dbg -x ont --node-max 1000 -e 2, but stopped there because I was not sure how to evaluate the result. I guess --node-max will be more important in meta-assembly. It would be valuable if someone provided a script suitable for metagenomes.

Best, Jue

mikolmogorov commented 5 years ago

Thanks, I'll give it a try. For reference evaluations, I can recommend metaQUAST (http://quast.sourceforge.net/metaquast) - in my case it was very helpful.

Best, Mikhail

ruanjue commented 5 years ago

Thanks for the information.

mikolmogorov commented 5 years ago

Hi,

Just wanted to get back to you with my tests. These parameters indeed improved the coverage of the Zymo Even dataset (total assembly size 28Mb -> 55Mb). On the other hand, they seem to have hurt contiguity: NG50 dropped from 2.7Mb with the default parameters to 614Kb with the custom parameters (G = 25Mb for both statistics).
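For readers comparing these numbers: NG50 is the length of the contig at which the running total of contig lengths, sorted longest first, reaches half of a fixed genome size G, rather than half of the assembly size as in N50. Fixing G is what makes the two assemblies comparable even though their total sizes differ. A minimal Python sketch of the statistic, using made-up toy contig lengths rather than the real assemblies:

```python
def ng50(contig_lengths, genome_size):
    """Length of the contig at which the cumulative length of contigs,
    sorted longest first, reaches half of genome_size.
    Returns 0 if the assembly never covers half of genome_size."""
    half = genome_size / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    return 0


# Toy contig lengths (illustrative only).
lengths = [50, 40, 30, 20, 10]

# N50 is the same computation with G set to the assembly size itself.
print(ng50(lengths, sum(lengths)))  # G = 150, half = 75: 50 + 40 = 90 -> 40
print(ng50(lengths, 200))           # G = 200, half = 100: 50 + 40 + 30 = 120 -> 30
```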

Best, Mikhail

ruanjue commented 5 years ago

Thanks. It looks like more careful operations on the assembly graph might give a better combination of assembly size and NG50.

BTW, I am trying to develop a new graph cleaning algorithm based more on read paths. If it works better on metagenomes, I will post the results in this thread.

Best, Jue

ahcm commented 5 years ago

That sounds like it might improve difficult regions like repeats too. I would be very interested in testing it. Thanks!

SamStudio8 commented 5 years ago

We've been assembling our ONT data with -K 10000 --node-max 6000 -S1, and varying -p, -e and -L for experimentation. For wtdbg2 v2.4, I've also added -X 6000 -g 62m, which seems to get output very close to the version we used in our preprint. We evaluated the contig quality by generating dotplots of their identity to some recently published corresponding PacBio references, using scripts that can be found in our repository.
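A sweep like the one described above could be scripted roughly as follows. This is only a sketch: the fixed options are the ones quoted in this comment (with the --node-max spelling from the command earlier in the thread), -t, -i and -fo follow that same earlier command, and the input file name, thread count, output prefixes and the specific -p, -e and -L values are hypothetical placeholders, not recommendations. The code only builds and prints the command lines; it does not run wtdbg2.

```python
from itertools import product

# Fixed options as quoted in the comment above.
FIXED = ["-K", "10000", "--node-max", "6000", "-S", "1", "-X", "6000", "-g", "62m"]


def sweep_commands(reads, p_values, e_values, l_values, threads=16):
    """Build one wtdbg2 command (as an argument list) per combination
    of the -p, -e and -L values being varied."""
    commands = []
    for p, e, min_len in product(p_values, e_values, l_values):
        prefix = f"asm_p{p}_e{e}_L{min_len}"  # hypothetical output prefix
        commands.append(
            ["wtdbg2", "-t", str(threads), "-i", reads, "-fo", prefix]
            + FIXED
            + ["-p", str(p), "-e", str(e), "-L", str(min_len)]
        )
    return commands


# Hypothetical input file and parameter values, for illustration only.
for cmd in sweep_commands("reads.fq.gz", [19, 21], [2, 3], [5000]):
    print(" ".join(cmd))
```

Each printed line can then be run by hand or passed to a job scheduler, and the resulting assemblies compared with the same evaluation for every run.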

ruanjue commented 5 years ago

v2.4 fixed some bugs in v2.3 and improved the output efficiency. I am still working on replacing the old graph cleaning by fully exploring read paths, and will release v2.4 after finishing this.