rrwick / Metagenomics-Index-Correction

GNU General Public License v3.0

Merging GTDB database with viral Refseq sequences: Kraken2 #10

Open HitMonk opened 4 years ago

HitMonk commented 4 years ago

Hello everyone, I am building GTDB databases to be used with Kraken2. I was wondering whether it is possible to add viral RefSeq sequences while building Kraken2 databases. GTDB consists of archaeal and bacterial genomes only, so it would be ideal to merge it with viral RefSeq sequences. I'm not sure how to go about doing this. Please let me know if you have any suggestions.

chassenr commented 4 years ago

Hi @HitMonk, I have hit the same snag with GTDB. Right now I am working on a Snakemake workflow to merge GTDB with the various RefSeq partitions. If you want, I can keep you updated.

Cheers, Christiane

HitMonk commented 4 years ago

Hello @chassenr, that would be really helpful! I have access to multiple datasets here and can contribute by testing different builds if you would like. I am also trying to approach it using taxonomies generated with taxonkit. I will post an update if it works. Looking forward to using your workflow :)

choon-sim commented 4 years ago

Hi @chassenr and @HitMonk, will you keep me posted too? I am also working on making a Kraken2 database with GTDB genomes and RefSeq viral and fungal genomes.

chassenr commented 4 years ago

Hi @HitMonk and @choon-sim, sure! I am almost done with the workflow (only running some validation steps at the moment). With this workflow you will be able to build a joint Kraken2 database for GTDB and any of the NCBI partitions (fungi, plant, protozoa, invertebrate, vertebrate, viral) that you choose, each at an individually defined dereplication threshold. I hope to have it out by the end of the month, if you can be patient a little longer. I will link the repo here once I am done. I already want to say a big thank-you to the maintainers of the Metagenomics Index Correction repo: their scripts have been incredibly useful!

Cheers, Christiane

chassenr commented 3 years ago

Hi @HitMonk and @choon-sim, I am really sorry that I was not able to get back to you sooner. The workflow I am developing is becoming slightly more complicated than initially anticipated. To give you a short idea of how to combine any RefSeq genome with GTDB in a joint Kraken2 database in a quick-and-dirty way, here are a few suggestions:

I hope this helps and is still relevant for you (again, sorry for the late reply). Please let me know if you have further questions. I will let you know once my workflow is finished.

Cheers, Christiane

davve2 commented 3 years ago

@HitMonk @chassenr @choon-sim

I have developed a script that allows you to merge databases from many different sources and keep/remove annotations to your liking. The tutorial will create a database with NCBI taxa as base but where the tree structure on Bacteria and Archaea respectively will be replaced to follow GTDB taxonomy.

If only viral genomes (on top of GTDB) are of interest, use a genome directory containing only viral genomes when creating the NCBI database!

https://github.com/FOI-Bioinformatics/flextaxd
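One way to prepare such a viral-only genome directory is sketched below. This is a minimal sketch under stated assumptions, not part of FlexTaxD or of the scripts in this repository: it assumes you have downloaded an NCBI `assembly_summary.txt` for the viral RefSeq partition (tab-separated, `#`-prefixed header lines, assembly accession in the first column) and that your genome file names contain the GCF/GCA accession. The function names `read_accessions` and `link_selected_genomes` are hypothetical.

```python
import re
from pathlib import Path

# NCBI assembly accessions look like GCF_000819615.1 or GCA_000008865.2.
ACCESSION_RE = re.compile(r"GC[AF]_\d{9}\.\d+")


def read_accessions(summary_path):
    """Collect assembly accessions from the first column of an assembly summary.

    Assumption: header/comment lines start with '#', data lines are tab-separated.
    """
    accessions = set()
    with open(summary_path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            accessions.add(line.split("\t", 1)[0].strip())
    return accessions


def link_selected_genomes(genome_dir, out_dir, accessions):
    """Symlink every genome file whose name carries a selected accession.

    Returns the sorted list of linked file names.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    linked = []
    for path in sorted(Path(genome_dir).iterdir()):
        match = ACCESSION_RE.search(path.name)
        if match and match.group(0) in accessions:
            target = out / path.name
            if not target.exists():
                target.symlink_to(path.resolve())
            linked.append(path.name)
    return linked
```

Calling `link_selected_genomes("genomes", "viral_only", read_accessions("assembly_summary.txt"))` would then leave a `viral_only` directory that can be passed on as the genome directory; the paths here are placeholders. Symlinks are used so that the genomes are not copied.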

HitMonk commented 3 years ago

Hello @davve2, I haven't used this yet, but it looks perfect! So far I have been using a workaround pipeline built on the scripts from Metagenomics Index Correction. Do you know if I can use this for 16S sequences too? I'm trying to build a GTDB 16S database that is compatible with Kraken, but so far nothing seems to be able to do that.

choon-sim commented 3 years ago

Hi, thanks a lot for developing the script and telling me about it! I will give it a try and let you know.

Choon


HitMonk commented 3 years ago

@choon-sim Hello, also, if you would like to continue using MGI, here is the approach I use. It has the steps and links to the scripts used.

davve2 commented 3 years ago

> Hello @davve2, I haven't used this yet, but it looks perfect! So far I have been using a workaround pipeline built on the scripts from Metagenomics Index Correction. Do you know if I can use this for 16S sequences too? I'm trying to build a GTDB 16S database that is compatible with Kraken, but so far nothing seems to be able to do that.

Let me know if you do try to build a 16S database! Also, if you have problems, please raise an issue with some examples!

I have not tried creating any 16S databases myself yet, but I don't see why it wouldn't work. The seq-to-taxid annotation only uses either a GCF/GCA number or a complete filename (with or without .f*a). I can see how some of the scripts from Metagenomics Index Correction could be useful for preprocessing 16S data! But going from sequences (and a seq2taxid table) to Kraken2, FlexTaxD should be able to do the job!
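To make the "GCF/GCA number or complete filename" idea concrete, here is a minimal sketch of how such a key could be derived from file names and paired with taxids. It is an illustration under assumptions, not FlexTaxD's actual code: `genome_key` and `seq2taxid_rows` are hypothetical helpers, and the column layout FlexTaxD expects for a seq2taxid table is not stated in this thread, so check its documentation before using the output.

```python
import re
from pathlib import Path

# NCBI assembly accessions such as GCF_000005845.2 or GCA_000008865.2.
ACCESSION_RE = re.compile(r"GC[AF]_\d{9}\.\d+")
# FASTA-like suffixes matching the ".f*a" pattern (.fa, .fna, .fasta), optionally gzipped.
FASTA_SUFFIX_RE = re.compile(r"\.f[a-z]*a(\.gz)?$")


def genome_key(filename):
    """Return the assembly accession if the name has one, else the name minus its FASTA suffix."""
    name = Path(filename).name
    match = ACCESSION_RE.search(name)
    if match:
        return match.group(0)
    return FASTA_SUFFIX_RE.sub("", name)


def seq2taxid_rows(filenames, taxid_by_key):
    """Pair each file's key with its taxid, skipping files without a known taxid.

    taxid_by_key is a user-supplied dict (hypothetical input), e.g. built
    from GTDB metadata or a hand-made table for 16S sequences.
    """
    rows = []
    for filename in filenames:
        key = genome_key(filename)
        if key in taxid_by_key:
            rows.append((key, taxid_by_key[key]))
    return rows
```

For 16S sequences that have no assembly accession, the fallback branch means the file name itself (without `.fa`, `.fna` or `.fasta`) acts as the key, so the taxid table would need to use those same names.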

davve2 commented 3 years ago

> Hi, thanks a lot for developing the script and telling me about it! I will give it a try and let you know.

Great, thanks! If you run into problems, please raise an issue!

HitMonk commented 3 years ago

@davve2 I'll have an update for you in a few weeks :)