Merging GTDB and viral refseq sequences

HitMonk commented 4 years ago

Hello there! I'm sorry if this is a stupid question, but I am trying to build a database comprising both archaeal and bacterial sequences from GTDB and viral RefSeq sequences. I would also like to use the same database downstream with Bracken. I'm not sure if this is possible, but if anyone has tried it, I would appreciate learning how to do it.

Looking forward to hearing from you!

nick-youngblut commented 4 years ago

It should be possible. You would need to get the viral genome data in the format required for Struo (e.g., download all of the genome fasta files and get the taxonomies).

HitMonk commented 4 years ago

The taxonomy file is what I'm unsure about. Could you give me a couple of pointers for building a compatible taxonomy file?

nick-youngblut commented 4 years ago

You can use the entire NCBI taxdump, as long as it has the viral taxonomy, and then you'd have to merge it with the GTDB taxdump files. I don't know of a tool for merging taxdumps. If you can't find a tool, then I could add a script to https://github.com/nick-youngblut/gtdb_to_taxdump.

After that, you just format the input as shown in the Struo docs.

HitMonk commented 4 years ago

Yeah, merging the taxonomies is the hard part, and I haven't come across anything that can do it. GTDB is a fantastic database, but I think it's held back a little since it is limited to just Bacteria and Archaea. A script to merge taxonomies would greatly improve microbiome classifications. I do program a bit, but I am completely lost on manipulating taxdump files and making sure all taxonomies (GTDB and NCBI) are in the same format. If you would be willing to write a script, it would be really helpful. If not, could you please point me to any reference on how to manipulate taxdump files, if such a resource even exists?

nick-youngblut commented 4 years ago

Actually, I forgot that gtdb_to_taxdump already merges taxdump files. This is necessary for the GTDB, since the archaea and bacteria taxonomies are separate. You could just provide the GTDB-bacteria, GTDB-archaea, and NCBI-virus taxonomy files, and the script should merge all of them. You will need to get the taxonomies for each virus genome and format them in the same way as used for GTDB. For example:

RS_GCF_000979745.1  d__Archaea;p__Halobacterota;c__Methanosarcinia;o__Methanosarcinales;f__Methanosarcinaceae;g__Methanosarcina;s__Methanosarcina mazei
RS_GCF_000980175.1  d__Archaea;p__Halobacterota;c__Methanosarcinia;o__Methanosarcinales;f__Methanosarcinaceae;g__Methanosarcina;s__Methanosarcina mazei
RS_GCF_001647085.1  d__Archaea;p__Euryarchaeota;c__Thermococci;o__Thermococcales;f__Thermococcaceae;g__Thermococcus;s__Thermococcus piezophilus
GB_GCA_002838935.1  d__Archaea;p__Thermoplasmatota;c__E2;o__UBA9212;f__GCA-002838935;g__GCA-002838935;s__GCA-002838935 sp002838935

Note that the taxIDs will no longer match ANYTHING in the NCBI taxonomy! gtdb_to_taxdump re-numbers all of the taxIDs.
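To illustrate why merged taxIDs stop matching NCBI, here is a minimal Python sketch of merging GTDB-style lineages into a single tree with freshly assigned taxIDs. It is not the gtdb_to_taxdump implementation; the function name `build_taxdump` and the viral lineage (with its `Example...` names) are placeholders made up for the illustration.

```python
from itertools import count


def build_taxdump(lineages):
    """Assign fresh taxIDs to every distinct taxon across all lineages.

    lineages: iterable of GTDB-style strings ("d__...;p__...;...;s__...").
    Returns (nodes, names):
      nodes maps taxID -> (parent taxID, rank prefix such as "d" or "s")
      names maps taxID -> taxon name
    """
    next_id = count(2)              # taxID 1 is kept for the root
    nodes = {1: (1, "root")}        # the root is its own parent
    names = {1: "root"}
    seen = {}                       # (parent taxID, taxon) -> taxID
    for lineage in lineages:
        parent = 1
        for taxon in lineage.split(";"):
            key = (parent, taxon)
            if key not in seen:
                taxid = next(next_id)
                seen[key] = taxid
                nodes[taxid] = (parent, taxon.split("__", 1)[0])
                names[taxid] = taxon
            parent = seen[key]
    return nodes, names


lineages = [
    "d__Archaea;p__Halobacterota;c__Methanosarcinia;o__Methanosarcinales;"
    "f__Methanosarcinaceae;g__Methanosarcina;s__Methanosarcina mazei",
    "d__Archaea;p__Euryarchaeota;c__Thermococci;o__Thermococcales;"
    "f__Thermococcaceae;g__Thermococcus;s__Thermococcus piezophilus",
    # Hypothetical viral lineage, formatted the same way as the GTDB entries
    "d__Viruses;p__ExamplePhylum;c__ExampleClass;o__ExampleOrder;"
    "f__ExampleFamily;g__ExampleGenus;s__Example virus 1",
]

nodes, names = build_taxdump(lineages)
for taxid in sorted(names):
    parent, rank = nodes[taxid]
    print(taxid, parent, rank, names[taxid], sep="\t")
```

Because the IDs are handed out in order of first appearance, they depend only on the input lineages and bear no relation to the taxIDs NCBI assigns to the same organisms.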

HitMonk commented 4 years ago

Thank you! Let me try this and get back to you with the results!

HitMonk commented 4 years ago

Apologies for reopening this thread, but I have some questions. I downloaded the GTDB databases using the following:

Rscript /exports/watson/Prateek/apps/Struo/util_scripts/GTDB_metadata_filter.R -o gtdb-r89_bac-arc.tsv https://data.ace.uq.edu.au/public/gtdb/data/releases/release89/89.0/bac120_metadata_r89.tsv https://data.ace.uq.edu.au/public/gtdb/data/releases/release89/89.0/ar122_metadata_r89.tsv

Rscript /exports/watson/Prateek/apps/Struo/util_scripts/genome_download.R -o genomes -p 8 gtdb-r89_bac-arc.tsv > genomes.txt

I then downloaded the Kraken taxonomy and library files using:

# Download taxonomy and genomes for kraken2
kraken2-build --download-taxonomy --db kraken+gtdb

# Download libraries to use with k2
kraken2-build --download-library viral --db kraken+gtdb
kraken2-build --download-library fungi --db kraken+gtdb
kraken2-build --download-library protozoa --db kraken+gtdb

However, I'm not sure how to get the taxonomies in the format you mentioned. Is there a place I can download them from? The taxonomies downloaded by kraken2 are not in that format. The taxdump file itself consists of a bunch of citations for the organisms. So I'm not sure how to go about merging the genomes from kraken2 with GTDB, or where I can download taxonomies in the format that you mentioned.

nick-youngblut commented 4 years ago

You'd have to get the viral taxonomy in the same format as GTDB (see the GTDB metadata). I'm not sure if kraken2 includes taxonomy in that format. If it's just in taxdump format, you could reconfigure the gtdb_to_taxdump.py script to output a tab-delim taxonomy, but that would take a bit of work. You could also take the NCBI taxIDs from the kraken2 database and get the entire taxonomy via taxonkit.
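As a rough sketch of the last step, the Python below adds GTDB-style rank prefixes to a lineage that has already been resolved to seven ranks (for example, with taxonkit). It assumes a tab-delimited input line of the form accession, then a semicolon-separated lineage running from domain to species, with empty fields for missing ranks. The accession and taxon names are placeholders, and the function names are made up for this example.

```python
# Rank prefixes in the order used by GTDB taxonomy strings
RANK_PREFIXES = ["d", "p", "c", "o", "f", "g", "s"]


def to_gtdb_style(ranks):
    """Join seven rank names (domain..species) into 'd__X;p__Y;...;s__Z'.

    A missing rank is passed as an empty string and yields a bare
    prefix such as 'p__'.
    """
    if len(ranks) != len(RANK_PREFIXES):
        raise ValueError(f"expected {len(RANK_PREFIXES)} ranks, got {len(ranks)}")
    return ";".join(
        f"{prefix}__{name.strip()}" for prefix, name in zip(RANK_PREFIXES, ranks)
    )


def convert_line(line):
    """Convert 'accession<TAB>name;name;...' to 'accession<TAB>d__name;...'."""
    accession, lineage = line.rstrip("\n").split("\t")
    return f"{accession}\t{to_gtdb_style(lineage.split(';'))}"


# Placeholder record: the accession and taxon names are invented
example = "GCF_EXAMPLE.1\tViruses;;;ExampleOrder;ExampleFamily;ExampleGenus;Example virus 1"
print(convert_line(example))
```

Viral lineages in NCBI often lack some of the seven ranks, so it is worth deciding up front how empty ranks should be represented before merging them with the GTDB entries.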

HitMonk commented 4 years ago

Regarding the kraken2 taxdump, below is how the first few lines look. It doesn't look anything like the format required by GTDB:

7 | Equine herpesvirus | 0 | 819656 | | | |
8 | Yabuuchi E et al. (1990) | 0 | 2111872 | | Yabuuchi, E., Yano, I., Oyaizu, H., Hashimoto, Y., Ezaki, T., and Yamamoto, H. \"Proposals of Sphingomonas paucimobilis gen. nov. and comb. nov., Sphingomonas parapaucimobilis sp. nov., Sphingomonas yanoikuyae sp. nov., Sphingomonas adhaesiva sp. nov., Sphingomonas capsulata ``comb. nov., and two genospecies of the genus Sphingomonas.\" Microbiol. Immunol. (1990) 34:99-119. | 13687 13688 13689 13690 28212 28213 |
9 | Dennis PJ et al. (1993) | 0 | 8494743 | | Dennis, P.J., Brenner, D.J., Thacker, W.L., Wait, R., Vesey, G., Steigerwalt, A.G., and Benson, R.F. \"Five new Legionella species isolated from water.\" Int. J. Syst. Bacteriol. (1993) 43:329-337. | 45065 45068 45070 45072 45076 |

I will try taxonkit as you mentioned and pull the taxonomy directly from NCBI.
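For context, the lines shown above look like they come from citations.dmp, one of several files in an NCBI taxdump. The lineage information lives in nodes.dmp (taxID, parent taxID, rank) and names.dmp (taxID, name, unique name, name class). The Python sketch below reads a toy version of those two files and walks from a taxID up to the root. All taxIDs and taxon names in it are invented for illustration, and it is meant as a way to see the structure rather than a replacement for taxonkit.

```python
def parse_dmp(lines):
    """Yield the fields of each taxdump line (fields are split by TAB|TAB)."""
    for line in lines:
        line = line.rstrip("\n")
        if line.endswith("\t|"):
            line = line[:-2]        # drop the trailing field terminator
        yield line.split("\t|\t")


# Toy nodes.dmp: tax_id | parent tax_id | rank (real files have more columns)
NODES_DMP = [
    "1\t|\t1\t|\tno rank\t|\n",
    "10\t|\t1\t|\tsuperkingdom\t|\n",
    "20\t|\t10\t|\tfamily\t|\n",
    "30\t|\t20\t|\tgenus\t|\n",
    "40\t|\t30\t|\tspecies\t|\n",
]

# Toy names.dmp: tax_id | name | unique name | name class
NAMES_DMP = [
    "1\t|\troot\t|\t\t|\tscientific name\t|\n",
    "10\t|\tViruses\t|\t\t|\tscientific name\t|\n",
    "20\t|\tExampleFamily\t|\t\t|\tscientific name\t|\n",
    "30\t|\tExampleGenus\t|\t\t|\tscientific name\t|\n",
    "40\t|\tExample virus 1\t|\t\t|\tscientific name\t|\n",
    "40\t|\tEV-1\t|\t\t|\tsynonym\t|\n",
]

nodes = {f[0]: (f[1], f[2]) for f in parse_dmp(NODES_DMP)}
# Keep only the scientific name for each taxID
names = {f[0]: f[1] for f in parse_dmp(NAMES_DMP) if f[3] == "scientific name"}


def lineage(taxid, nodes, names):
    """Return [(rank, name), ...] from the top-level taxon down to taxid."""
    path = []
    while True:
        parent, rank = nodes[taxid]
        if parent == taxid:         # the root node is its own parent
            break
        path.append((rank, names[taxid]))
        taxid = parent
    return path[::-1]


for rank, name in lineage("40", nodes, names):
    print(rank, name, sep="\t")
```

With real files, the same dictionaries could be filled by iterating over the open nodes.dmp and names.dmp file handles instead of the toy lists.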

Also, another rather strange thing happened that I forgot to mention. While I was downloading the files for GTDB, folders for the bacterial and archaeal genomes were created, but no fasta files were downloaded. I also got an error message:

Reading table: gtdb-r89_bac-arc.tsv
Number of rows: 24065
Number of rows after filtering: 23452
Writing accessions to: /exports/watson/Prateek/dbs/kraken2_struo/genomes/accession.txt
Running cmd: ncbi-genome-download -F fasta -o /exports/watson/Prateek/dbs/kraken2_struo/genomes -p 8 -r 3 -A /exports/watson/Prateek/dbs/kraken2_struo/genomes/accession.txt -s genbank "archaea,bacteria"
ERROR: Download from NCBI failed: ConnectionError(MaxRetryError("HTTPSConnectionPool(host='ftp.ncbi.nlm.nih.gov', port=443): Max retries exceeded with url: /genomes/all/GCA/000/477/555/GCA_000477555.1_LeRu1.0/md5checksums.txt (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f1a8e8d6710>: Failed to establish a new connection: [Errno 101] Network is unreachable',))",),)
ERROR: Downloading from NCBI failed due to a connection error, retrying. Retries so far: 1
ERROR: Download from NCBI failed: ConnectionError(MaxRetryError("HTTPSConnectionPool(host='ftp.ncbi.nlm.nih.gov', port=443): Max retries exceeded with url: /genomes/all/GCA/001/579/665/GCA_001579665.1_ASM157966v1/md5checksums.txt (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f1a8da0ba20>: Failed to establish a new connection: [Errno 101] Network is unreachable',))",),)
ERROR: Downloading from NCBI failed due to a connection error, retrying. Retries so far: 2
ERROR: Download from NCBI failed: ConnectionError(ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')),)
ERROR: Downloading from NCBI failed due to a connection error, retrying. Retries so far: 3
ERROR: Download from NCBI failed: ConnectionError(MaxRetryError("HTTPSConnectionPool(host='ftp.ncbi.nlm.nih.gov', port=443): Max retries exceeded with url: /genomes/all/GCA/000/144/915/GCA_000144915.1_ASM14491v1/md5checksums.txt (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f1a8c57b7f0>: Failed to establish a new connection: [Errno 101] Network is unreachable',))",),)
Number of fasta files found: 0
Adding file paths to the input table
Warning message:
Column `ncbi_genbank_assembly_accession`/`accession` joining factor and character vector, coercing into character vector
Number of rows in the output: 23452
Number of rows with missing file paths: 23452

Now, this isn't much of an issue, as I'm sure I can download all the genomes directly from the GTDB website. I'm just uncertain what went wrong and whether I can do anything about it.

Thank you for your assistance, I will get back to you with updates using taxonkit.

nick-youngblut commented 4 years ago

> Regarding the kraken2 taxdump, below is how the first few lines look. It doesn't look anything like the format required by GTDB

A taxdump is not the format that you need, but taxonkit should help.

> Also, another rather strange thing happened that I forgot to mention. While I was downloading the files for GTDB, folders for the bacterial and archaeal genomes were created, but no fasta files were downloaded. I also got an error message:

This sounds like a connection issue. Try again at a different time and hopefully it will work.
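If retrying by hand gets tedious, the waiting can be automated. Below is a minimal Python sketch of retrying with exponential backoff. The function `retry` and the stand-in `flaky_download` are made up for this example and are not part of Struo or ncbi-genome-download; the delays are arbitrary starting values.

```python
import time


def retry(func, attempts=4, base_delay=60.0, sleep=time.sleep):
    """Call func() until it succeeds, doubling the wait after each failure.

    Only OSError (which includes ConnectionError) is retried; the last
    failure is re-raised once the attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except OSError as err:
            if attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({err}); retrying in {delay:.0f} s")
            sleep(delay)


# Stand-in for a download step: fails twice, then succeeds
calls = {"n": 0}


def flaky_download():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("Network is unreachable")
    return "ok"


waits = []                                   # record the delays instead of sleeping
result = retry(flaky_download, sleep=waits.append)
print(result, waits)
```

In real use, `func` would wrap whatever performs the download, and the default `sleep=time.sleep` would actually pause between attempts.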

HitMonk commented 4 years ago

Ah, that makes sense. Thank you.

andrewjmc commented 4 years ago

How did you get on, @HitMonk? I'm looking to build a human + fungi + GTDB bacterial/archaeal kraken2 database. I'm kind of hoping you'll have got it working and have it down to a simple set of steps!

HitMonk commented 4 years ago

Hi @andrewjmc, I should have a complete step-by-step walkthrough in about 8-10 days... I know of at least one other person who is also working on this, and one way or another you should have steps to follow within the next couple of weeks. Hopefully that is acceptable :)

andrewjmc commented 4 years ago

That will be perfect, thanks @HitMonk

andrewjmc commented 4 years ago

No pressure @HitMonk but would be great to know how you've got on!

HitMonk commented 4 years ago

Hey @andrewjmc, apologies for the late reply. I think I got it to work. I just want to make sure all the taxonomies are correct, and I need to write up a log of how to do it. I can send you a short write-up in a few days, if that's fine.

andrewjmc commented 4 years ago

You have no obligation to me! Whenever you have some advice/walkthrough I will be very grateful.

andrewjmc commented 3 years ago

I've come back to this, not only for making a database for kraken2, but also for ganon (attracted by the ability to progressively augment the database). Rather than Struo, I'm trying https://github.com/rrwick/Metagenomics-Index-Correction for producing taxdump files from GTDB. I've tried taxonkit for producing lineages. It gets me part of the way there, but the taxonomies do not include the rank specification (i.e. k__Kingdom;...). Once I can get this I should be sorted. Any ideas? @HitMonk did you solve it?

HitMonk commented 3 years ago

Hey there @andrewjmc, I think I got it mostly working, but I had to abandon it because it doesn't work for eukaryotic genomes (I was using algae). I have attached a couple of files: one is a list of the taxonkit commands, and the other is an R script to merge the files into the sample description files. These should generate all the required files for Struo. I think there may be a couple of steps that I'm forgetting. Basically, you make a sample description file for each lineage, then use the R script to merge them all. Do let me know if something doesn't make sense. I was planning on making a proper post with descriptions sometime soon (I hope I get there ><). BTW, I'm also trying another method that handles the eukaryotic genomes too, the Metagenomics-Index-Correction tool. I will update if that works.

commands.txt conv_script_R.txt
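The merge step described above can be sketched in a few lines. The Python below is not the attached R script. It is an illustration of combining per-lineage, tab-delimited sample tables into one, skipping duplicate accessions. The column names (`name`, `accession`, `fasta_file_path`, `taxonomy`) and all rows are placeholders, so check the Struo docs for the columns it actually expects.

```python
import csv
import io


def merge_tables(handles, key="accession"):
    """Concatenate tab-delimited tables that share a header.

    Rows whose key value was already seen are skipped.
    Returns (header, rows), with rows as a list of dicts.
    """
    header, rows, seen = None, [], set()
    for handle in handles:
        reader = csv.DictReader(handle, delimiter="\t")
        if header is None:
            header = reader.fieldnames
        elif reader.fieldnames != header:
            raise ValueError(f"header mismatch: {reader.fieldnames} != {header}")
        for row in reader:
            if row[key] in seen:
                continue
            seen.add(row[key])
            rows.append(row)
    return header, rows


# Placeholder tables standing in for per-lineage sample description files
prokaryotes = io.StringIO(
    "name\taccession\tfasta_file_path\ttaxonomy\n"
    "Example archaeon\tACC_0001\tgenomes/ACC_0001.fna\td__Archaea;p__ExamplePhylum\n"
    "Example bacterium\tACC_0002\tgenomes/ACC_0002.fna\td__Bacteria;p__ExamplePhylum\n"
)
viruses = io.StringIO(
    "name\taccession\tfasta_file_path\ttaxonomy\n"
    "Example virus 1\tACC_0003\tgenomes/ACC_0003.fna\td__Viruses;p__ExamplePhylum\n"
    "Example bacterium\tACC_0002\tgenomes/ACC_0002.fna\td__Bacteria;p__ExamplePhylum\n"
)

header, rows = merge_tables([prokaryotes, viruses])

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=header, delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
print(out.getvalue(), end="")
```

Checking that every table has the same header before merging catches the most common problem early, which is a per-lineage file that was built with a different column order.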

andrewjmc commented 3 years ago

Really useful to see how you put the ranks into the lineages! Thanks

leylabmpi / Struo: Merging GTDB and viral refseq sequences #4