Databases-related inquiry

vrmarcelino / CCMetagen

Microbiome classification pipeline

GNU General Public License v3.0


Databases-related inquiry #22

Closed: Ahmed-Shibl closed this issue 3 years ago

Ahmed-Shibl commented 3 years ago

Hello CCMetagen developers, I'm looking forward to using this software very soon. I wanted to ask if there is a way to get a more up-to-date version of the NCBI nr/nt database. I have metagenomic datasets with coral, Symbiodinium, and bacterial reads, and I know that coral and Symbiodinium genomes were deposited very recently (September 2020). Thanks

vrmarcelino commented 3 years ago

Hi!

I am hoping to produce and upload an updated NCBI nt database next year, but for now, I can help you build your own database. That way you can also make sure the genomes you want are included (not all genome sequences end up in nt).

Building your own database is fairly straightforward. First, download an updated version of the NCBI nt from their FTP site. Then follow these instructions to rename the sequence headers: https://github.com/vrmarcelino/CCMetagen/tree/master/benchmarking/rename_nt

Then index the database with KMA: kma index -i <your_renamed_database.fna> -o <updated_nt_db> -NI -Sparse TG

And that is done. Let us know if you run into issues.
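
For anyone scripting the indexing step, here is a minimal sketch that assembles the same `kma index` call from Python. The flags are the ones quoted above; the file names `nt_renamed.fna` and `updated_nt_db` and the helper names are placeholders made up for illustration, and the download and header-renaming steps are assumed to have been done already.

```python
import shlex
import shutil
import subprocess


def kma_index_command(renamed_fasta, output_prefix):
    """Return the argument list for the kma index call quoted in this thread."""
    return [
        "kma", "index",
        "-i", str(renamed_fasta),   # FASTA with renamed sequence headers
        "-o", str(output_prefix),   # prefix for the index files KMA writes
        "-NI",                      # flags recommended in this thread
        "-Sparse", "TG",            # for large databases such as nt
    ]


def run_kma_index(renamed_fasta, output_prefix, dry_run=True):
    """Print the command; run it only when dry_run is False."""
    cmd = kma_index_command(renamed_fasta, output_prefix)
    print(shlex.join(cmd))
    if dry_run:
        return cmd
    if shutil.which("kma") is None:
        raise FileNotFoundError("kma was not found on PATH")
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmd


if __name__ == "__main__":
    # Hypothetical file names; replace them with your own paths.
    run_kma_index("nt_renamed.fna", "updated_nt_db")
```

The dry run is the default so the command can be checked before committing to a job that may run for days.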

rafaelmguedes commented 3 years ago

Hi Devs, would you know approximately how much RAM is required to index a recent nt database with KMA?

Thanks!

vrmarcelino commented 3 years ago

Hi Rafael,

If you can, allow it to use 500 GB. I haven't done it in a while, so I can't give you an exact answer. It also varies depending on your KMA version - the latest ones require less memory. If you need to index it with less memory, let us know. Philip - the developer of KMA - will be able to give you more precise info and tips.

rafaelmguedes commented 3 years ago

Dear vrmarcelino,

Thank you for your prompt reply. Based on the 500 GB estimate, I used the r5.24xlarge AWS instance with 748 GB of RAM to index a January 2021 nt with 65,715,181 sequences (339 GB FASTA). Unfortunately, it wasn’t enough: the process died at sequence 40,838,840 after ~48 hrs. Assuming RAM use scales linearly with the number of sequences, about 1200 GB would be needed to complete the indexing. I used kma v1.3.9 (kma index -i /DB/nt_20210105.fasta -o /DB/nt_20210105). Do you have a more precise estimate of when this year the new version mentioned earlier will be available for download?

Thanks again.
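
The extrapolation above can be reproduced with a few lines of Python. This is only a rough sketch: it assumes, as the comment does, that memory grows linearly with the number of sequences indexed, and additionally that the full 748 GB was in use when the process died. The function name is made up for illustration.

```python
def extrapolate_ram_gb(ram_used_gb, seqs_processed, seqs_total):
    """Scale the RAM in use at the point of failure up to the whole database."""
    if seqs_processed <= 0:
        raise ValueError("seqs_processed must be positive")
    return ram_used_gb * seqs_total / seqs_processed


# Figures reported in the comment above.
estimate = extrapolate_ram_gb(
    ram_used_gb=748,
    seqs_processed=40_838_840,
    seqs_total=65_715_181,
)
print(f"Estimated RAM for the full index: about {estimate:.0f} GB")
```

The result lands close to the 1200 GB quoted above. It describes the run as reported, which did not use the -NI -Sparse TG flags.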

vrmarcelino commented 3 years ago

Hi Rafael,

Sorry for the late reply. For these large databases you need to add the flags -NI -Sparse TG, which substantially reduce the time and memory requirements. Could you give it a try with these flags and let us know if it works? No specific date for releasing the latest database yet, sorry.