Closed by Ahmed-Shibl 3 years ago
Hi!
I am hoping to produce and upload an updated NCBI nt database next year, but for now, I can help you build your own database. That way you can also make sure the genomes you need are included (not all genome sequences end up in nt).
Building your own database is fairly straightforward. First, download an up-to-date copy of NCBI nt from their FTP site. Then follow these instructions to rename the sequence headers: https://github.com/vrmarcelino/CCMetagen/tree/master/benchmarking/rename_nt
Then index the database with KMA:
kma index -i <your_renamed_database.fna> -o <updated_nt_db> -NI -Sparse TG
And that is done. Let us know if you run into issues.
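For anyone who wants the steps above in one place, here is a rough sketch rather than a tested pipeline. The download URL, the file name nt_renamed.fna, and the DRY_RUN switch are my assumptions and are not part of CCMetagen or KMA. The renaming step is not reproduced here, so follow the linked instructions for it. By default the script only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed locations and names; adjust to your setup.
NT_URL="https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz"  # assumed NCBI download path
RENAMED_FASTA="nt_renamed.fna"   # hypothetical output name of the header-renaming step
DB_PREFIX="updated_nt_db"        # output prefix for the KMA index

# Print each command; execute it only when DRY_RUN=0 (hypothetical helper).
run() {
  printf '%s\n' "$*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# 1. Download and decompress nt (several hundred GB once unpacked).
run wget -c "$NT_URL"
run gunzip nt.gz

# 2. Rename the sequence headers by following the rename_nt instructions
#    linked above, producing $RENAMED_FASTA (not reproduced here).

# 3. Index the renamed database with KMA, using the flags from this thread.
run kma index -i "$RENAMED_FASTA" -o "$DB_PREFIX" -NI -Sparse TG
```

Running it once with the default dry run is a cheap way to check the paths before committing to a multi-day indexing job.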
Hi Devs, would you know an approximate amount of RAM required for kma indexing a recent nt database?
Thanks!
Hi Rafael,
If you can, allow it to use 500 GB. I haven't done it in a while, so I can't give you an exact figure. It also varies with your KMA version: the latest ones require less memory. If you need to index it with less memory, let us know. Philip, the developer of KMA, will be able to give you more precise info and tips.
Dear vrmarcelino,
Thank you for your prompt reply. Based on the 500 GB estimate, I used an r5.24xlarge AWS instance with 748 GB of RAM to index a January 2021 nt release with 65,715,181 sequences (339 GB FASTA). Unfortunately, it wasn't enough: the process died at sequence 40,838,840 after ~48 hrs. Assuming RAM use scales linearly with the number of sequences, about 1,200 GB would be needed to complete the indexing. I used kma v1.3.9 (kma index -i /DB/nt_20210105.fasta -o /DB/nt_20210105). Do you have a more precise estimate of when this year the previously mentioned new version will be available for download?
Thanks again.
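The back-of-the-envelope estimate above can be reproduced with a one-liner. This is only a sketch of the linear extrapolation described in the post: it assumes memory grows in proportion to the number of sequences indexed, which may not hold for KMA, and the function name estimate_ram_gb is made up for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Linear extrapolation: RAM in use when the job died, scaled by
# (total sequences / sequences processed). Hypothetical helper.
estimate_ram_gb() {
  awk -v used="$1" -v processed="$2" -v total="$3" \
    'BEGIN { printf "%.0f\n", used * total / processed }'
}

# 748 GB exhausted at sequence 40,838,840 of 65,715,181.
estimate_ram_gb 748 40838840 65715181   # prints 1204
```

That matches the "about 1,200 GB" figure, but treat it as an upper-level guess rather than a requirement, since index options change the memory profile.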
Hi Rafael,
Sorry for the late reply. For databases this large you need to add the flags -NI -Sparse TG, which will substantially reduce the time and memory requirements.
Could you give it a try with these flags and let us know if it works?
No specific dates for launching the latest database yet, sorry.
Hello CCMetagen developers, I'm looking forward to using this software very soon. I wanted to ask if there is a way to get a more up-to-date version of the NCBI nr/nt database. I have metagenomic datasets with coral, Symbiodinium, and bacterial reads, and I know that coral and Symbiodinium genomes were deposited very recently (September 2020). Thanks