Building nt database - no multithreading at centrifuge-build step

koppk commented 10 months ago

First, thanks a lot for your detailed description of how to build the nt database using Centrifuge (more detailed than Centrifuge's own docs).

However, I cannot get the last step to work. I have nt.fna, nt.map, the dusted nt fna, etc. But even on a Scaleway Ubuntu 22.04 instance with 96 cores and 384 GB of RAM, I cannot get centrifuge-build to use more than a single thread.

I tried installing Centrifuge in every way I could think of: the binaries, a build from source, and the generic Ubuntu package (sudo apt-get install centrifuge). I then ran make THREADS=96 nt (which gave me several 1.2 TB nt.* files) or centrifuge-build -p 96 ... directly, but it always takes forever just to start, then finds files with gaps, and uses just one core.

Finally, a question I have seen asked in many forums: do you, or anybody reading this, have a source for reasonably current (2020 or later) nt indices (nt.1.cf, nt.2.cf, nt.3.cf)? I guess many people struggle with this, and not everybody needs only p+h+v; some need other mammalian species as well. I have spent days (and fees) on Scaleway/AWS instances just to fail at the last step. I have used Centrifuge successfully before and really liked it, and I would love to use Recentrifuge afterwards for further analysis, but without a recent nt index that is unfortunately impossible.
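For readers following along, here is a minimal Python sketch of the direct centrifuge-build invocation mentioned above, written out in full. It only assembles and prints the command; it does not run it, and it is not a fix for the single-thread behaviour. The options shown (-p, --conversion-table, --taxonomy-tree, --name-table) are the ones described in the Centrifuge manual. The helper name centrifuge_build_cmd and the taxonomy/nodes.dmp and taxonomy/names.dmp paths are assumptions for illustration, not taken from this thread.

```python
import shlex


def centrifuge_build_cmd(fasta, seqid_map, nodes, names, prefix, threads):
    """Assemble a centrifuge-build argv list; nothing is executed here."""
    return [
        "centrifuge-build",
        "-p", str(threads),               # requested number of threads
        "--conversion-table", seqid_map,  # sequence ID to taxonomy ID map
        "--taxonomy-tree", nodes,         # NCBI nodes.dmp (assumed path)
        "--name-table", names,            # NCBI names.dmp (assumed path)
        fasta,                            # input sequences
        prefix,                           # basename of the output index files
    ]


if __name__ == "__main__":
    cmd = centrifuge_build_cmd(
        "nt.fna", "nt.map", "taxonomy/nodes.dmp", "taxonomy/names.dmp", "nt", 96
    )
    # Print a shell-quoted command line for inspection or logging.
    print(shlex.join(cmd))
```

Building the command as a list keeps file names with spaces intact and makes it easy to log exactly what was requested before committing to a multi-day run.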

Any suggestions?

Thanks a lot.

khyox commented 10 months ago

Hi Katharina, thanks for your kind message and for reaching out about this critical issue of having a recent nt DB for Centrifuge. Please hold on. I'll come back with some useful information.

khyox commented 10 months ago

Thanks for your patience while we discussed this internally.

Yes, we have several recent versions of the nt indexed database for Centrifuge. Due to the exponential growth of the nt database, it is becoming more and more computationally challenging to build it. On the hardware side, a supercomputing server is required, with shared memory on the order of a terabyte. On the software side, we needed to introduce some changes due to memory constraints, despite using an HPC server with high memory per node. Even with those resources, you need several days of computing time, with the entire build pipeline easily taking more than a week of elapsed time. We are finishing a pre-print in which we will release our databases, along with the results of our comparisons between them and the detailed pipeline.

The problem is that, independently of the build process, even just classifying samples with Centrifuge against a recent nt database requires a computer with a lot of memory. Given your machine with 384 gigabytes of memory, the most recent indexed nt DB that you would be able to work with is one I built in the fall of 2021. If that would work for you, we are happy to provide that DB to you. We may also explore additional ways to collaborate. If you are interested and would like to discuss further, please let me know a good email address for you, and I will bring other team members involved in these and related research efforts into the conversation. Thanks!
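To illustrate the memory point above, here is a minimal Python sketch that sums the on-disk size of an index's .cf files and compares it with the machine's physical RAM. It assumes that the index has to fit roughly in memory with some headroom. The 1.1 headroom factor, the function names, and the nt prefix are illustrative assumptions, not values prescribed by Centrifuge or given in this thread.

```python
import glob
import os


def index_size_bytes(prefix):
    """Total on-disk size of all files matching <prefix>.*.cf."""
    pattern = glob.escape(prefix) + ".*.cf"
    return sum(os.path.getsize(path) for path in glob.glob(pattern))


def fits_in_memory(index_bytes, ram_bytes, headroom=1.1):
    """True if the index, scaled by an assumed headroom factor, fits in RAM."""
    return index_bytes * headroom <= ram_bytes


def physical_ram_bytes():
    """Physical RAM via sysconf (Linux and most Unix-like systems)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


if __name__ == "__main__":
    size = index_size_bytes("nt")  # looks for nt.1.cf, nt.2.cf, ... in the cwd
    ram = physical_ram_bytes()
    print(f"index: {size / 2**30:.1f} GiB, RAM: {ram / 2**30:.1f} GiB")
    print("fits" if fits_in_memory(size, ram) else "does not fit")
```

This is only a coarse pre-flight check. Actual memory use during classification may differ from the on-disk index size, so treat a borderline result with caution.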

koppk commented 10 months ago

Dear Jose Manuel Marti,

Thanks a lot for your kind reply. I am definitely interested in:

1) your nt database of 2021

2) collaboration on metagenomics sample classification (details in email).

Best email: the one sending this reply.

Looking forward very much to hearing more from you.

Kind regards,

Katharina

Sent: Sunday, 28 January 2024 at 17:50 From: "Jose Manuel Martí" @.> To: "khyox/recentrifuge" @.> Cc: "Katharina Kopp" @.>, "Author" @.> Subject: Re: [khyox/recentrifuge] Building nt database - no multithreading at centrifuge-build step (Issue #51)

khyox commented 10 months ago

Thanks Katharina!

Unfortunately, as you can see above, GitHub masks email addresses in replies sent by email, so I cannot see your address. I guess you'll have to reply through the GitHub website so that your address is not masked, or alternatively just send an email to my address at jse dot mnl at gmail dot com.

As a separate note, judging by your memory limit, it looks like you are using the "Production-Optimized" instance type on Scaleway. If you were to use the Workload-Optimized type, it seems you would be able to have 512 gigabytes of main memory. With that upgrade, you would be able to run Centrifuge with more recent (and improved) versions of the nt DB that I have been building.

khyox commented 5 months ago

FYI, our pre-print accompanying the release of a new Centrifuge nt database is online now: Addressing the dynamic nature of reference data: a new nt database for robust metagenomic classification. Any feedback will be welcome!